Managing Jobs



Job States

The bjobs command displays the current state of the job.

Normal job states

Most jobs enter only three states:

Job state   Description
PEND        Waiting in a queue for scheduling and dispatch
RUN         Dispatched to a host and running
DONE        Finished normally with a zero exit value

Suspended job states

A suspended job is in one of three states:

Job state   Description
PSUSP       Suspended by its owner or the LSF administrator while in PEND state
USUSP       Suspended by its owner or the LSF administrator after being dispatched
SSUSP       Suspended by the LSF system after being dispatched

State transitions

A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram.
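
The transitions described in this section can be collected into a small lookup table. The following Python sketch is an illustration only: it is assembled from the state descriptions in this chapter, the names TRANSITIONS, can_transition, and follow are hypothetical, and the entry marked as an assumption is not stated in this document. It is not a complete or authoritative copy of the LSF state diagram.

```python
# Illustrative model only: transitions collected from the prose of this
# chapter, not an authoritative copy of the LSF state diagram.
TRANSITIONS = {
    "PEND":  {"RUN", "PSUSP", "EXIT"},
    "PSUSP": {"PEND", "EXIT"},   # PSUSP -> PEND on resume is an assumption
    "RUN":   {"DONE", "USUSP", "SSUSP", "EXIT"},
    "USUSP": {"SSUSP", "EXIT"},  # bresume moves a started job to SSUSP first
    "SSUSP": {"RUN", "EXIT"},    # LSF resumes the job when load allows
    "DONE":  set(),
    "EXIT":  set(),
}


def can_transition(current, new):
    """Return True if the table allows moving from current to new."""
    return new in TRANSITIONS.get(current, set())


def follow(path):
    """Return True if every consecutive pair of states in path is allowed."""
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))


print(follow(["PEND", "RUN", "DONE"]))           # True: a normal job
print(follow(["PEND", "RUN", "USUSP", "RUN"]))   # False: must pass through SSUSP
```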

Viewing running jobs

Use the bjobs -r command to display running jobs.

Viewing done jobs

Use the bjobs -d command to display recently completed jobs.

Pending jobs

A job remains pending until all conditions for its execution are met. Some of the conditions are:

Viewing pending reasons

Use the bjobs -p command to display the reason why a job is pending.

Suspended jobs

A job can be suspended at any time. A job can be suspended by its owner, by the LSF administrator, by the root user (superuser), or by LSF.

After a job has been dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.

If the load on the execution host or hosts becomes too high, batch jobs could be interfering among themselves or could be interfering with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.

LSF suspends jobs according to the priority of the job's queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise.

Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.

A system-suspended job can later be resumed by LSF if the load condition on the execution hosts falls low enough or when the closed run window of the queue opens again.
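
The load-based suspension rule can be sketched as a small selection function. This is a simplified model under stated assumptions, not the scheduler's actual logic: the function name pick_job_to_suspend and its data layout are hypothetical, every load index is treated as "higher means more loaded" (which does not hold for indices such as available memory), and a larger queue priority number is taken to mean a higher priority.

```python
def pick_job_to_suspend(load, stop_thresholds, running_jobs):
    """Return the ID of the job to suspend, or None if nothing is exceeded.

    load            -- current load indices on the host, e.g. {"r1m": 2.5}
    stop_thresholds -- suspending thresholds for the same index names
    running_jobs    -- list of (job_id, queue_priority) tuples
    """
    # Assumption: an index is "beyond" its threshold when it is greater.
    exceeded = [name for name, limit in stop_thresholds.items()
                if name in load and load[name] > limit]
    if not exceeded or not running_jobs:
        return None
    # Assumption: the smallest priority number is the lowest priority.
    return min(running_jobs, key=lambda job: job[1])[0]


jobs = [(1234, 30), (1004, 35)]
print(pick_job_to_suspend({"r1m": 2.5}, {"r1m": 2.0}, jobs))  # 1234
print(pick_job_to_suspend({"r1m": 1.0}, {"r1m": 2.0}, jobs))  # None
```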

Viewing suspension reasons

Use the bjobs -s command to display the reason why a job was suspended.

WAIT state (chunk jobs)

If you have configured chunk job queues, members of a chunk job that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers, even though the entire chunk job has been dispatched and occupies a job slot. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.

You can switch (bswitch) or migrate (bmig) a chunk job member in WAIT state to another queue.

Viewing wait status and wait reason

Use the bhist -l command to display jobs in WAIT status. Jobs are shown as Waiting ...

The bjobs -l command does not display a WAIT reason in the list of pending jobs.

See Chunk Job Dispatch for more information about chunk jobs.

Exited jobs

A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:

The job exits with a non-zero exit status.

You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Handling Host-level Job Exceptions for more information.

Post-execution states

Some jobs may not be considered complete until some post-job processing is performed. For example, a job may need to exit from a post-execution job script, clean up job files, or transfer job output after the job completes.

The DONE or EXIT job states do not indicate whether post-processing is complete, so jobs that depend on post-processing may start prematurely. Use the post_done and post_err keywords on the bsub -w command to specify job dependency conditions for job post-processing. The corresponding job states POST_DONE and POST_ERR indicate the state of the post-processing.

After the job completes, you cannot perform any job control on the post-processing. Post-processing exit codes are not reported to LSF. The post-processing of a repetitive job cannot be longer than the repetition period.

Viewing post-execution states

Use the bhist command to display the POST_DONE and POST_ERR states. The resource usage of post-processing is not included in the job resource usage.

See Pre-Execution and Post-Execution Commands for more information.


Viewing Job Information

The bjobs command is used to display job information. By default, bjobs displays information for the user who invoked the command. For more information about bjobs, see the LSF Reference and the bjobs(1) man page.

Viewing all jobs for all users

Run bjobs -u all to display all jobs for all users. Job information is displayed in the following order:

  1. Running jobs
  2. Pending jobs in the order in which they will be scheduled
  3. Jobs in high priority queues are listed before those in lower priority queues

For example:

% bjobs -u all
JOBID   USER    STAT    QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1004    user1   RUN     short     hostA       hostA       job0       Dec 16 09:23
1235    user3   PEND    priority  hostM                   job1       Dec 11 13:55
1234    user2   SSUSP   normal    hostD       hostM       job3       Dec 11 10:09
1250    user1   PEND    short     hostA                   job4       Dec 11 13:59
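
For scripting, the default bjobs listing can be split into fields. The following Python sketch is one possible approach, not an LSF-provided API: the function name parse_bjobs is hypothetical, the sample text is copied from the example above, and the parser assumes that job names contain no spaces and that SUBMIT_TIME always has three parts (month, day, time).

```python
SAMPLE = """\
JOBID   USER    STAT    QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1004    user1   RUN     short     hostA       hostA       job0       Dec 16 09:23
1235    user3   PEND    priority  hostM                   job1       Dec 11 13:55
1234    user2   SSUSP   normal    hostD       hostM       job3       Dec 11 10:09
1250    user1   PEND    short     hostA                   job4       Dec 11 13:59
"""


def parse_bjobs(text):
    """Turn default bjobs output into a list of dictionaries."""
    jobs = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the header and anything too short to be a job row.
        if len(tokens) < 9 or tokens[0] == "JOBID":
            continue
        jobs.append({
            "jobid": tokens[0],
            "user": tokens[1],
            "stat": tokens[2],
            "queue": tokens[3],
            "from_host": tokens[4],
            # EXEC_HOST is blank for jobs that have not been dispatched.
            "exec_host": tokens[5] if len(tokens) > 9 else "",
            "job_name": tokens[-4],
            "submit_time": " ".join(tokens[-3:]),
        })
    return jobs


pending = [job["jobid"] for job in parse_bjobs(SAMPLE) if job["stat"] == "PEND"]
print(pending)  # ['1235', '1250']
```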

Viewing jobs for specific users

Run bjobs -u user_name to display jobs for a specific user. For example:

% bjobs -u user1
JOBID   USER    STAT    QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
2225    user1   USUSP   normal    hostA                   job1       Nov 16 11:55
2226    user1   PSUSP   normal    hostA                   job2       Nov 16 12:30
2227    user1   PSUSP   normal    hostA                   job3       Nov 16 12:31

Viewing exception status for jobs (bjobs)

Use bjobs to display job exceptions. bjobs -l shows exception information for unfinished jobs, and bjobs -x -l shows finished as well as unfinished jobs.

For example, the following bjobs command shows that job 2 is running longer than the configured JOB_OVERRUN threshold, and is consuming no CPU time. bjobs displays the job idle factor, and both job overrun and job idle exceptions. Job 1 finished before the configured JOB_UNDERRUN threshold, so bjobs shows exception status of underrun:

% bjobs -x -l -a
Job <2>, User <user1>, Project <default>, Status <RUN>, Queue <normal>, Command 
                     <sleep 600>
Wed Aug 13 14:23:35: Submitted from host <hostA>, CWD <$HOME>,
                     Output File </dev/null>, Specified Hosts <hostB>;
Wed Aug 13 14:23:43: Started on <hostB>, Execution Home </home/user1>, Execution
                     CWD </home/user1>;
Resource usage collected.
                     IDLE_FACTOR(cputime/runtime):   0.00
                     MEM: 3 Mbytes;  SWAP: 4 Mbytes;  NTHREAD: 3
                     PGID: 5027;  PIDs: 5027 5028 5029 

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 EXCEPTION STATUS:  overrun  idle 
------------------------------------------------------------------------------

Job <1>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Command
                     <sleep 20>
Wed Aug 13 14:18:00: Submitted from host <hostA>, CWD <$HOME>,
                     Output File </dev/null>, Specified Hosts <hostB>;
Wed Aug 13 14:18:10: Started on <hostB>, Execution Home </home/user1>, Execution
                     CWD </home/user1>;
Wed Aug 13 14:18:50: Done successfully. The CPU time used is 0.2 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 EXCEPTION STATUS:  underrun 

Use bacct -l -x to trace the history of job exceptions.
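
To collect exception status from long-format output in a script, a line-oriented scan is usually enough. The following Python sketch is an illustration under assumptions: the function name exceptions_by_job is hypothetical, the sample text is a shortened copy of the output above, and the parser relies only on the "Job <n>" and "EXCEPTION STATUS:" markers shown in this document.

```python
import re

SAMPLE = """\
Job <2>, User <user1>, Project <default>, Status <RUN>, Queue <normal>
 EXCEPTION STATUS:  overrun  idle
------------------------------------------------------------------------------
Job <1>, User <user1>, Project <default>, Status <DONE>, Queue <normal>
 EXCEPTION STATUS:  underrun
"""


def exceptions_by_job(text):
    """Map each job ID to the list of exception names reported for it."""
    result = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"Job <(\d+)>", line)
        if header:
            current = header.group(1)
            result.setdefault(current, [])
            continue
        status = re.search(r"EXCEPTION STATUS:\s*(.*)", line)
        if status and current is not None:
            result[current] = status.group(1).split()
    return result


print(exceptions_by_job(SAMPLE))
# {'2': ['overrun', 'idle'], '1': ['underrun']}
```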


Changing Job Order Within Queues

By default, LSF dispatches jobs in a queue in the order of arrival (that is, first-come-first-served), subject to availability of suitable server hosts.

Use the btop and bbot commands to change the position of pending jobs, or of pending job array elements, to affect the order in which jobs are considered for dispatch. Users can only change the relative position of their own jobs, and LSF administrators can change the position of any user's jobs.

bbot

Moves jobs relative to your last job in the queue.

If invoked by a regular user, bbot moves the selected job after the last job with the same priority submitted by the user to the queue.

If invoked by the LSF administrator, bbot moves the selected job after the last job with the same priority submitted to the queue.

btop

Moves jobs relative to your first job in the queue.

If invoked by a regular user, btop moves the selected job before the first job with the same priority submitted by the user to the queue.

If invoked by the LSF administrator, btop moves the selected job before the first job with the same priority submitted to the queue.

Moving a job to the top of the queue

In the following example, job 5311 is moved to the top of the queue. Since job 5308 is already running, job 5311 is placed in the queue after job 5308.

Note that user1's job is still in the same position in the queue. user2 cannot use btop to get extra jobs at the top of the queue; when one of user2's jobs moves up the queue, the rest of user2's jobs move down.

% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA      hostD      /s500     Oct 23 10:16
5309  user2 PEND  night    hostA                 /s200     Oct 23 11:04
5310  user1 PEND  night    hostB                 /myjob    Oct 23 13:45
5311  user2 PEND  night    hostA                 /s700     Oct 23 18:17

% btop 5311
Job <5311> has been moved to position 1 from top.

% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA      hostD      /s500     Oct 23 10:16
5311  user2 PEND  night    hostA                 /s700     Oct 23 18:17
5310  user1 PEND  night    hostB                 /myjob    Oct 23 13:45
5309  user2 PEND  night    hostA                 /s200     Oct 23 11:04
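
The reordering in this example can be modeled as a shuffle of one user's jobs among the queue positions that user already holds. The following Python sketch is inferred from the example above rather than taken from LSF itself: the names btop, bbot, and _move are hypothetical model functions (not the commands), the queue holds pending jobs only, and all jobs are assumed to have the same priority.

```python
def _move(queue, job_id, to_front, is_admin):
    """Return a new queue; queue is a list of (job_id, user) in dispatch order."""
    index = next(i for i, (jid, _) in enumerate(queue) if jid == job_id)
    job = queue[index]
    if is_admin:
        # Administrator: move relative to every job in the queue.
        rest = queue[:index] + queue[index + 1:]
        return [job] + rest if to_front else rest + [job]
    # Regular user: only the positions held by this user's jobs change.
    slots = [i for i, (_, user) in enumerate(queue) if user == job[1]]
    others = [queue[i] for i in slots if i != index]
    reordered = [job] + others if to_front else others + [job]
    result = list(queue)
    for slot, entry in zip(slots, reordered):
        result[slot] = entry
    return result


def btop(queue, job_id, is_admin=False):
    return _move(queue, job_id, True, is_admin)


def bbot(queue, job_id, is_admin=False):
    return _move(queue, job_id, False, is_admin)


pending = [(5309, "user2"), (5310, "user1"), (5311, "user2")]
print(btop(pending, 5311))
# [(5311, 'user2'), (5310, 'user1'), (5309, 'user2')]
```

In this model user1's job 5310 keeps its position, which matches the listing above.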


Switching Jobs from One Queue to Another

You can use the command bswitch to change jobs from one queue to another. This is useful if you submit a job to the wrong queue, or if the job is suspended because of queue thresholds or run windows and you would like to resume the job.

Switching a single job

Run bswitch to move pending and running jobs from queue to queue.

In the following example, job 5309 is switched to the priority queue:

% bswitch priority 5309
Job <5309> is switched to queue <priority>
% bjobs -u all
JOBID    USER   STAT   QUEUE    FROM_HOST  EXEC_HOST   JOB_NAME   SUBMIT_TIME
5308     user2   RUN   normal   hostA      hostD       /job500    Oct 23 10:16
5309     user2   RUN   priority hostA      hostB       /job200    Oct 23 11:04
5311     user2   PEND  night    hostA                  /job700    Oct 23 18:17
5310     user1   PEND  night    hostB                  /myjob     Oct 23 13:45

Switching all jobs

Run bswitch -q from_queue to_queue 0 to switch all the jobs in a queue to another queue. The example below selects jobs from the night queue and switches them to the idle queue.

The -q option is used to operate on all jobs in a queue. The job ID number 0 specifies that all jobs from the night queue should be switched to the idle queue:

% bswitch -q night idle 0
Job <5308> is switched to queue <idle>
Job <5310> is switched to queue <idle>
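
When bswitch is driven from a script, it can help to build the argument list in one place. The following Python sketch is a convenience wrapper under assumptions, not part of LSF: the function name bswitch_argv is hypothetical, and it uses only the two command forms shown in this section.

```python
def bswitch_argv(to_queue, job_id, from_queue=None):
    """Build the argument list for the bswitch forms shown in this section."""
    argv = ["bswitch"]
    if from_queue is not None:
        # -q restricts the operation to jobs in the named queue.
        argv += ["-q", from_queue]
    argv += [to_queue, str(job_id)]
    return argv


print(" ".join(bswitch_argv("priority", 5309)))
# bswitch priority 5309
print(" ".join(bswitch_argv("idle", 0, from_queue="night")))
# bswitch -q night idle 0
```

On a host where LSF is installed, the returned list could be passed to subprocess.run; the sketch itself does not execute anything.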


Forcing Job Execution

A pending job can be forced to run with the brun command. This operation can only be performed by an LSF administrator.

You can force a job to run on a particular host, force it to run until completion, and apply other restrictions. For more information, see the brun command.

When a job is forced to run, any other constraints associated with the job such as resource requirements or dependency conditions are ignored.

In this situation you may see some job slot limits, such as the maximum number of jobs that can run on a host, being violated. A job that is forced to run cannot be preempted.

Forcing a pending job to run

Run brun -m hostname job_ID to force a pending job to run. You must specify the host on which the job will run. For example, the following command will force the sequential job 104 to run on hostA:

% brun -m hostA 104


Suspending and Resuming Jobs

A job can be suspended by its owner or the LSF administrator. These jobs are considered user-suspended and are displayed by bjobs as USUSP.

If a user suspends a high priority job from a non-preemptive queue, the load may become low enough for LSF to start a lower priority job in its place. The load created by the low priority job can prevent the high priority job from resuming. This can be avoided by configuring preemptive queues.

Suspending a job

Run bstop job_ID. Your job goes into USUSP state if the job is already started, or into PSUSP state if it is pending. For example:

% bstop 3421
Job <3421> is being stopped

suspends job 3421.

UNIX

bstop sends the following signals to the job:

Windows

bstop causes the job to be suspended.

Resuming a job

Run bresume job_ID. For example:

% bresume 3421
Job <3421> is being resumed

resumes job 3421.

Resuming a user-suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
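
The effect of bstop and bresume on the job state can be summarized as two small functions. This Python sketch is a reading aid only: the function names are hypothetical, the PSUSP to PEND step on resume is an assumption that this section does not state explicitly, and any state not covered by the text is returned unchanged.

```python
def state_after_bstop(state):
    """State reported after bstop, per the description in this section."""
    if state == "PEND":
        return "PSUSP"
    if state in ("RUN", "SSUSP"):
        # The job has already been started, so it becomes user-suspended.
        return "USUSP"
    return state


def state_after_bresume(state):
    """State reported immediately after bresume."""
    if state == "USUSP":
        # The job waits in SSUSP until load conditions allow it to run.
        return "SSUSP"
    if state == "PSUSP":
        return "PEND"  # assumption: a pending job returns to PEND
    return state


print(state_after_bstop("RUN"))        # USUSP
print(state_after_bresume("USUSP"))    # SSUSP
```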


Killing Jobs

The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs.

Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from mbatchd to sbatchd. sbatchd waits for the job to exit before reporting the status. Because of these delays, for a short period of time after the bkill command has been issued, bjobs may still report that the job is running.

On Windows, job control messages replace the SIGINT and SIGTERM signals, and termination is implemented by the TerminateProcess() system call.
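
On UNIX, a job can use the interval before SIGKILL only if it handles SIGINT or SIGTERM itself. The following Python sketch shows one common pattern, assuming the job is a Python program that works in discrete steps: the handler only sets a flag, and the main loop checks the flag between steps. The names GracefulExit and run_steps are hypothetical. The example simulates the arrival of SIGTERM by calling the handler directly so that it can run anywhere; a real job would call install() in its main thread instead.

```python
import signal


class GracefulExit:
    """Remember that a termination signal arrived."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Keep the handler minimal; do the cleanup in the main loop.
        self.requested = True

    def install(self):
        # SIGKILL cannot be caught, so only the earlier signals are handled.
        signal.signal(signal.SIGINT, self.handler)
        signal.signal(signal.SIGTERM, self.handler)


def run_steps(steps, stop):
    """Run steps in order until one finishes after a stop was requested."""
    completed = 0
    for step in steps:
        if stop.requested:
            break  # leave time for cleanup before SIGKILL arrives
        step()
        completed += 1
    return completed


stop = GracefulExit()
steps = [
    lambda: None,
    lambda: stop.handler(signal.SIGTERM, None),  # simulated SIGTERM
    lambda: None,                                # never reached
]
print(run_steps(steps, stop))  # 2
```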

Killing a job

Run bkill job_ID:

% bkill 3421
Job <3421> is being terminated

kills job 3421.

Forcing removal of a job from LSF

Run bkill -r to force the removal of the job from LSF. Use this option when a job cannot be killed in the operating system.

The bkill -r command removes a job from the LSF system without waiting for the job to terminate in the operating system. This sends the same series of signals as bkill without -r, except that the job is removed from the system immediately, the job is marked as EXIT, and job resources that LSF monitors are released as soon as LSF receives the first signal.


Sending a Signal to a Job

LSF uses signals to control jobs, to enforce scheduling policies, or in response to user requests. The principal signals LSF uses are SIGSTOP to suspend a job, SIGCONT to resume a job, and SIGKILL to terminate a job.

Occasionally, you may want to override the default actions. For example, instead of suspending a job, you might want to kill or checkpoint it. You can override the default job control actions by defining the JOB_CONTROLS parameter in your queue configuration. Each queue can have its separate job control actions.

You can also send a signal directly to a job. You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF does allow you to kill, suspend and resume pending jobs.

You must be the owner of a job or an LSF administrator to send signals to a job.

You use the bkill -s command to send a signal to a job. If you issue bkill without the -s option, a SIGKILL signal is sent to the specified jobs to kill them. Twenty seconds before SIGKILL is sent, SIGTERM and SIGINT are sent to give the job a chance to catch the signals and clean up.

On Windows, job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them. Termination is implemented by the TerminateProcess() system call.

Signals on different platforms

LSF translates signal numbers across different platforms because different host types may have different signal numbering. The real meaning of a specific signal is interpreted by the machine from which the bkill command is issued.

For example, if you send signal 18 from a SunOS 4.x host, it means SIGTSTP. If the job is running on HP-UX and SIGTSTP is defined as signal number 25, LSF sends signal 25 to the job.
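
The translation step can be pictured as a lookup by signal name between two per-platform tables. The following Python sketch illustrates the idea only: the tables contain just the two SIGTSTP numbers quoted above, the names SUNOS4, HPUX, and translate are hypothetical, and LSF's own translation mechanism is not shown in this document.

```python
# Illustrative tables: only the SIGTSTP numbers quoted in the text are
# included; real tables would list every signal on each platform.
SUNOS4 = {"SIGTSTP": 18}
HPUX = {"SIGTSTP": 25}


def translate(signal_number, source, target):
    """Map a signal number on the source host to the target host's number."""
    for name, number in source.items():
        if number == signal_number:
            return target[name]
    raise ValueError("unknown signal number: %d" % signal_number)


print(translate(18, SUNOS4, HPUX))  # 25
```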

Sending a signal to a job

Run bkill -s signal job_id, where signal is either the signal name or the signal number. For example:

% bkill -s TSTP 3421
Job <3421> is being signaled

sends the TSTP signal to job 3421.

On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) man pages. On Windows, only customized applications are able to process job control messages specified with the -s option.


Using Job Groups

A collection of jobs can be organized into job groups for easy management. A job group is a container for jobs in much the same way that a directory in a file system is a container for files. For example, a payroll application may have one group of jobs that calculates weekly payments, another job group for calculating monthly salaries, and a third job group that handles the salaries of part-time or contract employees. Users can submit, view, and control jobs according to their groups rather than looking at individual jobs.

Job group hierarchy

Jobs in job groups are organized into a hierarchical tree similar to the directory structure of a file system. Like a file system, the tree contains groups (which are like directories) and jobs (which are like files). Each group can contain other groups or individual jobs. Job groups are created independently of jobs, and can have dependency conditions which control when jobs within the group are considered for scheduling.

Job group path

The job group path is the name and location of a job group within the job group hierarchy. Multiple levels of job groups can be defined to form a hierarchical tree. A job group can contain jobs and sub-groups.

Root job group

LSF maintains a single tree under which all jobs in the system are organized. The top-most level of the tree is represented by a top-level "root" job group, named "/". The root group is owned by the primary LSF Administrator and cannot be removed. Users create new groups under the root group. By default, if you do not specify a job group path name when submitting a job, the job is created under the top-level "root" job group, named "/".

Job group owner

Each group is owned by the user who created it. The login name of the user who creates the job group is the job group owner. Users can add job groups into groups that are owned by other users, and they can submit jobs to groups owned by other users.

Job control under job groups

Job owners can control their own jobs attached to job groups as usual. Job group owners can also control any job under the groups they own and below.

For example:

All users can submit jobs to any job group, and control the jobs they own in all job groups. For jobs submitted by other users:

The LSF administrator can control jobs in any job group.

Creating a job group

Use the bgadd command to create a new job group. You must provide the full group path name for the new job group. The last component of the path is the name of the new group to be created:

If the group hierarchy /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy.
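
The recursive creation of missing parent groups behaves much like creating nested directories. The following Python sketch models that behavior with a plain set of paths; it is an illustration, not LSF code. The function name add_group is hypothetical, and the second path in the example (/risk_group/portfolio1/archive) is invented for the demonstration.

```python
def add_group(existing, path):
    """Add path and any missing parents to existing; return what was created.

    existing -- set of job group paths that already exist
    path     -- full job group path, starting with "/"
    """
    if not path.startswith("/"):
        raise ValueError("a full group path name is required")
    created = []
    current = ""
    for part in (p for p in path.split("/") if p):
        current = current + "/" + part
        if current not in existing:
            existing.add(current)
            created.append(current)
    return created


groups = set()
print(add_group(groups, "/risk_group/portfolio1/current"))
# ['/risk_group', '/risk_group/portfolio1', '/risk_group/portfolio1/current']
print(add_group(groups, "/risk_group/portfolio1/archive"))
# ['/risk_group/portfolio1/archive']
```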

Submitting jobs under a job group

Use the -g option of bsub to submit a job into a job group. The job group does not have to exist before submitting the job. For example:

% bsub -g /risk_group/portfolio1/current myjob
Job <105> is submitted to default queue.

Submits myjob to the job group /risk_group/portfolio1/current.

If group /risk_group/portfolio1/current exists, job 105 is attached to the job group.

If group /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy and the job is attached to the group.

-g and -sla options

You cannot use the -g option with -sla. A job can be attached to either a job group or a service class, but not both.

Viewing jobs in job groups

bjgroup command

Use the bjgroup command to see information about jobs in specific job groups.

% bjgroup
GROUP_NAME         NJOBS   PEND    RUN    SSUSP  USUSP  FINISH
/fund1_grp          5       4       0      1      0      0
/fund2_grp          11      2       5      0      0      4
/bond_grp           2       2       0      0      0      0
/risk_grp           2       1       1      0      0      0
/admi_grp           4       4       0      0      0      0
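
The bjgroup table is easy to load into a script for reporting. The following Python sketch parses the sample output above and checks each row; the function name parse_bjgroup is hypothetical. In the sample, NJOBS equals the sum of the other counters in every row; the check assumes that relationship, which this document does not state as a rule.

```python
SAMPLE = """\
GROUP_NAME         NJOBS   PEND    RUN    SSUSP  USUSP  FINISH
/fund1_grp          5       4       0      1      0      0
/fund2_grp          11      2       5      0      0      4
/bond_grp           2       2       0      0      0      0
/risk_grp           2       1       1      0      0      0
/admi_grp           4       4       0      0      0      0
"""


def parse_bjgroup(text):
    """Return one dictionary per job group, with integer counters."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        tokens = line.split()
        row = {"GROUP_NAME": tokens[0]}
        for name, value in zip(header[1:], tokens[1:]):
            row[name] = int(value)
        rows.append(row)
    return rows


for row in parse_bjgroup(SAMPLE):
    total = sum(row[c] for c in ("PEND", "RUN", "SSUSP", "USUSP", "FINISH"))
    print(row["GROUP_NAME"], row["NJOBS"], row["NJOBS"] == total)
```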

bjobs command

Use the -g option of bjobs and specify a job group path to view jobs attached to the specified group.

% bjobs -g /risk_group
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
113     user1   PEND  normal     hostA                   myjob     Jun 17 16:15
111     user2   RUN   normal     hostA       hostA       myjob     Jun 14 15:13
110     user1   RUN   normal     hostB       hostA       myjob     Jun 12 05:03
104     user3   RUN   normal     hostA       hostC       myjob     Jun 11 13:18

bjobs -l displays the full path to the group to which a job is attached:

% bjobs -l -g /risk_group

Job <101>, User <user1>, Project <default>, Job Group 
</risk_group>, Status <RUN>, Queue <normal>, Command <myjob>
Tue Jun 17 16:21:49: Submitted from host <hostA>, CWD </home/user1>;
Tue Jun 17 16:22:01: Started on <hostA>;
...

Controlling jobs in job groups

Stopping (bstop)

Use the -g option of bstop and specify a job group path to suspend jobs in a job group:

% bstop -g /risk_group 106
Job <106> is being stopped

Use job ID 0 (zero) to suspend all jobs in a job group:

% bstop -g /risk_group/consolidate 0
Job <107> is being stopped
Job <108> is being stopped
Job <109> is being stopped

Resuming (bresume)

Use the -g option of bresume and specify a job group path to resume suspended jobs in a job group:

% bresume -g /risk_group 106
Job <106> is being resumed

Use job ID 0 (zero) to resume all jobs in a job group:

% bresume -g /risk_group 0
Job <109> is being resumed
Job <110> is being resumed
Job <112> is being resumed

Modifying (bmod)

Use the -g option of bmod and specify a job group path to move a job or a job array from one job group to another. For example:

% bmod -g /risk_group/portfolio2/monthly 105

moves job 105 to job group /risk_group/portfolio2/monthly.

Like bsub -g, if the job group does not exist, LSF creates it.

bmod -g cannot be combined with other bmod options. It can operate on finished, running, and pending jobs.

You can modify your own job groups and job groups that other users create under your job groups. The LSF administrator can modify job groups of all users.

You cannot move job array elements from one job group to another, only entire job arrays. A job array can only belong to one job group at a time. You cannot modify the job group of a job attached to a service class.

bhist -l shows job group modification information:

% bhist -l 105

Job <105>, User <user1>, Project <default>, Job Group </risk_group>, Command 
<myjob>
                     
Wed May 14 15:24:07: Submitted from host <hostA>, to Queue <normal>, CWD
<$HOME/lsf51/5.1/sparc-sol7-64/bin>;
Wed May 14 15:24:10: Parameters of Job are changed:
                         Job group changes to: /risk_group/portfolio2/monthly;
Wed May 14 15:24:17: Dispatched to <hostA>;
Wed May 14 15:24:17: Starting (Pid 8602);
...

Terminating (bkill)

Use the -g option of bkill and specify a job group path to terminate jobs in a job group. For example:

% bkill -g /risk_group 106
Job <106> is being terminated

Use job ID 0 (zero) to terminate all jobs in a job group:

% bkill -g /risk_group 0
Job <1413> is being terminated
Job <1414> is being terminated
Job <1415> is being terminated
Job <1416> is being terminated

bkill only kills jobs in the job group you specify. It does not kill jobs in lower level job groups in the path. For example, jobs are attached to job groups /risk_group and /risk_group/consolidate:

% bsub -g /risk_group  myjob
Job <115> is submitted to default queue <normal>.
% bsub -g /risk_group/consolidate myjob2
Job <116> is submitted to default queue <normal>.

The following bkill command only kills jobs in /risk_group, not the subgroup /risk_group/consolidate:

% bkill -g /risk_group 0
Job <115> is being terminated
% bkill -g /risk_group/consolidate 0
Job <116> is being terminated
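
The scoping rule is an exact match on the group path, not a prefix match. The following Python sketch makes the difference explicit using the two jobs from the example above. It is a model only: the function name jobs_in_group is hypothetical, and the include_subgroups argument exists purely for contrast and does not correspond to any bkill option.

```python
def jobs_in_group(job_groups, group, include_subgroups=False):
    """Select job IDs attached to group.

    job_groups -- mapping of job ID to the job group path it is attached to
    """
    selected = []
    for job_id, path in sorted(job_groups.items()):
        if path == group:
            selected.append(job_id)
        elif include_subgroups and path.startswith(group.rstrip("/") + "/"):
            # Shown for contrast only; bkill -g does not behave this way.
            selected.append(job_id)
    return selected


attached = {115: "/risk_group", 116: "/risk_group/consolidate"}
print(jobs_in_group(attached, "/risk_group"))              # [115]
print(jobs_in_group(attached, "/risk_group/consolidate"))  # [116]
```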

Deleting (bgdel)

Use the bgdel command to remove a job group. The job group cannot contain any jobs. For example:

% bgdel /risk_group
Job group /risk_group is deleted.

deletes the job group /risk_group and all its subgroups.


      Date Modified: January 12, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.