- Job States
- Viewing Job Information
- Changing Job Order Within Queues
- Switching Jobs from One Queue to Another
- Forcing Job Execution
- Suspending and Resuming Jobs
- Killing Jobs
- Sending a Signal to a Job
- Using Job Groups
Job States
The bjobs command displays the current state of the job.
Most jobs enter only three states:
Job state   Description
PEND        Waiting in a queue for scheduling and dispatch
RUN         Dispatched to a host and running
DONE        Finished normally with a zero exit value
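If a job is suspended, it is in one of three states:
Job state   Description
PSUSP       Suspended by its owner or the LSF administrator while in PEND state
USUSP       Suspended by its owner or the LSF administrator after being dispatched
SSUSP       Suspended by the LSF system after being dispatched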
A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram.
[Diagram: job state transitions]
Use the bjobs -r command to display running jobs.
Use the bjobs -d command to display recently completed jobs.
Pending jobs
A job remains pending until all conditions for its execution are met. Some of the conditions are:
- Start time specified by the user when the job is submitted
- Load conditions on qualified hosts
- Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs
- Run windows during which jobs from the queue can run
- Limits on the number of job slots configured for a queue, a host, or a user
- Relative priority to other users and jobs
- Availability of the specified resources
- Job dependency and pre-execution conditions
Use the bjobs -p command to display the reason why a job is pending.
Suspended jobs
A job can be suspended at any time. A job can be suspended by its owner, by the LSF administrator, by the root user (superuser), or by LSF.
After a job has been dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.
If the load on the execution host or hosts becomes too high, batch jobs could be interfering among themselves or could be interfering with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.
LSF suspends jobs according to the priority of the job's queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise.
Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.
A system-suspended job can later be resumed by LSF if the load condition on the execution hosts falls low enough or when the closed run window of the queue opens again.
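The load-based suspension rule described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the LSF implementation: the Job class, the function names, and the threshold dictionaries are hypothetical, every load index is assumed to be of the "higher is worse" kind, a single set of queue thresholds stands in for the per-queue conditions, and scheduling-policy exceptions are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: int
    queue_priority: int  # assumption: larger number = higher priority queue


def load_exceeded(load: Dict[str, float], stop: Dict[str, float]) -> bool:
    """Return True if any measured load index is above its suspending threshold."""
    return any(name in load and load[name] > limit for name, limit in stop.items())


def pick_job_to_suspend(running: List[Job],
                        load: Dict[str, float],
                        host_stop: Dict[str, float],
                        queue_stop: Dict[str, float]) -> Optional[Job]:
    """Pick the lowest priority running job if either threshold set is exceeded."""
    if not running:
        return None
    if load_exceeded(load, host_stop) or load_exceeded(load, queue_stop):
        return min(running, key=lambda job: job.queue_priority)
    return None
```

In this model, a host whose r1m load is 3.0 against a hypothetical host threshold of 2.5 would have its lowest priority job selected for suspension, while a load of 1.0 would leave all jobs running.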
Viewing suspension reasons
Use the bjobs -s command to display the reason why a job was suspended.
WAIT state (chunk jobs)
If you have configured chunk job queues, members of a chunk job that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers, even though the entire chunk job has been dispatched and occupies a job slot. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.
You can switch (bswitch) or migrate (bmig) a chunk job member in WAIT state to another queue.
Use the bhist -l command to display jobs in WAIT status. Jobs are shown as Waiting ...
The bjobs -l command does not display a WAIT reason in the list of pending jobs.
See Chunk Job Dispatch for more information about chunk jobs.
Exited jobs
A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:
- The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.
- The job is not able to be dispatched before it reaches its termination deadline, and thus is aborted by LSF.
- The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.
- The job exits with a non-zero exit status.
You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Handling Host-level Job Exceptions for more information.
Post-execution states
Some jobs may not be considered complete until some post-job processing is performed. For example, a job may need to exit from a post-execution job script, clean up job files, or transfer job output after the job completes.
The DONE or EXIT job states do not indicate whether post-processing is complete, so jobs that depend on post-processing may start prematurely. Use the post_done and post_err keywords on the bsub -w command to specify job dependency conditions for job post-processing. The corresponding job states POST_DONE and POST_ERR indicate the state of the post-processing.
After the job completes, you cannot perform any job control on the post-processing. Post-processing exit codes are not reported to LSF. The post-processing of a repetitive job cannot be longer than the repetition period.
Use the bhist command to display the POST_DONE and POST_ERR states. The resource usage of post-processing is not included in the job resource usage.
See Pre-Execution and Post-Execution Commands for more information.
Viewing Job Information
The bjobs command is used to display job information. By default, bjobs displays information for the user who invoked the command. For more information about bjobs, see the LSF Reference and the bjobs(1) man page.
Viewing all jobs for all users
Run bjobs -u all to display all jobs for all users. Job information is displayed in the following order:
- Running jobs
- Pending jobs in the order in which they will be scheduled
- Jobs in high priority queues are listed before those in lower priority queues
For example:
% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1004  user1 RUN   short    hostA     hostA     job0     Dec 16 09:23
1235  user3 PEND  priority hostM               job1     Dec 11 13:55
1234  user2 SSUSP normal   hostD     hostM     job3     Dec 11 10:09
1250  user1 PEND  short    hostA               job4     Dec 11 13:59
Viewing jobs for specific users
Run bjobs -u user_name to display jobs for a specific user. For example:
% bjobs -u user1
JOBID USER  STAT  QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
2225  user1 USUSP normal hostA               job1     Nov 16 11:55
2226  user1 PSUSP normal hostA               job2     Nov 16 12:30
2227  user1 PSUSP normal hostA               job3     Nov 16 12:31
Viewing exception status for jobs (bjobs)
Use bjobs to display job exceptions. bjobs -l shows exception information for unfinished jobs, and bjobs -x -l shows finished as well as unfinished jobs.
For example, the following bjobs command shows that job 2 is running longer than the configured JOB_OVERRUN threshold, and is consuming no CPU time. bjobs displays the job idle factor, and both job overrun and job idle exceptions. Job 1 finished before the configured JOB_UNDERRUN threshold, so bjobs shows exception status of underrun:
% bjobs -x -l -a
Job <2>, User <user1>, Project <default>, Status <RUN>, Queue <normal>,
                     Command <sleep 600>
Wed Aug 13 14:23:35: Submitted from host <hostA>, CWD <$HOME>,
                     Output File </dev/null>, Specified Hosts <hostB>;
Wed Aug 13 14:23:43: Started on <hostB>, Execution Home </home/user1>,
                     Execution CWD </home/user1>;
                     Resource usage collected.
                     IDLE_FACTOR(cputime/runtime): 0.00
                     MEM: 3 Mbytes; SWAP: 4 Mbytes; NTHREAD: 3
                     PGID: 5027; PIDs: 5027 5028 5029
SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -
EXCEPTION STATUS: overrun idle
------------------------------------------------------------------------------
Job <1>, User <user1>, Project <default>, Status <DONE>, Queue <normal>,
                     Command <sleep 20>
Wed Aug 13 14:18:00: Submitted from host <hostA>, CWD <$HOME>,
                     Output File </dev/null>, Specified Hosts <hostB>;
Wed Aug 13 14:18:10: Started on <hostB>, Execution Home </home/user1>,
                     Execution CWD </home/user1>;
Wed Aug 13 14:18:50: Done successfully. The CPU time used is 0.2 seconds.
SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -
EXCEPTION STATUS: underrun
Use bacct -l -x to trace the history of job exceptions.
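When scripting around bjobs, the default tabular output shown in this section can be split by column position rather than by whitespace, because EXEC_HOST is empty for jobs that have not been dispatched. The following Python sketch illustrates the idea. It assumes that every value is aligned under its column header, as in the examples above; real output with long or truncated values may not satisfy this, and the parse_bjobs function name is hypothetical.

```python
import re
from typing import Dict, List


def parse_bjobs(text: str) -> List[Dict[str, str]]:
    """Split bjobs-style tabular output into one dictionary per job.

    Column boundaries are taken from the start offsets of the header words,
    so an empty EXEC_HOST field does not shift the later columns.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0]
    matches = list(re.finditer(r"\S+", header))
    names = [m.group() for m in matches]
    starts = [m.start() for m in matches]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    rows = []
    for line in lines[1:]:
        rows.append({name: line[start:end].strip()
                     for name, start, end in zip(names, starts, ends)})
    return rows
```

A script could then select pending jobs with a test such as row["STAT"] == "PEND" without being confused by the blank EXEC_HOST field.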
Changing Job Order Within Queues
By default, LSF dispatches jobs in a queue in the order of arrival (that is, first-come-first-served), subject to availability of suitable server hosts.
Use the btop and bbot commands to change the position of pending jobs, or of pending job array elements, to affect the order in which jobs are considered for dispatch. Users can only change the relative position of their own jobs, and LSF administrators can change the position of any user's jobs.
bbot
Moves jobs relative to your last job in the queue.
If invoked by a regular user, bbot moves the selected job after the last job with the same priority submitted by the user to the queue.
If invoked by the LSF administrator, bbot moves the selected job after the last job with the same priority submitted to the queue.
btop
Moves jobs relative to your first job in the queue.
If invoked by a regular user, btop moves the selected job before the first job with the same priority submitted by the user to the queue.
If invoked by the LSF administrator, btop moves the selected job before the first job with the same priority submitted to the queue.
Moving a job to the top of the queue
In the following example, job 5311 is moved to the top of the queue. Since job 5308 is already running, job 5311 is placed in the queue after job 5308.
Note that user1's job is still in the same position on the queue. user2 cannot use btop to get extra jobs at the top of the queue; when one of his jobs moves up the queue, the rest of his jobs move down.
% bjobs -u all
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user2 RUN  normal hostA     hostD     /s500    Oct 23 10:16
5309  user2 PEND night  hostA               /s200    Oct 23 11:04
5310  user1 PEND night  hostB               /myjob   Oct 23 13:45
5311  user2 PEND night  hostA               /s700    Oct 23 18:17
% btop 5311
Job <5311> has been moved to position 1 from top.
% bjobs -u all
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user2 RUN  normal hostA     hostD     /s500    Oct 23 10:16
5311  user2 PEND night  hostA               /s700    Oct 23 18:17
5310  user1 PEND night  hostB               /myjob   Oct 23 13:45
5309  user2 PEND night  hostA               /s200    Oct 23 11:04
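The effect of btop for a regular user, as seen in the example above, can be modeled with a short Python sketch. This is only an illustration of the observed behavior, not LSF's algorithm: it assumes all pending jobs are in one queue with the same priority, and the btop_as_user function and the (job_id, owner) tuple layout are hypothetical.

```python
from typing import List, Tuple

Job = Tuple[int, str]  # (job_id, owner)


def btop_as_user(pending: List[Job], job_id: int) -> List[Job]:
    """Move job_id ahead of its owner's other jobs, using only the owner's slots.

    Jobs that belong to other users keep their positions in the queue.
    """
    owner = next(o for j, o in pending if j == job_id)
    # Positions in the queue currently held by this owner's jobs.
    slots = [i for i, (_, o) in enumerate(pending) if o == owner]
    mine = [pending[i] for i in slots]
    selected = next(job for job in mine if job[0] == job_id)
    reordered = [selected] + [job for job in mine if job[0] != job_id]
    result = list(pending)
    for slot, job in zip(slots, reordered):
        result[slot] = job
    return result
```

Applied to the pending jobs 5309 (user2), 5310 (user1), and 5311 (user2), moving job 5311 gives the order 5311, 5310, 5309, which matches the bjobs output in the example.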
Switching Jobs from One Queue to Another
You can use the command bswitch to change jobs from one queue to another. This is useful if you submit a job to the wrong queue, or if the job is suspended because of queue thresholds or run windows and you would like to resume the job.
Switching a single job
Run bswitch to move pending and running jobs from queue to queue.
In the following example, job 5309 is switched to the priority queue:
% bswitch priority 5309
Job <5309> is switched to queue <priority>
% bjobs -u all
JOBID USER  STAT QUEUE    FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user2 RUN  normal   hostA     hostD     /job500  Oct 23 10:16
5309  user2 RUN  priority hostA     hostB     /job200  Oct 23 11:04
5311  user2 PEND night    hostA               /job700  Oct 23 18:17
5310  user1 PEND night    hostB               /myjob   Oct 23 13:45
Switching all jobs
Run bswitch -q from_queue to_queue 0 to switch all the jobs in a queue to another queue. The example below selects jobs from the night queue and switches them to the idle queue.
The -q option is used to operate on all jobs in a queue. The job ID number 0 specifies that all jobs from the night queue should be switched to the idle queue:
% bswitch -q night idle 0
Job <5308> is switched to queue <idle>
Job <5310> is switched to queue <idle>
Forcing Job Execution
A pending job can be forced to run with the brun command. This operation can only be performed by an LSF administrator.
You can force a job to run on a particular host, force it to run until completion, and apply other restrictions. For more information, see the brun command.
When a job is forced to run, any other constraints associated with the job such as resource requirements or dependency conditions are ignored.
In this situation you may see some job slot limits, such as the maximum number of jobs that can run on a host, being violated. A job that is forced to run cannot be preempted.
Forcing a pending job to run
Run brun -m hostname job_ID to force a pending job to run. You must specify the host on which the job will run. For example, the following command will force the sequential job 104 to run on hostA:
% brun -m hostA 104
Suspending and Resuming Jobs
A job can be suspended by its owner or the LSF administrator. These jobs are considered user-suspended and are displayed by bjobs as USUSP.
If a user suspends a high priority job from a non-preemptive queue, the load may become low enough for LSF to start a lower priority job in its place. The load created by the low priority job can prevent the high priority job from resuming. This can be avoided by configuring preemptive queues.
Suspending a job
Run bstop job_ID. Your job goes into USUSP state if the job is already started, or into PSUSP state if it is pending. For example:
% bstop 3421
Job <3421> is being stopped
suspends job 3421.
bstop sends the following signals to the job:
- SIGTSTP for parallel or interactive jobs
  SIGTSTP is caught by the master process and passed to all the slave processes running on other hosts.
- SIGSTOP for sequential jobs
  SIGSTOP cannot be caught by user programs. The SIGSTOP signal can be configured with the LSB_SIGSTOP parameter in lsf.conf.
bstop causes the job to be suspended.
Resuming a job
Run bresume job_ID. For example:
% bresume 3421
Job <3421> is being resumed
resumes job 3421.
Resuming a user-suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
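The state changes caused by bstop and bresume, as described in this section, can be summarized as a small lookup table. The Python sketch below is illustrative only: the TRANSITIONS table and next_state function are hypothetical names, and the PSUSP to PEND entry is an assumption that this section does not state explicitly.

```python
# (command, current state) -> next state
TRANSITIONS = {
    ("bstop", "RUN"): "USUSP",      # job already started
    ("bstop", "PEND"): "PSUSP",     # job still pending
    ("bresume", "USUSP"): "SSUSP",  # LSF later resumes it according to load
    ("bresume", "PSUSP"): "PEND",   # assumption: not stated in this section
}


def next_state(command: str, state: str) -> str:
    """Return the state after the command, or the same state if it does not apply."""
    return TRANSITIONS.get((command, state), state)
```

The USUSP to SSUSP entry reflects the point made above: a resumed job does not return to RUN directly, but waits for LSF to schedule it according to the load conditions.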
Killing Jobs
The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs.
Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from mbatchd to sbatchd. sbatchd waits for the job to exit before reporting the status. Because of these delays, for a short period of time after the bkill command has been issued, bjobs may still report that the job is running.
On Windows, job control messages replace the SIGINT and SIGTERM signals, and termination is implemented by the TerminateProcess() system call.
Killing a job
Run bkill job_ID:
% bkill 3421
Job <3421> is being terminated
kills job 3421.
Forcing removal of a job from LSF
Run bkill -r to force the removal of the job from LSF. Use this option when a job cannot be killed in the operating system.
The bkill -r command removes a job from the LSF system without waiting for the job to terminate in the operating system. This sends the same series of signals as bkill without -r, except that the job is removed from the system immediately, the job is marked as EXIT, and job resources that LSF monitors are released as soon as LSF receives the first signal.
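Because SIGINT and SIGTERM are sent before SIGKILL, a job can install handlers for them and use the interval to clean up. The following Python sketch shows one way a job script might do this on UNIX. The handler name and the received list are hypothetical, and the clean-up work itself is left as a comment.

```python
import signal

received = []  # names of the termination signals seen so far


def handle_termination(signum, frame):
    """Record the signal; a real job would flush output and remove temporary files here."""
    received.append(signal.Signals(signum).name)


# SIGKILL cannot be caught, so only the earlier signals get handlers.
signal.signal(signal.SIGINT, handle_termination)
signal.signal(signal.SIGTERM, handle_termination)
```

A real job would finish its clean-up and exit promptly after the first of these signals, since the SIGKILL that follows cannot be caught or ignored.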
Sending a Signal to a Job
LSF uses signals to control jobs, to enforce scheduling policies, or in response to user requests. The principal signals LSF uses are SIGSTOP to suspend a job, SIGCONT to resume a job, and SIGKILL to terminate a job.
Occasionally, you may want to override the default actions. For example, instead of suspending a job, you might want to kill or checkpoint it. You can override the default job control actions by defining the JOB_CONTROLS parameter in your queue configuration. Each queue can have its separate job control actions.
You can also send a signal directly to a job. You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF does allow you to kill, suspend and resume pending jobs.
You must be the owner of a job or an LSF administrator to send signals to a job.
You use the bkill -s command to send a signal to a job. If you issue bkill without the -s option, a SIGKILL signal is sent to the specified jobs to kill them. Twenty seconds before SIGKILL is sent, SIGTERM and SIGINT are sent to give the job a chance to catch the signals and clean up.
On Windows, job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them. Termination is implemented by the TerminateProcess() system call.
Signals on different platforms
LSF translates signal numbers across different platforms because different host types may have different signal numbering. The real meaning of a specific signal is interpreted by the machine from which the bkill command is issued.
For example, if you send signal 18 from a SunOS 4.x host, it means SIGTSTP. If the job is running on HP-UX and SIGTSTP is defined as signal number 25, LSF sends signal 25 to the job.
Sending a signal to a job
Run bkill -s signal job_id, where signal is either the signal name or the signal number. For example:
% bkill -s TSTP 3421
Job <3421> is being signaled
sends the TSTP signal to job 3421.
On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) man pages. On Windows, only customized applications are able to process job control messages specified with the -s option.
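Since signal numbers differ between platforms while names do not, it can help to look up the local number for a name before relying on it. The Python sketch below does this with the standard signal module on a UNIX host. The local_signal_number function is a hypothetical helper, not part of LSF.

```python
import signal


def local_signal_number(name: str) -> int:
    """Map a signal name such as 'TSTP' or 'SIGTSTP' to this host's signal number."""
    full_name = name if name.startswith("SIG") else "SIG" + name
    # signal.Signals is an enumeration of the signals defined on this platform.
    return signal.Signals[full_name].value
```

The value returned for a name such as TSTP depends on the host where the code runs, which is the same reason the text above recommends thinking in terms of the signal's meaning rather than its number.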
Using Job Groups
A collection of jobs can be organized into job groups for easy management. A job group is a container for jobs in much the same way that a directory in a file system is a container for files. For example, a payroll application may have one group of jobs that calculates weekly payments, another job group for calculating monthly salaries, and a third job group that handles the salaries of part-time or contract employees. Users can submit, view, and control jobs according to their groups rather than looking at individual jobs.
Jobs in job groups are organized into a hierarchical tree similar to the directory structure of a file system. Like a file system, the tree contains groups (which are like directories) and jobs (which are like files). Each group can contain other groups or individual jobs. Job groups are created independently of jobs, and can have dependency conditions which control when jobs within the group are considered for scheduling.
The job group path is the name and location of a job group within the job group hierarchy. Multiple levels of job groups can be defined to form a hierarchical tree. A job group can contain jobs and sub-groups.
LSF maintains a single tree under which all jobs in the system are organized. The top-most level of the tree is represented by a top-level "root" job group, named "/". The root group is owned by the primary LSF Administrator and cannot be removed. Users create new groups under the root group. By default, if you do not specify a job group path name when submitting a job, the job is created under the top-level "root" job group, named "/".
Each group is owned by the user who created it. The login name of the user who creates the job group is the job group owner. Users can add job groups to groups that are owned by other users, and they can submit jobs to groups owned by other users.
Job owners can control their own jobs attached to job groups as usual. Job group owners can also control any job under the groups they own and below.
For example:
- Job group /A is created by user1
- Job group /A/B is created by user2
- Job group /A/B/C is created by user3
All users can submit jobs to any job group, and control the jobs they own in all job groups. For jobs submitted by other users:
- user1 can control jobs submitted by other users in all 3 job groups: /A, /A/B, and /A/B/C
- user2 can control jobs submitted by other users only in 2 job groups: /A/B and /A/B/C
- user3 can control jobs submitted by other users only in job group /A/B/C
The LSF administrator can control jobs in any job group.
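The ownership rules in this example can be expressed as a walk from the job's group up toward the root. The Python sketch below is a model of the rules as stated here, not LSF code: the can_control function, the group_owners mapping, and the admins set are hypothetical, and ownership of the root group "/" is not modeled.

```python
from typing import Dict, FrozenSet


def can_control(user: str, job_owner: str, job_group: str,
                group_owners: Dict[str, str],
                admins: FrozenSet[str] = frozenset()) -> bool:
    """Return True if user may control a job attached to job_group."""
    # Administrators control any job; owners control their own jobs anywhere.
    if user in admins or user == job_owner:
        return True
    # Group owners control jobs in the groups they own and in all groups below.
    path = job_group
    while path and path != "/":
        if group_owners.get(path) == user:
            return True
        path = path.rsplit("/", 1)[0]  # step up to the parent group
    return False
```

With /A owned by user1, /A/B by user2, and /A/B/C by user3, this model lets user1 control another user's job in /A/B/C, while user3 cannot control another user's job in /A/B.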
Creating a job group
Use the bgadd command to create a new job group. You must provide the full group path name for the new job group. The last component of the path is the name of the new group to be created:
% bgadd /risk_group
creates a job group named risk_group under the root group /.
% bgadd /risk_group/portfolio1
creates a job group named portfolio1 under job group /risk_group.
% bgadd /risk_group/portfolio1/current
creates a job group named current under job group /risk_group/portfolio1.
If the group hierarchy /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy.
Submitting jobs under a job group
Use the -g option of bsub to submit a job into a job group. The job group does not have to exist before submitting the job. For example:
% bsub -g /risk_group/portfolio1/current myjob
Job <105> is submitted to default queue.
submits myjob to the job group /risk_group/portfolio1/current.
If group /risk_group/portfolio1/current exists, job 105 is attached to the job group.
If group /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy and the job is attached to the group.
You cannot use the -g option with -sla. A job can either be attached to a job group or a service class, but not both.
Viewing jobs in job groups
Use the bjgroup command to see information about jobs in specific job groups.
% bjgroup
GROUP_NAME  NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
/fund1_grp  5      4     0    1      0      0
/fund2_grp  11     2     5    0      0      4
/bond_grp   2      2     0    0      0      0
/risk_grp   2      1     1    0      0      0
/admi_grp   4      4     0    0      0      0
Use the -g option of bjobs and specify a job group path to view jobs attached to the specified group.
% bjobs -g /risk_group
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
113   user1 PEND normal hostA               myjob    Jun 17 16:15
111   user2 RUN  normal hostA     hostA     myjob    Jun 14 15:13
110   user1 RUN  normal hostB     hostA     myjob    Jun 12 05:03
104   user3 RUN  normal hostA     hostC     myjob    Jun 11 13:18
bjobs -l displays the full path to the group to which a job is attached:
% bjobs -l -g /risk_group
Job <101>, User <user1>, Project <default>, Job Group </risk_group>,
                     Status <RUN>, Queue <normal>, Command <myjob>
Tue Jun 17 16:21:49: Submitted from host <hostA>, CWD </home/user1>;
Tue Jun 17 16:22:01: Started on <hostA>;
...
Controlling jobs in job groups
Use the -g option of bstop and specify a job group path to suspend jobs in a job group:
% bstop -g /risk_group 106
Job <106> is being stopped
Use job ID 0 (zero) to suspend all jobs in a job group:
% bstop -g /risk_group/consolidate 0
Job <107> is being stopped
Job <108> is being stopped
Job <109> is being stopped
Use the -g option of bresume and specify a job group path to resume suspended jobs in a job group:
% bresume -g /risk_group 106
Job <106> is being resumed
Use job ID 0 (zero) to resume all jobs in a job group:
% bresume -g /risk_group 0
Job <109> is being resumed
Job <110> is being resumed
Job <112> is being resumed
Use the -g option of bmod and specify a job group path to move a job or a job array from one job group to another. For example:
% bmod -g /risk_group/portfolio2/monthly 105
moves job 105 to job group /risk_group/portfolio2/monthly.
Like bsub -g, if the job group does not exist, LSF creates it.
bmod -g cannot be combined with other bmod options. It can operate on finished, running, and pending jobs.
You can modify your own job groups and job groups that other users create under your job groups. The LSF administrator can modify job groups of all users.
You cannot move job array elements from one job group to another, only entire job arrays. A job array can only belong to one job group at a time. You cannot modify the job group of a job attached to a service class.
bhist -l shows job group modification information:
% bhist -l 105
Job <105>, User <user1>, Project <default>, Job Group </risk_group>, Command <myjob>
Wed May 14 15:24:07: Submitted from host <hostA>, to Queue <normal>, CWD <$HOME/lsf51/5.1/sparc-sol7-64/bin>;
Wed May 14 15:24:10: Parameters of Job are changed: Job group changes to: /risk_group/portfolio2/monthly;
Wed May 14 15:24:17: Dispatched to <hostA>;
Wed May 14 15:24:17: Starting (Pid 8602);
...
Use the -g option of bkill and specify a job group path to terminate jobs in a job group. For example:
% bkill -g /risk_group 106
Job <106> is being terminated
Use job ID 0 (zero) to terminate all jobs in a job group:
% bkill -g /risk_group 0
Job <1413> is being terminated
Job <1414> is being terminated
Job <1415> is being terminated
Job <1416> is being terminated
bkill only kills jobs in the job group you specify. It does not kill jobs in lower level job groups in the path. For example, jobs are attached to job groups /risk_group and /risk_group/consolidate:
% bsub -g /risk_group myjob
Job <115> is submitted to default queue <normal>.
% bsub -g /risk_group/consolidate myjob2
Job <116> is submitted to default queue <normal>.
The following bkill command only kills jobs in /risk_group, not the subgroup /risk_group/consolidate:
% bkill -g /risk_group 0
Job <115> is being terminated
% bkill -g /risk_group/consolidate 0
Job <116> is being terminated
Use the bgdel command to remove a job group. The job group cannot contain any jobs. For example:
% bgdel /risk_group
Job group /risk_group is deleted.
deletes the job group /risk_group and all its subgroups.
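The scoping rule for bkill -g described earlier in this section, where job ID 0 selects only the jobs attached directly to the named group, amounts to an exact match on the group path. A minimal Python sketch of that selection follows. The jobs_in_group function and the job-to-group mapping are hypothetical and only model the rule as stated.

```python
from typing import Dict, List


def jobs_in_group(job_groups: Dict[int, str], group: str) -> List[int]:
    """Return the IDs of jobs attached directly to group, excluding subgroups."""
    return sorted(job_id for job_id, path in job_groups.items() if path == group)
```

With job 115 attached to /risk_group and job 116 attached to /risk_group/consolidate, selecting /risk_group yields only job 115, which is why the example needs a second bkill command for the subgroup.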
Date Modified: January 12, 2004
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.