- Queue States
- Viewing Queue Information
- Controlling Queues
- Adding and Removing Queues
- Managing Queues
Queue States
Queue states, displayed by bqueues, describe the ability of a queue to accept and start batch jobs using a combination of the following states:
- Open queues accept new jobs
- Closed queues do not accept new jobs
- Active queues start jobs on available hosts
- Inactive queues hold all jobs
Queue state can be changed by an LSF administrator or root.
Queues can also be activated and inactivated by run and dispatch windows (configured in lsb.queues, displayed by bqueues -l).
bqueues -l displays Inact_Adm when a queue is explicitly inactivated by an administrator (badmin qinact), and Inact_Win when it is inactivated by a run or dispatch window.
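As a minimal sketch of how the two-part state could be interpreted in a script, the following splits a STATUS value into its admission and dispatch halves. The function name and the returned dictionary keys are hypothetical, and the sketch assumes the value always has the form ADMISSION:DISPATCH, as in the bqueues examples in this document.

```python
def decode_queue_status(status):
    """Split a STATUS value such as "Open:Active" into its two parts."""
    admission, sep, dispatch = status.partition(":")
    if not sep:
        raise ValueError(f"expected ADMISSION:DISPATCH, got {status!r}")
    # Inact_Adm and Inact_Win are the detailed inactive states that
    # bqueues -l reports; any other value has no recorded reason here.
    inactive_reasons = {
        "Inact_Adm": "inactivated by an administrator",
        "Inact_Win": "inactivated by a run or dispatch window",
    }
    return {
        "accepts_new_jobs": admission == "Open",
        "starts_jobs": dispatch == "Active",
        "inactive_reason": inactive_reasons.get(dispatch),
    }


print(decode_queue_status("Open:Inactive"))
```

A queue reported as Open:Inactive therefore still accepts submissions but holds every job until it is activated again.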
Viewing Queue Information
The bqueues command displays information about queues. The bqueues -l option also gives current statistics about the jobs in a particular queue, such as the total number of jobs in the queue, the number of jobs running, suspended, and so on.
To view the...                     Run...
Available queues                   bqueues
Queue status                       bqueues
Detailed queue information         bqueues -l
State change history of a queue    badmin qhist
Queue administrators               bqueues -l for queue
In addition to the procedures listed here, see the bqueues(1) man page for more details.
Viewing available queues and queue status
Run bqueues. You can view the current status of a particular queue or all queues. The bqueues command also displays available queues in the cluster.

    % bqueues
    QUEUE_NAME     PRIO  STATUS          MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
    interactive    400   Open:Active     -    -     -     -     2      0     2    0
    priority       43    Open:Active     -    -     -     -     16     4     11   1
    night          40    Open:Inactive   -    -     -     -     4      4     0    0
    short          35    Open:Active     -    -     -     -     6      1     5    0
    license        33    Open:Active     -    -     -     -     0      0     0    0
    normal         30    Open:Active     -    -     -     -     0      0     0    0
    idle           20    Open:Active     -    -     -     -     6      3     1    2

A dash (-) in any entry means that the column does not apply to the row. In this example some queues have no per-queue, per-user or per-processor job limits configured, so the MAX, JL/U and JL/P entries are shown as a dash.
Viewing detailed queue information
To see the complete status and configuration for each queue, run bqueues -l. You can specify queue names on the command-line to select specific queues. In the example below, more detail is requested for the queue normal.

    % bqueues -l normal

    QUEUE: normal
      -- For normal low priority jobs, running only if hosts are lightly loaded. This is the default queue.

    PARAMETERS/STATISTICS
    PRIO  NICE  STATUS       MAX  JL/U  JL/P  NJOBS  PEND  RUN  SSUSP  USUSP
    40    20    Open:Active  100  50    11    1      1     0    0      0

    Migration threshold is 30 min.

    CPULIMIT              RUNLIMIT
    20 min of IBM350      342800 min of IBM350

    FILELIMIT  DATALIMIT  STACKLIMIT  CORELIMIT  MEMLIMIT  PROCLIMIT
    20000 K    20000 K    2048 K      20000 K    5000 K    3

    SCHEDULING PARAMETERS
               r15s  r1m  r15m  ut   pg   io   ls  it  tmp  swp  mem
    loadSched  -     0.7  1.0   0.2  4.0  50   -   -   -    -    -
    loadStop   -     1.5  2.5   -    8.0  240  -   -   -    -    -

    SCHEDULING POLICIES: FAIRSHARE PREEMPTIVE PREEMPTABLE EXCLUSIVE
    USER_SHARES: [groupA, 70] [groupB, 15] [default, 1]

    DEFAULT HOST SPECIFICATION : IBM350

    RUN_WINDOWS: 2:40-23:00 23:30-1:30
    DISPATCH_WINDOWS: 1:00-23:50

    USERS: groupA/ groupB/ user5
    HOSTS: hostA, hostD, hostB
    ADMINISTRATORS: user7
    PRE_EXEC: /tmp/apex_pre.x > /tmp/preexec.log 2>&1
    POST_EXEC: /tmp/apex_post.x > /tmp/postexec.log 2>&1
    REQUEUE_EXIT_VALUES: 45

Viewing the state change history of a queue
Run badmin qhist to display the times when queues are opened, closed, activated, and inactivated.

    % badmin qhist
    Wed Mar 31 09:03:14: Queue <normal> closed by user or administrator <root>.
    Wed Mar 31 09:03:29: Queue <normal> opened by user or administrator <root>.

Viewing queue administrators
Use bqueues -l for the queue.
Viewing exception status for queues (bqueues)
Use bqueues to display the configured threshold for job exceptions and the current number of jobs in the queue in each exception state.
For example, queue normal configures a JOB_IDLE threshold of 0.10, a JOB_OVERRUN threshold of 5 minutes, and a JOB_UNDERRUN threshold of 2 minutes. The following bqueues command shows no overrun jobs, one job that finished in less than 2 minutes (underrun) and one job that triggered an idle exception (less than idle factor of 0.10):

    % bqueues -l normal

    QUEUE: normal
      -- For normal low priority jobs, running only if hosts are lightly loaded. This is the default queue.

    PARAMETERS/STATISTICS
    PRIO  NICE  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SSUSP  USUSP  RSV
    30    20    Open:Active  -    -     -     -     0      0     0    0      0      0

    STACKLIMIT  MEMLIMIT
    2048 K      5000 K

    SCHEDULING PARAMETERS
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -

    JOB EXCEPTION PARAMETERS
               OVERRUN(min)  UNDERRUN(min)  IDLE(cputime/runtime)
    Threshold  5             2              0.10
    Jobs       0             1              1

    USERS: all users
    HOSTS: all allremote
    CHUNK_JOB_SIZE: 3
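The exception rows in this output can be picked out by a script. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the Threshold and Jobs rows each occupy one line after the JOB EXCEPTION PARAMETERS heading, with the columns in the order overrun, underrun, idle as in the example above.

```python
def parse_job_exceptions(bqueues_l_output):
    """Return the Threshold and Jobs rows of the JOB EXCEPTION PARAMETERS block."""
    columns = ("overrun", "underrun", "idle")
    rows = {}
    in_section = False
    for line in bqueues_l_output.splitlines():
        fields = line.split()
        if line.strip().startswith("JOB EXCEPTION PARAMETERS"):
            in_section = True
        elif in_section and fields and fields[0] in ("Threshold", "Jobs"):
            # A dash means the exception is not configured for this queue.
            values = [None if v == "-" else float(v) for v in fields[1:4]]
            rows[fields[0].lower()] = dict(zip(columns, values))
    return rows


sample = """\
JOB EXCEPTION PARAMETERS
           OVERRUN(min)  UNDERRUN(min)  IDLE(cputime/runtime)
Threshold  5             2              0.10
Jobs       0             1              1
"""

print(parse_job_exceptions(sample))
```

Unconfigured thresholds, shown as a dash, come back as None so that a caller can tell "not configured" apart from a threshold of zero.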
Controlling Queues
Queues are controlled by an LSF Administrator or root issuing a command or through configured dispatch and run windows.
Closing a queue
Run badmin qclose:

    % badmin qclose normal
    Queue <normal> is closed

When a user tries to submit a job to a closed queue the following message is displayed:

    % bsub -q normal ...
    normal: Queue has been closed

Opening a queue
Run badmin qopen:

    % badmin qopen normal
    Queue <normal> is opened

Inactivating a queue
Run badmin qinact:

    % badmin qinact normal
    Queue <normal> is inactivated

Activating a queue
Run badmin qact:

    % badmin qact normal
    Queue <normal> is activated

Logging a comment when controlling a queue
Use the -C option of the badmin queue commands qclose, qopen, qact, and qinact to log an administrator comment in lsb.events. For example,

    % badmin qclose -C "change configuration" normal

The comment text change configuration is recorded in lsb.events.
A new event record is recorded for each queue event. For example:

    % badmin qclose -C "add user" normal

followed by

    % badmin qclose -C "add user user1" normal

will generate records in lsb.events:

    "QUEUE_CTRL" "6.0" 1050082373 1 "normal" 32185 "lsfadmin" "add user"
    "QUEUE_CTRL" "6.0" 1050082380 1 "normal" 32185 "lsfadmin" "add user user1"

Use badmin hist or badmin qhist to display administrator comments for closing and opening queues. For example:

    % badmin qhist
    Fri Apr 4 10:50:36: Queue <normal> closed by administrator <lsfadmin> change configuration.

bqueues -l also displays the comment text:

    % bqueues -l normal

    QUEUE: normal
      -- For normal low priority jobs, running only if hosts are lightly loaded. This is the default queue.

    PARAMETERS/STATISTICS
    PRIO  NICE  STATUS         MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SSUSP  USUSP  RSV
    30    20    Closed:Active  -    -     -     -     0      0     0    0      0      0
    Interval for a host to accept two jobs is 0 seconds

    THREADLIMIT
    7

    SCHEDULING PARAMETERS
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -

    JOB EXCEPTION PARAMETERS
               OVERRUN(min)  UNDERRUN(min)  IDLE(cputime/runtime)
    Threshold  -             2              -
    Jobs       -             0              -

    USERS: all users
    HOSTS: all
    RES_REQ: select[type==any]
    ADMIN ACTION COMMENT: "change configuration"

Dispatch Windows
A dispatch window specifies one or more time periods during which batch jobs are dispatched to run on hosts. Jobs are not dispatched outside of configured windows. Dispatch windows do not affect job submission or running jobs (running jobs are allowed to run until completion). By default, dispatch windows are not configured and queues are always Active.
To configure a dispatch window:
- Edit lsb.queues.
- Create a DISPATCH_WINDOW keyword for the queue and specify one or more time windows. For example:

    Begin Queue
    QUEUE_NAME      = queue1
    PRIORITY        = 45
    DISPATCH_WINDOW = 4:30-12:00
    End Queue

- Reconfigure the cluster using badmin reconfig.
- Run bqueues -l to display the dispatch windows.
Run Windows
A run window specifies one or more time periods during which jobs dispatched from a queue are allowed to run. When a run window closes, running jobs are suspended, and pending jobs remain pending. The suspended jobs are resumed when the window opens again. By default, run windows are not configured, so queues are always Active and jobs can run until completion.
To configure a run window:
- Edit lsb.queues.
- Create a RUN_WINDOW keyword for the queue and specify one or more time windows. For example:

    Begin Queue
    QUEUE_NAME = queue1
    PRIORITY   = 45
    RUN_WINDOW = 4:30-12:00
    End Queue

- Reconfigure the cluster using badmin reconfig.
- Run bqueues -l to display the run windows.
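Dispatch and run windows are written in the same hour:minute-hour:minute form in the examples in this document, so the check "is the window open now?" can be sketched once for both. The code below is a simplified model, not LSF's own logic: the function names are hypothetical, only the hour:minute form is handled, the start of a window is treated as inclusive and the end as exclusive (an assumption), and a window whose end is earlier than its start is taken to wrap past midnight.

```python
from datetime import time


def parse_clock(text):
    """Convert "4:30" (or "4") into a datetime.time."""
    hour, _, minute = text.partition(":")
    return time(int(hour), int(minute or 0))


def in_windows(windows, now):
    """Return True if `now` falls inside any of the space-separated windows."""
    if not windows.strip():
        return True  # no windows configured: the queue is always Active
    for window in windows.split():
        start_text, end_text = window.split("-")
        start, end = parse_clock(start_text), parse_clock(end_text)
        if start <= end:
            inside = start <= now < end
        else:
            # The window wraps past midnight, for example 23:30-1:30.
            inside = now >= start or now < end
        if inside:
            return True
    return False


print(in_windows("2:40-23:00 23:30-1:30", time(0, 15)))
```

With the RUN_WINDOWS value from the earlier bqueues -l example, 00:15 falls inside the second window, while 23:15 falls in the gap between the two windows.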
Adding and Removing Queues
Adding a queue
- Log in as the LSF administrator on any host in the cluster.
- Edit lsb.queues to add the new queue definition.
  You can copy another queue definition from this file as a starting point; remember to change the QUEUE_NAME of the copied queue.
- Save the changes to lsb.queues.
- Run badmin reconfig to reconfigure mbatchd.
Adding a queue does not affect pending or running jobs.
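The copy-and-rename step can be sketched as a small text transformation. This is only an illustration under stated assumptions: the function name is hypothetical, the file is assumed to hold one parameter per line between Begin Queue and End Queue lines, and the sketch works on a string rather than editing lsb.queues in place.

```python
import re

STANZA_RE = re.compile(r"^Begin Queue\b.*?^End Queue\b", flags=re.S | re.M)
NAME_RE = re.compile(r"^(\s*QUEUE_NAME\s*=\s*)(\S+)", flags=re.M)


def copy_queue(lsb_queues_text, source_name, new_name):
    """Append a copy of the `source_name` stanza with QUEUE_NAME set to `new_name`."""
    stanzas = STANZA_RE.findall(lsb_queues_text)
    existing = []
    for stanza in stanzas:
        match = NAME_RE.search(stanza)
        if match:
            existing.append(match.group(2))
    if new_name in existing:
        raise ValueError(f"queue {new_name!r} is already defined")
    for stanza in stanzas:
        match = NAME_RE.search(stanza)
        if match and match.group(2) == source_name:
            # Rename only the QUEUE_NAME line of the copied stanza.
            renamed = NAME_RE.sub(lambda m: m.group(1) + new_name, stanza, count=1)
            return lsb_queues_text.rstrip("\n") + "\n\n" + renamed + "\n"
    raise KeyError(source_name)


sample = """\
Begin Queue
QUEUE_NAME = queue1
PRIORITY   = 45
End Queue
"""

print(copy_queue(sample, "queue1", "queue2"))
```

Rejecting a name that already exists guards against the mistake the procedure warns about, a copied stanza that keeps its original QUEUE_NAME.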
Removing a queue
Before removing a queue, make sure there are no jobs in that queue.
If there are jobs in the queue, move pending and running jobs to another queue, then remove the queue. If you remove a queue that has jobs in it, the jobs are temporarily moved to a queue named lost_and_found. Jobs in the lost_and_found queue remain pending until the user or the LSF administrator uses the bswitch command to switch the jobs into regular queues. Jobs in other queues are not affected.
- Log in as the LSF administrator on any host in the cluster.
- Close the queue to prevent any new jobs from being submitted. For example:

    % badmin qclose night
    Queue <night> is closed

- Move all pending and running jobs into another queue. Below, the bswitch -q night argument chooses jobs from the night queue, and the job ID number 0 specifies that all jobs should be switched:

    % bjobs -u all -q night
    JOBID  USER   STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
    5308   user5  RUN   night  hostA      hostD      job5      Nov 21 18:16
    5310   user5  PEND  night  hostA      hostC      job10     Nov 21 18:17

    % bswitch -q night idle 0
    Job <5308> is switched to queue <idle>
    Job <5310> is switched to queue <idle>

- Edit lsb.queues and remove or comment out the definition for the queue being removed.
- Save the changes to lsb.queues.
- Run badmin reconfig to reconfigure mbatchd.
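The command-line steps of this procedure can be collected into a short helper so that they are always issued in the same order. The sketch below only builds and prints the argument lists; it does not run anything. The function name is hypothetical, and editing lsb.queues remains a manual step between the bswitch and badmin reconfig commands.

```python
import shlex


def removal_commands(queue, target_queue):
    """Return the commands for draining `queue` into `target_queue`, in order."""
    if queue == target_queue:
        raise ValueError("jobs must be switched to a different queue")
    return [
        ["badmin", "qclose", queue],                  # stop new submissions
        ["bjobs", "-u", "all", "-q", queue],          # list the jobs still in the queue
        ["bswitch", "-q", queue, target_queue, "0"],  # job ID 0 selects all jobs
        # Edit lsb.queues by hand here, then reconfigure mbatchd:
        ["badmin", "reconfig"],
    ]


for argv in removal_commands("night", "idle"):
    print("%", shlex.join(argv))
```

Keeping the commands as argument lists rather than strings means they could later be passed to a process launcher without any shell quoting concerns.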
Managing Queues
Restricting host use by queues
You may want a host to be used only to run jobs submitted to specific queues. For example, if you just added a host for a specific department such as engineering, you may only want jobs submitted to the queues engineering1 and engineering2 to be able to run on the host.
- Log on as root or the LSF administrator on any host in the cluster.
- Edit lsb.queues, and add the host to the HOSTS parameter of specific queues.

    Begin Queue
    QUEUE_NAME = queue1
    ...
    HOSTS=mynewhost hostA hostB
    ...
    End Queue

- Save the changes to lsb.queues.
- Use badmin ckconfig to check the new queue definition. If any errors are reported, fix the problem and check the configuration again.
- Run badmin reconfig to reconfigure mbatchd.
- If you add a host to a queue, the new host will not be recognized by jobs that were submitted before you reconfigured. If you want the new host to be recognized, you must use the command badmin mbdrestart. For more details on badmin mbdrestart, see Reconfiguring Your Cluster.
Adding queue administrators
Queue administrators are optionally configured after installation. They have limited privileges; they can perform administrative operations (open, close, activate, inactivate) on the specified queue, or on jobs running in the specified queue. Queue administrators cannot modify configuration files, or operate on LSF daemons or on queues they are not configured to administer.
To switch a job from one queue to another, you must have administrator privileges for both queues.
In the lsb.queues file, between Begin Queue and End Queue for the appropriate queue, specify the ADMINISTRATORS parameter, followed by the list of administrators for that queue. Separate the administrator names with a space. You can specify user names and group names. For example:

    Begin Queue
    ADMINISTRATORS = User1 GroupA
    End Queue
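The rule that switching a job requires administrator privileges for both queues can be sketched as a lookup over the ADMINISTRATORS lists. Everything named here is hypothetical: the two dictionaries stand in for the parsed ADMINISTRATORS parameters and for user group membership, and the sketch models queue administrators only, ignoring the cluster-wide LSF administrator and root.

```python
def is_queue_admin(user, queue, queue_admins, groups):
    """True if `user` is listed for `queue` directly or through a listed group."""
    for entry in queue_admins.get(queue, ()):
        if entry == user or user in groups.get(entry, ()):
            return True
    return False


def can_switch_job(user, source, destination, queue_admins, groups):
    """A queue administrator needs privileges on both queues to switch a job."""
    return all(
        is_queue_admin(user, queue, queue_admins, groups)
        for queue in (source, destination)
    )


# Hypothetical data: ADMINISTRATORS = User1 GroupA on normal, GroupA on night.
queue_admins = {"normal": ["User1", "GroupA"], "night": ["GroupA"]}
groups = {"GroupA": {"user7"}}

print(can_switch_job("user7", "normal", "night", queue_admins, groups))
print(can_switch_job("User1", "normal", "night", queue_admins, groups))
```

In this example user7 administers both queues through GroupA, while User1 is listed only for normal and so cannot switch jobs into night.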
Handling Job Exceptions
You can configure queues so that LSF detects exceptional conditions while jobs are running, and takes appropriate action automatically. You can customize what exceptions are detected, and the corresponding actions. By default, LSF does not detect any exceptions.
eadmin script
When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host. You can customize eadmin to suit the requirements of your site. For example, in some environments, a job running 1 hour would be an overrun job, while this may be a normal job in other environments. If your configuration considers jobs running longer than 1 hour to be overrun jobs, you may want to close the queue when LSF detects a job that has run longer than 1 hour and invokes eadmin. Alternatively, eadmin could find out the owner of the problem jobs and use bstop -u to stop all jobs that belong to the user.
Job exceptions LSF can detect
If you configure exception handling, LSF detects the following job exceptions:
- Job underrun--jobs end too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally.
- Job overrun--job runs too long (run time is longer than expected)
  By default, LSF checks for overrun jobs every 5 minutes. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.
- Idle job--running job consumes less CPU time than expected (in terms of CPU time/runtime)
  By default, LSF checks for idle jobs every 5 minutes. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.
Default eadmin actions
LSF sends email to the LSF administrator. The email contains the job ID, exception type (overrun, underrun, idle job), and other job information.
An email is sent for all detected job exceptions according to the frequency configured by EADMIN_TRIGGER_DURATION in lsb.params. For example, if EADMIN_TRIGGER_DURATION is set to 10 minutes, and 1 overrun job and 2 idle jobs are detected, after 10 minutes, eadmin is invoked and only one email is sent. If another overrun job is detected in the next 10 minutes, another email is sent.
Configuring job exception handling (lsb.queues)
You can configure your queues to detect job exceptions. Use the following parameters:
JOB_IDLE
    Specifies a threshold for idle jobs. The value should be a number between 0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the specified threshold, LSF invokes eadmin to trigger the action for a job idle exception.
JOB_OVERRUN
    Specifies a threshold for job overrun. If a job runs longer than the specified run time, LSF invokes eadmin to trigger the action for a job overrun exception.
JOB_UNDERRUN
    Specifies a threshold for job underrun. If a job exits before the specified number of minutes, LSF invokes eadmin to trigger the action for a job underrun exception.
The following queue defines thresholds for all job exceptions:

    Begin Queue
    ...
    JOB_UNDERRUN = 2
    JOB_OVERRUN  = 5
    JOB_IDLE     = 0.10
    ...
    End Queue

For this queue:
- A job underrun exception is triggered for jobs running less than 2 minutes
- A job overrun exception is triggered for jobs running longer than 5 minutes
- A job idle exception is triggered for jobs with an idle factor (CPU time/runtime) less than 0.10
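The three rules for this queue can be written out as a small classifier. This is a sketch of the comparisons described above, not LSF's implementation: the function and parameter names are hypothetical, times are in minutes, and it assumes that underrun is evaluated for finished jobs while overrun and idle are evaluated for jobs that are still running.

```python
def job_exceptions(run_minutes, cpu_minutes, finished,
                   underrun=None, overrun=None, idle=None):
    """Return the exception names that apply to one job."""
    found = []
    if finished:
        # Underrun: the job exited before the configured number of minutes.
        if underrun is not None and run_minutes < underrun:
            found.append("underrun")
        return found
    # Overrun: the job has run longer than the configured run time.
    if overrun is not None and run_minutes > overrun:
        found.append("overrun")
    # Idle: the idle factor (CPU time / run time) is below the threshold.
    if idle is not None and run_minutes > 0 and cpu_minutes / run_minutes < idle:
        found.append("idle")
    return found


limits = {"underrun": 2, "overrun": 5, "idle": 0.10}
print(job_exceptions(1, 0.5, finished=True, **limits))
print(job_exceptions(8, 0.4, finished=False, **limits))
```

A running job can trigger more than one exception at once: the second call describes a job that has both exceeded 5 minutes and used only 0.05 of its run time as CPU time.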
Configuring thresholds for job exception handling
EADMIN_TRIGGER_DURATION (lsb.params)
By default, LSF checks for job exceptions every 5 minutes. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for overrun, underrun, and idle jobs.
Tune EADMIN_TRIGGER_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.
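To get a feel for how the setting affects notification volume, the batching behaviour described under Default eadmin actions can be modelled roughly. This sketch assumes that check intervals are aligned at minute 0 and that one email is sent for each interval in which at least one exception was detected; the function name is hypothetical and the model is an approximation for reasoning about the setting, not a description of LSF internals.

```python
def count_eadmin_emails(detection_minutes, trigger_duration=5):
    """Count the intervals of `trigger_duration` minutes that contain a detection."""
    intervals = {int(minute // trigger_duration) for minute in detection_minutes}
    return len(intervals)


# One overrun and two idle jobs detected within the first 10 minutes,
# then another overrun job in the following 10 minutes.
print(count_eadmin_emails([1, 3, 7], trigger_duration=10))
print(count_eadmin_emails([1, 3, 7, 14], trigger_duration=10))
```

Under the same assumptions, the default of 5 minutes would split the first three detections across two intervals, which illustrates why shorter values produce more frequent notifications.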
Date Modified: January 12, 2004
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.