


Working with Queues




Queue States

Queue states, displayed by bqueues, describe the ability of a queue to accept and start batch jobs using a combination of the following states:

State           Description
Open:Active     Accepts and starts new jobs--normal processing
Open:Inact      Accepts and holds new jobs--collecting
Closed:Active   Does not accept new jobs, but continues to start jobs--draining
Closed:Inact    Does not accept new jobs and does not start jobs--all activity is stopped

Queue state can be changed by an LSF administrator or root.

Queues can also be activated and inactivated by run and dispatch windows (configured in lsb.queues, displayed by bqueues -l).

bqueues -l displays Inact_Adm when the queue has been explicitly inactivated by an administrator (badmin qinact), and Inact_Win when it has been inactivated by a run or dispatch window.
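The two-part state can be read as two independent switches: whether the queue accepts jobs and whether it starts them. The following Python sketch illustrates that reading. The names QueueState and parse_queue_state are hypothetical, and the assumption that Inact_Adm and Inact_Win appear after the colon in the same position as Inact is not confirmed by this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueState:
    accepts_jobs: bool  # Open (True) or Closed (False)
    starts_jobs: bool   # Active (True) or any Inact variant (False)


def parse_queue_state(status: str) -> QueueState:
    """Split a STATUS value such as 'Open:Active' into its two parts."""
    admission, _, dispatch = status.partition(":")
    if admission not in ("Open", "Closed"):
        raise ValueError(f"unexpected admission state: {admission!r}")
    # Assumption: Inact, Inactive, Inact_Adm and Inact_Win all mean "inactive".
    if dispatch != "Active" and not dispatch.startswith("Inact"):
        raise ValueError(f"unexpected dispatch state: {dispatch!r}")
    return QueueState(admission == "Open", dispatch == "Active")


# A draining queue: no new submissions, but jobs still start.
print(parse_queue_state("Closed:Active"))
```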



Viewing Queue Information

The bqueues command displays information about queues. The -l option also gives current statistics about the jobs in a particular queue, such as the total number of jobs in the queue and the number of jobs that are running, suspended, and so on.

To view the...                     Run...
Available queues                   bqueues
Queue status                       bqueues
Detailed queue information         bqueues -l
State change history of a queue    badmin qhist
Queue administrators               bqueues -l for queue

In addition to the procedures listed here, see the bqueues(1) man page for more details.

Viewing available queues and queue status

Run bqueues. You can view the current status of a particular queue or all queues. The bqueues command also displays available queues in the cluster.

bqueues
QUEUE_NAME   PRIO  STATUS        MAX JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
interactive  400   Open:Active   -   -    -    -    2      0     2    0
priority     43    Open:Active   -   -    -    -    16     4     11   1
night        40    Open:Inactive -   -    -    -    4      4     0    0
short        35    Open:Active   -   -    -    -    6      1     5    0
license      33    Open:Active   -   -    -    -    0      0     0    0
normal       30    Open:Active   -   -    -    -    0      0     0    0
idle         20    Open:Active   -   -    -    -    6      3     1    2

A dash (-) in any entry means that the column does not apply to the row. In this example, no queue has per-queue, per-user, per-processor, or per-host job limits configured, so the MAX, JL/U, JL/P, and JL/H entries are shown as dashes.
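For scripting, the summary table can be turned into one record per queue. The following Python sketch assumes the whitespace-separated layout shown in the example above and treats a dash as "no limit configured". The function name parse_bqueues is hypothetical, and the sample text is copied from the example rather than read from a live cluster.

```python
SAMPLE = """\
QUEUE_NAME   PRIO  STATUS        MAX JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
interactive  400   Open:Active   -   -    -    -    2      0     2    0
priority     43    Open:Active   -   -    -    -    16     4     11   1
night        40    Open:Inactive -   -    -    -    4      4     0    0
"""

TEXT_COLUMNS = ("QUEUE_NAME", "STATUS")


def parse_bqueues(text):
    """Return a list of dicts, one per queue row in bqueues summary output."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip anything that does not look like a table row
        row = {}
        for name, value in zip(header, fields):
            if name in TEXT_COLUMNS:
                row[name] = value
            else:
                # A dash means the column does not apply to this queue.
                row[name] = None if value == "-" else int(value)
        rows.append(row)
    return rows


for queue in parse_bqueues(SAMPLE):
    if queue["PEND"]:
        print(queue["QUEUE_NAME"], "has", queue["PEND"], "pending jobs")
```

In practice the text would come from running bqueues, for example through subprocess, instead of the embedded sample.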

Viewing detailed queue information

To see the complete status and configuration for each queue, run bqueues -l. You can specify queue names on the command line to select specific queues. In the example below, more detail is requested for the queue normal.

bqueues -l normal
QUEUE: normal
  --For normal low priority jobs, running only if hosts are lightly loaded. 
This is the default queue.
PARAMETERS/STATISTICS
PRIO NICE  STATUS      MAX JL/U JL/P NJOBS  PEND  RUN SSUSP USUSP
40   20    Open:Active 100 50   11   1      1     0   0     0
Migration threshold is 30 min.

CPULIMIT           RUNLIMIT
20 min of IBM350   342800 min of IBM350

FILELIMIT  DATALIMIT  STACKLIMIT  CORELIMIT  MEMLIMIT  PROCLIMIT
20000 K    20000 K    2048 K      20000 K    5000 K    3

SCHEDULING PARAMETERS
           r15s  r1m  r15m  ut   pg   io   ls  it  tmp  swp  mem
loadSched  -     0.7  1.0   0.2  4.0  50   -   -   -    -    -
loadStop   -     1.5  2.5   -    8.0  240  -   -   -    -    -

SCHEDULING POLICIES:  FAIRSHARE  PREEMPTIVE PREEMPTABLE EXCLUSIVE
USER_SHARES:  [groupA, 70] [groupB, 15]  [default, 1]

DEFAULT HOST SPECIFICATION : IBM350

RUN_WINDOWS:  2:40-23:00 23:30-1:30
DISPATCH_WINDOWS:  1:00-23:50

USERS: groupA/ groupB/ user5
HOSTS:  hostA, hostD, hostB
ADMINISTRATORS:  user7
PRE_EXEC: /tmp/apex_pre.x > /tmp/preexec.log 2>&1
POST_EXEC:  /tmp/apex_post.x > /tmp/postexec.log 2>&1
REQUEUE_EXIT_VALUES:  45

Viewing the state change history of a queue

Run badmin qhist to display the times when queues are opened, closed, activated, and inactivated.

% badmin qhist
Wed Mar 31 09:03:14: Queue <normal> closed by user or 
administrator <root>.
Wed Mar 31 09:03:29: Queue <normal> opened by user or 
administrator <root>.

Viewing queue administrators

Run bqueues -l for the queue. The ADMINISTRATORS field of the output lists the queue administrators.

Viewing exception status for queues (bqueues)

Use bqueues -l to display the configured thresholds for job exceptions and the current number of jobs in the queue in each exception state.

For example, the queue normal configures a JOB_IDLE threshold of 0.10, a JOB_OVERRUN threshold of 5 minutes, and a JOB_UNDERRUN threshold of 2 minutes. The following bqueues -l output shows no overrun jobs, one job that finished in less than 2 minutes (underrun), and one job that triggered an idle exception (idle factor less than 0.10):

% bqueues -l normal

QUEUE: normal
  -- For normal low priority jobs, running only if hosts are lightly loaded.  
This is the default queue.

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV 
 30   20  Open:Active       -    -    -    -     0     0     0     0     0    0

 STACKLIMIT MEMLIMIT
   2048 K     5000 K

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

JOB EXCEPTION PARAMETERS 
             OVERRUN(min) UNDERRUN(min) IDLE(cputime/runtime)
 Threshold         5         2          0.10
      Jobs         0         1             1

USERS:  all users
HOSTS:  all allremote 
CHUNK_JOB_SIZE: 3
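A monitoring script might want the exception thresholds and counts as data. The following Python sketch extracts the JOB EXCEPTION PARAMETERS block from output laid out like the example above. The function name parse_exception_block is hypothetical, and the three-line layout (column names, Threshold, Jobs) is assumed from the sample only.

```python
SAMPLE = """\
JOB EXCEPTION PARAMETERS
             OVERRUN(min) UNDERRUN(min) IDLE(cputime/runtime)
 Threshold         5         2          0.10
      Jobs         0         1             1
"""


def parse_exception_block(text):
    """Map each exception name to its configured threshold and job count."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    start = next(
        i for i, fields in enumerate(lines)
        if fields[:3] == ["JOB", "EXCEPTION", "PARAMETERS"]
    )
    # Column names look like OVERRUN(min); keep only the part before "(".
    names = [name.split("(")[0] for name in lines[start + 1]]
    result = {name: {} for name in names}
    for fields in lines[start + 2:start + 4]:
        label, values = fields[0].lower(), fields[1:]
        for name, value in zip(names, values):
            if value == "-":
                result[name][label] = None  # exception not configured
            elif label == "jobs":
                result[name][label] = int(value)
            else:
                result[name][label] = float(value)
    return result


print(parse_exception_block(SAMPLE))
```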



Controlling Queues

Queues are controlled either by an LSF administrator or root issuing a command, or through configured dispatch and run windows.

Closing a queue

Run badmin qclose:

badmin qclose normal
Queue <normal> is closed

When a user tries to submit a job to a closed queue, the following message is displayed:

bsub -q normal ...
normal: Queue has been closed

Opening a queue

Run badmin qopen:

badmin qopen normal
Queue <normal> is opened

Inactivating a queue

Run badmin qinact:

badmin qinact normal
Queue <normal> is inactivated

Activating a queue

Run badmin qact:

badmin qact normal
Queue <normal> is activated

Logging a comment when controlling a queue

Use the -C option of badmin queue commands qclose, qopen, qact, and qinact to log an administrator comment in lsb.events. For example,

% badmin qclose -C "change configuration" normal

The comment text "change configuration" is recorded in lsb.events.

A new event record is recorded for each queue event. For example:

% badmin qclose -C "add user" normal

followed by

% badmin qclose -C "add user user1" normal

generates the following records in lsb.events:

"QUEUE_CTRL" "6.0" 1050082373 1 "normal" 32185 "lsfadmin" "add user"
"QUEUE_CTRL" "6.0" 1050082380 1 "normal" 32185 "lsfadmin" "add user user1"

Use badmin hist or badmin qhist to display administrator comments for closing and opening queues. For example:

% badmin qhist
Fri Apr  4 10:50:36: Queue <normal> closed by administrator 
<lsfadmin> change configuration.

bqueues -l also displays the comment text:

% bqueues -l normal

QUEUE: normal
  -- For normal low priority jobs, running only if hosts are lightly loaded.  
This is the default queue.

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV
 30   20  Closed:Active     -    -    -    -     0     0     0     0     0    0
Interval for a host to accept two jobs is 0 seconds

 THREADLIMIT
      7

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

JOB EXCEPTION PARAMETERS
             OVERRUN(min) UNDERRUN(min) IDLE(cputime/runtime)
 Threshold         -         2             -
      Jobs         -         0             -

USERS:  all users
HOSTS:  all
RES_REQ:  select[type==any]

ADMIN ACTION COMMENT: "change configuration"
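The QUEUE_CTRL records written to lsb.events, as shown earlier in this section, can be read back with a short script. The following Python sketch splits one record into fields. The field names are guesses based on position in the sample record, not taken from a format specification, and the sample assumes the version field is a quoted string of its own. The function name parse_queue_ctrl is hypothetical.

```python
import shlex
from datetime import datetime, timezone

# Sample record modeled on this section; layout is an assumption.
RECORD = '"QUEUE_CTRL" "6.0" 1050082373 1 "normal" 32185 "lsfadmin" "add user"'

# Hypothetical field names, assigned by position.
FIELDS = ("event", "version", "time", "op_code",
          "queue", "user_id", "user_name", "comment")


def parse_queue_ctrl(line):
    """Split one QUEUE_CTRL record into a dict of named fields."""
    values = shlex.split(line)  # honors the double-quoted strings
    if len(values) != len(FIELDS) or values[0] != "QUEUE_CTRL":
        raise ValueError(f"not a QUEUE_CTRL record: {line!r}")
    record = dict(zip(FIELDS, values))
    # The third field looks like a UNIX timestamp in seconds.
    record["time"] = datetime.fromtimestamp(int(record["time"]), tz=timezone.utc)
    return record


rec = parse_queue_ctrl(RECORD)
print(rec["time"].isoformat(), rec["queue"], rec["user_name"], rec["comment"])
```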

Dispatch Windows

A dispatch window specifies one or more time periods during which batch jobs are dispatched to run on hosts. Jobs are not dispatched outside of configured windows. Dispatch windows do not affect job submission or jobs that are already running (running jobs are allowed to run until completion). By default, dispatch windows are not configured and queues are always Active.

To configure a dispatch window:

  1. Edit lsb.queues.
  2. Create a DISPATCH_WINDOW keyword for the queue and specify one or more time windows. For example:
    Begin Queue
    QUEUE_NAME   = queue1
    PRIORITY     = 45
    DISPATCH_WINDOW = 4:30-12:00
    End Queue
    
  3. Reconfigure the cluster using:
    1. lsadmin reconfig
    2. badmin reconfig
  4. Run bqueues -l to display the dispatch windows.

Run Windows

A run window specifies one or more time periods during which jobs dispatched from a queue are allowed to run. When a run window closes, running jobs are suspended, and pending jobs remain pending. The suspended jobs are resumed when the window opens again. By default, run windows are not configured, so queues are always Active and jobs can run until completion.

To configure a run window:

  1. Edit lsb.queues.
  2. Create a RUN_WINDOW keyword for the queue and specify one or more time windows. For example:
    Begin Queue
    QUEUE_NAME   = queue1
    PRIORITY     = 45
    RUN_WINDOW = 4:30-12:00
    End Queue
    
  3. Reconfigure the cluster using:
    1. lsadmin reconfig
    2. badmin reconfig
  4. Run bqueues -l to display the run windows.
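To reason about when a queue is active, it can help to test a time of day against a window specification. The following Python sketch handles only the hour:minute-hour:minute form used in the examples on this page, including windows that cross midnight such as 23:30-1:30. Treating the start as inclusive and the end as exclusive is an assumption, and the function names are hypothetical.

```python
from datetime import time


def parse_window(spec):
    """Turn 'H:MM-H:MM' into a (start, end) pair of datetime.time values."""
    start_text, end_text = spec.split("-")

    def to_time(text):
        hours, minutes = text.split(":")
        return time(int(hours), int(minutes))

    return to_time(start_text), to_time(end_text)


def in_window(now, spec):
    """Return True if the time of day 'now' falls inside one window."""
    start, end = parse_window(spec)
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, for example 23:30-1:30.
    return now >= start or now < end


def in_any_window(now, specs):
    """Check a space-separated list of windows, as in a RUN_WINDOW value."""
    return any(in_window(now, spec) for spec in specs.split())


print(in_window(time(5, 0), "4:30-12:00"))
print(in_any_window(time(0, 30), "2:40-23:00 23:30-1:30"))
```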



Adding and Removing Queues

Adding a queue

  1. Log in as the LSF administrator on any host in the cluster.
  2. Edit lsb.queues to add the new queue definition.

    You can copy another queue definition from this file as a starting point; remember to change the QUEUE_NAME of the copied queue.

  3. Save the changes to lsb.queues.
  4. Run badmin reconfig to reconfigure mbatchd.

    Adding a queue does not affect pending or running jobs.

Removing a queue

IMPORTANT

Before removing a queue, make sure there are no jobs in that queue.

If there are jobs in the queue, move pending and running jobs to another queue, then remove the queue. If you remove a queue that has jobs in it, the jobs are temporarily moved to a queue named lost_and_found. Jobs in the lost_and_found queue remain pending until the user or the LSF administrator uses the bswitch command to switch the jobs into regular queues. Jobs in other queues are not affected.

Steps

  1. Log in as the LSF administrator on any host in the cluster.
  2. Close the queue to prevent any new jobs from being submitted. For example:
    % badmin qclose night
    Queue <night> is closed
    
  3. Move all pending and running jobs into another queue. Below, the bswitch -q night argument chooses jobs from the night queue, and the job ID number 0 specifies that all jobs should be switched:
    bjobs -u all -q night
    JOBID USER  STAT  QUEUE FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    5308  user5 RUN   night hostA       hostD       job5       Nov 21 18:16
    5310  user5 PEND  night hostA       hostC       job10      Nov 21 18:17
    
    % bswitch -q night idle 0
    Job <5308> is switched to queue <idle>
    Job <5310> is switched to queue <idle>
    
  4. Edit lsb.queues and remove or comment out the definition for the queue being removed.
  5. Save the changes to lsb.queues.
  6. Run badmin reconfig to reconfigure mbatchd.
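The command-line part of the steps above (closing the queue, listing its jobs, and switching them) can be collected into a small helper. The following Python sketch only builds the argument lists; it does not run anything, and editing lsb.queues and running badmin reconfig remain manual steps. The function name queue_removal_commands is hypothetical.

```python
def queue_removal_commands(queue, target):
    """Return the commands that drain 'queue' into 'target', in order."""
    if queue == target:
        raise ValueError("target queue must differ from the queue being removed")
    return [
        # Stop new submissions to the queue.
        ["badmin", "qclose", queue],
        # List jobs from all users that are still in the queue.
        ["bjobs", "-u", "all", "-q", queue],
        # Job ID 0 selects all jobs in the queue named by -q.
        ["bswitch", "-q", queue, target, "0"],
    ]


for argv in queue_removal_commands("night", "idle"):
    print(" ".join(argv))
```

Each list could be passed to subprocess.run on a host where the LSF commands are available.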



Managing Queues

Restricting host use by queues

You may want a host to be used only to run jobs submitted to specific queues. For example, if you just added a host for a specific department such as engineering, you may only want jobs submitted to the queues engineering1 and engineering2 to be able to run on the host.

  1. Log in as root or the LSF administrator on any host in the cluster.
  2. Edit lsb.queues, and add the host to the HOSTS parameter of specific queues.
    Begin Queue
    QUEUE_NAME = queue1
    ...
    HOSTS=mynewhost hostA hostB
    ...
    End Queue
    
  3. Save the changes to lsb.queues.
  4. Use badmin ckconfig to check the new queue definition. If any errors are reported, fix the problem and check the configuration again.
  5. Run badmin reconfig to reconfigure mbatchd.
  6. If you add a host to a queue, the new host will not be recognized by jobs that were submitted before you reconfigured. If you want the new host to be recognized, you must use the command badmin mbdrestart. For more details on badmin mbdrestart, see Reconfiguring Your Cluster.

Adding queue administrators

Queue administrators are optionally configured after installation. They have limited privileges; they can perform administrative operations (open, close, activate, inactivate) on the specified queue, or on jobs running in the specified queue. Queue administrators cannot modify configuration files, or operate on LSF daemons or on queues they are not configured to administer.

To switch a job from one queue to another, you must have administrator privileges for both queues.

In the lsb.queues file, between Begin Queue and End Queue for the appropriate queue, specify the ADMINISTRATORS parameter, followed by the list of administrators for that queue. Separate the administrator names with a space. You can specify user names and group names. For example:

Begin Queue
ADMINISTRATORS = User1 GroupA
End Queue
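Before reconfiguring, it can be useful to check which hosts and administrators each queue definition names. The following Python sketch reads Begin Queue ... End Queue sections into dictionaries. It assumes one KEY = value pair per line and that # starts a comment, and it does not handle continuation lines or other lsb.queues syntax. The function name parse_queues is hypothetical.

```python
SAMPLE = """\
Begin Queue
QUEUE_NAME = queue1
HOSTS=mynewhost hostA hostB
ADMINISTRATORS = User1 GroupA
End Queue
"""


def parse_queues(text):
    """Return one dict of parameters per Begin Queue / End Queue section."""
    queues, current = [], None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        marker = " ".join(line.split()).lower()
        if marker == "begin queue":
            current = {}
        elif marker == "end queue":
            if current is not None:
                queues.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return queues


for queue in parse_queues(SAMPLE):
    print(queue["QUEUE_NAME"], "admins:", queue.get("ADMINISTRATORS", "").split())
```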



Handling Job Exceptions

You can configure queues so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.

eadmin script

When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host. You can customize eadmin to suit the requirements of your site. For example, in some environments, a job running 1 hour would be an overrun job, while this may be a normal job in other environments. If your configuration considers jobs running longer than 1 hour to be overrun jobs, you may want to close the queue when LSF detects a job that has run longer than 1 hour and invokes eadmin. Alternatively, eadmin could find out the owner of the problem jobs and use bstop -u to stop all jobs that belong to the user.

Job exceptions LSF can detect

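If you configure exception handling, LSF detects the following job exceptions: job overrun (a job runs longer than the configured run time), job underrun (a job exits before the configured number of minutes), and idle job (a job's idle factor, CPU time/runtime, is less than the configured threshold).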

Default eadmin actions

LSF sends email to the LSF administrator. The email contains the job ID, exception type (overrun, underrun, idle job), and other job information.

An email is sent for all detected job exceptions according to the frequency configured by EADMIN_TRIGGER_DURATION in lsb.params. For example, if EADMIN_TRIGGER_DURATION is set to 10 minutes, and 1 overrun job and 2 idle jobs are detected, after 10 minutes, eadmin is invoked and only one email is sent. If another overrun job is detected in the next 10 minutes, another email is sent.

Configuring job exception handling (lsb.queues)

You can configure your queues to detect job exceptions. Use the following parameters:

JOB_IDLE

Specifies a threshold for idle jobs. The value should be a number between 0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the specified threshold, LSF invokes eadmin to trigger the action for a job idle exception.

JOB_OVERRUN

Specifies a threshold for job overrun. If a job runs longer than the specified run time, LSF invokes eadmin to trigger the action for a job overrun exception.

JOB_UNDERRUN

Specifies a threshold for job underrun. If a job exits before the specified number of minutes, LSF invokes eadmin to trigger the action for a job underrun exception.
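The three thresholds can be illustrated with a small classifier. The following Python sketch applies the rules as they are worded above: overrun when the run time exceeds JOB_OVERRUN, underrun when a finished job ran for less than JOB_UNDERRUN minutes, and idle when CPU time/runtime of a running job is below JOB_IDLE. It is a model of the descriptions on this page, not of how LSF evaluates jobs internally, and the function and parameter names are hypothetical.

```python
def job_exceptions(run_min, cpu_min, finished,
                   job_overrun=None, job_underrun=None, job_idle=None):
    """Return the exception names that apply to one job.

    run_min and cpu_min are in minutes; a threshold of None means the
    corresponding exception is not configured for the queue.
    """
    found = []
    if job_overrun is not None and run_min > job_overrun:
        found.append("overrun")
    if job_underrun is not None and finished and run_min < job_underrun:
        found.append("underrun")
    if job_idle is not None and not finished and run_min > 0:
        idle_factor = cpu_min / run_min  # CPU time / runtime
        if idle_factor < job_idle:
            found.append("idle")
    return found


# Thresholds matching JOB_OVERRUN = 5, JOB_UNDERRUN = 2, JOB_IDLE = 0.10.
limits = dict(job_overrun=5, job_underrun=2, job_idle=0.10)
print(job_exceptions(1.5, 1.0, True, **limits))
print(job_exceptions(10, 0.5, False, **limits))
```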

Example

The following queue defines thresholds for all job exceptions:

Begin Queue
...
JOB_UNDERRUN = 2
JOB_OVERRUN  = 5
JOB_IDLE     = 0.10
...
End Queue

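For this queue, LSF triggers a job underrun exception for jobs that exit in less than 2 minutes, a job overrun exception for jobs that run longer than 5 minutes, and a job idle exception for jobs with an idle factor (CPU time/runtime) of less than 0.10.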

Configuring thresholds for job exception handling

EADMIN_TRIGGER_DURATION (lsb.params)

By default, LSF checks for job exceptions every 5 minutes. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for overrun, underrun, and idle jobs.


Tune EADMIN_TRIGGER_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.





      Date Modified: January 12, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.