Chunk Job Dispatch



About Job Chunking

LSF supports job chunking, where jobs with similar resource requirements submitted by the same user are grouped together for dispatch. The CHUNK_JOB_SIZE parameter in lsb.queues specifies the maximum number of jobs allowed to be dispatched together in a chunk job.

Job chunking can have the following advantages:

All of the jobs in the chunk are dispatched as a unit rather than individually. Job execution is sequential, but each chunk job member is not necessarily executed in the order it was submitted.

Chunk job candidates

Jobs with the following characteristics are typical candidates for job chunking:

Running jobs with these characteristics in normal queues can under-utilize resources because LSF spends more time scheduling and dispatching the jobs than actually running them.

Configuring a special high-priority queue for short jobs is not desirable because users may be tempted to send all of their jobs to this queue, knowing that it has high priority.



Configuring a Chunk Job Dispatch

CHUNK_JOB_SIZE (lsb.queues)

To configure a queue to dispatch chunk jobs, specify the CHUNK_JOB_SIZE parameter in the queue definition in lsb.queues.

For example, the following configures a queue named chunk, which dispatches up to 4 jobs in a chunk:

Begin Queue
QUEUE_NAME     = chunk
PRIORITY       = 50
CHUNK_JOB_SIZE = 4
End Queue

After adding CHUNK_JOB_SIZE to lsb.queues, use badmin reconfig to reconfigure your cluster.

By default, CHUNK_JOB_SIZE is not enabled.

Chunk jobs and job throughput

Throughput can deteriorate if the chunk job size is too big. Performance may decrease on queues with CHUNK_JOB_SIZE greater than 30. You should evaluate the chunk job size on your own systems for best performance.

CHUNK_JOB_DURATION (lsb.params)

If CHUNK_JOB_DURATION is set in lsb.params, a job submitted to a chunk job queue is chunked only if it has a CPU limit or run limit that is less than or equal to the value of CHUNK_JOB_DURATION. The limit can be set in the queue (CPULIMIT or RUNLIMIT) or specified at job submission (bsub -c or -W options).

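Jobs are not chunked if they have no CPU limit or run limit, or if the limit is greater than the value of CHUNK_JOB_DURATION.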

The value of CHUNK_JOB_DURATION is displayed by bparams -l.

After adding CHUNK_JOB_DURATION to lsb.params, use badmin reconfig to reconfigure your cluster.

By default, CHUNK_JOB_DURATION is not enabled.
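
The CHUNK_JOB_DURATION test can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not LSF code: the function name and arguments are hypothetical, all values are assumed to be in the same unit (minutes), and a job that has both a CPU limit and a run limit is not modeled.

```python
from typing import Optional


def is_chunk_candidate(
    chunk_job_duration: Optional[float],
    limit: Optional[float],
) -> bool:
    """Return True if a job in a chunk job queue passes the duration test.

    chunk_job_duration: value of CHUNK_JOB_DURATION, or None if it is not set.
    limit: the job's CPU limit or run limit, whether it came from the queue
           (CPULIMIT, RUNLIMIT) or from bsub (-c, -W); None if no limit is set.
    """
    if chunk_job_duration is None:
        # CHUNK_JOB_DURATION is not enabled, so the duration test does not apply.
        return True
    if limit is None:
        # With CHUNK_JOB_DURATION set, a job without a limit is not chunked.
        return False
    # The limit must be less than or equal to CHUNK_JOB_DURATION.
    return limit <= chunk_job_duration
```

For example, with CHUNK_JOB_DURATION set to 10, a job submitted with a run limit of 5 passes the test, while a job with a run limit of 15 or with no limit at all is dispatched individually.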

Restrictions on chunk job queues

CHUNK_JOB_SIZE is ignored and jobs are not chunked for the following queues:

Jobs submitted with the corresponding bsub options are not chunked; they are dispatched individually:



Submitting and Controlling Chunk Jobs

When a job is submitted to a queue configured with the CHUNK_JOB_SIZE parameter, LSF attempts to place the job in an existing chunk. A job is added to an existing chunk if it has the same characteristics as the first job in the chunk:

If a suitable host is found to run the job, but there is no chunk available with the same characteristics, LSF creates a new chunk.
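
The placement logic can be pictured with a short Python sketch. It is a simplified model, not the scheduler's implementation: the Job class and its fields are hypothetical, the matching key (user and resource requirement string) is an assumption based on the description of job chunking at the start of this chapter, and host selection is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    job_id: int
    user: str
    res_req: str  # resource requirement string (hypothetical field)


def place_in_chunks(jobs: List[Job], chunk_job_size: int) -> List[List[Job]]:
    """Group jobs into chunks of at most chunk_job_size members."""
    chunks: List[List[Job]] = []
    for job in jobs:
        for chunk in chunks:
            first = chunk[0]
            # Assumption: "same characteristics" means same user and same
            # resource requirements as the first job in the chunk.
            same = (first.user, first.res_req) == (job.user, job.res_req)
            if same and len(chunk) < chunk_job_size:
                chunk.append(job)
                break
        else:
            # No existing chunk matches or has room, so start a new one.
            chunks.append([job])
    return chunks
```

With a chunk size of 3, five jobs from one user with identical requirements end up in two chunks (three members and two members), and a job from a different user always starts a chunk of its own.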

Resources reserved for any member of the chunk are reserved at the time the chunk is dispatched and held until the whole chunk finishes running. Other jobs requiring the same resources are not dispatched until the chunk job is done.

For example, if all jobs in the chunk require a software license, the license is checked out and each chunk job member uses it in turn. The license is not released until the last chunk job member is finished running.
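
To see how long a reserved resource stays held, the following Python sketch adds up the run times of the chunk job members. It is an illustration only: the function name is hypothetical, and it assumes the members run back to back with no gap between them.

```python
from itertools import accumulate
from typing import List, Tuple


def chunk_timeline(run_minutes: List[float]) -> Tuple[List[float], float]:
    """Return (finish time of each member, time the resource is released).

    Times are measured in minutes from the moment the chunk is dispatched.
    """
    if not run_minutes:
        raise ValueError("a chunk job has at least one member")
    # Members run one after another, so finish times are running totals.
    finish_times = list(accumulate(run_minutes))
    # The resource is held until the last member finishes.
    return finish_times, finish_times[-1]
```

For members that run 2, 1, and 3 minutes, the first member is done after 2 minutes, but the license stays checked out for the full 6 minutes.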

WAIT status

When sbatchd receives a chunk job, it does not start all member jobs at once. A chunk job occupies a single job slot. Even if other slots are available, the chunk job members must run one at a time in the job slot they occupy. The remaining jobs in the chunk that are waiting to run are displayed as WAIT by bjobs. Jobs in WAIT status are included in the count of pending jobs by bqueues and busers. The bhosts command counts the entire chunk job as a single job slot in the NJOBS column.
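
The accounting described above can be summarized in a small Python sketch. The function names and the keys of the returned dictionary are hypothetical; the sketch only restates how a freshly dispatched chunk is counted, and it does not call any LSF command.

```python
from collections import Counter
from typing import Dict, List


def chunk_member_states(members: int) -> List[str]:
    """States shown by bjobs right after a chunk of this size is dispatched."""
    if members < 1:
        raise ValueError("a chunk job has at least one member")
    # One member runs; the others wait for the same job slot.
    return ["RUN"] + ["WAIT"] * (members - 1)


def summarize(states: List[str]) -> Dict[str, int]:
    """Summarize how the chunk is counted by the reporting commands."""
    counts = Counter(states)
    return {
        "slots_used": 1,                       # whole chunk job: one job slot
        "running": counts["RUN"],
        "counted_as_pending": counts["WAIT"],  # as counted by bqueues, busers
    }
```

A chunk of four jobs therefore uses one slot, with one job running and three jobs counted as pending.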

The bhist -l command shows jobs in WAIT status as Waiting ...

The bjobs -l command does not display a WAIT reason in the list of pending jobs.

Controlling chunk jobs

Job controls affect the state of the members of a chunk job. You can perform the following actions on jobs in a chunk job:

Action (Command)        Job State  Effect on Job (State)
Suspend (bstop)         PEND       Removed from chunk (PSUSP)
                        RUN        All jobs in the chunk are suspended (NRUN -1, NSUSP +1)
                        USUSP      No change
                        WAIT       Removed from chunk (PSUSP)
Kill (bkill)            PEND       Removed from chunk (NJOBS -1, PEND -1)
                        RUN        Job finishes, next job in the chunk starts if one exists (NJOBS -1, PEND -1)
                        USUSP      Job finishes, next job in the chunk starts if one exists (NJOBS -1, PEND -1, SUSP -1, RUN +1)
                        WAIT       Job finishes (NJOBS -1, PEND -1)
Resume (bresume)        USUSP      Entire chunk is resumed (RUN +1, USUSP -1)
Migrate (bmig)          WAIT       Removed from chunk
Switch queue (bswitch)  RUN        Job is removed from the chunk and switched; all other WAIT jobs are requeued to PEND
                        WAIT       Only the WAIT job is removed from the chunk and switched, and requeued to PEND
Checkpoint (bchkpnt)    RUN        Job is checkpointed normally
Modify (bmod)           PEND       Removed from the chunk to be scheduled later

Migrating jobs with bmig will change the dispatch sequence of the chunk job members. They will not be redispatched in the order they were originally submitted.
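
For scripts that need to look up the effect of a control action, the table above can be captured as a Python dictionary. This is a convenience sketch with hypothetical names; the entries are copied from the table, and any combination that the table does not list is reported as None rather than guessed.

```python
from typing import Dict, Optional, Tuple

# (command, job state) -> effect on the job, as listed in the table.
CHUNK_CONTROL_EFFECTS: Dict[Tuple[str, str], str] = {
    ("bstop", "PEND"): "Removed from chunk (PSUSP)",
    ("bstop", "RUN"): "All jobs in the chunk are suspended",
    ("bstop", "USUSP"): "No change",
    ("bstop", "WAIT"): "Removed from chunk (PSUSP)",
    ("bkill", "PEND"): "Removed from chunk",
    ("bkill", "RUN"): "Job finishes, next job in the chunk starts if one exists",
    ("bkill", "USUSP"): "Job finishes, next job in the chunk starts if one exists",
    ("bkill", "WAIT"): "Job finishes",
    ("bresume", "USUSP"): "Entire chunk is resumed",
    ("bmig", "WAIT"): "Removed from chunk",
    ("bswitch", "RUN"): "Job is removed from the chunk and switched; "
                        "all other WAIT jobs are requeued to PEND",
    ("bswitch", "WAIT"): "Only the WAIT job is removed from the chunk "
                         "and switched, and requeued to PEND",
    ("bchkpnt", "RUN"): "Job is checkpointed normally",
    ("bmod", "PEND"): "Removed from the chunk to be scheduled later",
}


def control_effect(command: str, state: str) -> Optional[str]:
    """Return the listed effect, or None if the table has no such row."""
    return CHUNK_CONTROL_EFFECTS.get((command, state.upper()))
```

For instance, control_effect("bstop", "WAIT") returns "Removed from chunk (PSUSP)", while control_effect("bmig", "RUN") returns None because the table has no row for that combination.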

Rerunnable chunk jobs

If the execution host becomes unavailable, rerunnable chunk job members are removed from the queue and dispatched to a different execution host.

See Job Requeue and Job Rerun for more information about rerunnable jobs.

Checkpointing chunk jobs

Only running chunk jobs can be checkpointed. If bchkpnt -k is used, the job is also killed after the checkpoint file has been created. If a chunk job in WAIT state is checkpointed, mbatchd rejects the checkpoint request.

See Job Checkpoint, Restart, and Migration for more information about checkpointing jobs.

Fairshare policies and chunk jobs

Fairshare queues can use job chunking. Jobs are accumulated in the chunk job so that priority is assigned to jobs correctly according to the fairshare policy that applies to each user. Jobs belonging to other users are dispatched in other chunks.

TERMINATE_WHEN job control action

If the TERMINATE_WHEN job control action is applied to a chunk job, sbatchd kills the chunk job element that is running and puts the rest of the waiting elements into pending state to be rescheduled later.
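
As a rough model of this behavior, the Python sketch below maps the state of each chunk job element before the action to its state afterwards. The function name is hypothetical, and "KILLED" is a placeholder label used only in this sketch, not an LSF job state.

```python
from typing import List


def apply_terminate_when(states: List[str]) -> List[str]:
    """Return the state of each chunk job element after TERMINATE_WHEN."""
    result = []
    for state in states:
        if state == "RUN":
            # The running element is killed by sbatchd.
            result.append("KILLED")  # placeholder label, not an LSF state
        elif state == "WAIT":
            # Waiting elements go back to pending to be rescheduled later.
            result.append("PEND")
        else:
            # Other states are outside the scope of this sketch.
            result.append(state)
    return result
```

Applied to a chunk whose elements are RUN, WAIT, WAIT, the sketch yields KILLED, PEND, PEND.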

Enforcing resource usage limits on chunk jobs

By default, resource usage limits are not enforced for chunk jobs because chunk jobs are typically too short to allow LSF to collect resource usage.

To enforce resource limits for chunk jobs, define LSB_CHUNK_RUSAGE=Y in lsf.conf. Limits may not be enforced for chunk jobs that take less than a minute to run.

See Runtime Resource Usage Limits for more information.



      Date Modified: January 12, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.