Learn more about Platform products at http://www.platform.com


Runtime Resource Usage Limits



About Resource Usage Limits

Resource usage limits control how much resource can be consumed by running jobs. Jobs that use more than the specified amount of a resource are signalled or have their priority lowered.

Limits can be specified either at the queue level by your LSF administrator (lsb.queues) or at the job level when you submit a job.

For example, by defining a high-priority short queue, you can allow short jobs to be scheduled earlier than long jobs. To prevent some users from submitting long jobs to this short queue, you can set a CPU limit for the queue so that no job submitted to the queue can run for longer than that limit.

Limits specified at the queue level are hard limits, while those specified with job submission are soft limits. See setrlimit(2) man page for concepts of hard and soft limits.
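The soft and hard limit concepts described in setrlimit(2) can be inspected from any UNIX process. The following Python sketch is not part of LSF. It uses the standard resource module to lower a soft limit while leaving the hard limit untouched. The function name set_soft_limit_kb is hypothetical, and the KB-to-bytes conversion assumes the operating system counts the limit in bytes, as Linux does for RLIMIT_CORE.

```python
import resource

def set_soft_limit_kb(which, limit_kb):
    """Set the soft limit for `which` to limit_kb kilobytes, keeping the hard limit."""
    _old_soft, hard = resource.getrlimit(which)
    requested = limit_kb * 1024  # assumption: the OS measures this limit in bytes
    # A soft limit may never exceed the hard limit.
    if hard != resource.RLIM_INFINITY and requested > hard:
        raise ValueError("soft limit cannot exceed the hard limit")
    resource.setrlimit(which, (requested, hard))
    return resource.getrlimit(which)
```

A process can lower or raise its own soft limit up to the hard limit, which is why a soft limit set at job submission can never exceed the hard limit set for the queue.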

Resource usage limits and resource allocation limits

Resource usage limits are not the same as resource allocation limits, which are enforced during job scheduling and before jobs are dispatched. You set resource allocation limits to restrict the amount of a given resource that must be available during job scheduling for different classes of jobs to start, and which resource consumers the limits apply to.

See Resource Allocation Limits for more information.

Summary of resource usage limits

Limit                      Job syntax (bsub)   Queue syntax (lsb.queues)        Format/Units
Core file size limit       -C core_limit       CORELIMIT=limit                  integer KB
CPU time limit             -c cpu_limit        CPULIMIT=[default] maximum       [hours:]minutes[/host_name | /host_model]
Data segment size limit    -D data_limit       DATALIMIT=[default] maximum      integer KB
File size limit            -F file_limit       FILELIMIT=limit                  integer KB
Memory limit               -M mem_limit        MEMLIMIT=[default] maximum       integer KB
Process limit              -p process_limit    PROCESSLIMIT=[default] maximum   integer
Run time limit             -W run_limit        RUNLIMIT=[default] maximum       [hours:]minutes[/host_name | /host_model]
Stack segment size limit   -S stack_limit      STACKLIMIT=limit                 integer KB
Virtual memory limit       -v swap_limit       SWAPLIMIT=limit                  integer KB
Thread limit               -T thread_limit     THREADLIMIT=[default] maximum    integer

Priority of resource usage limits

If no limit is specified at job submission, then the following apply to all jobs submitted to the queue:

If ...                                                     Then ...
Both default and maximum limits are defined                The default is enforced
Only a maximum is defined                                  The maximum is enforced
No limit is specified in the queue or at job submission    No limits are enforced

Incorrect resource usage limits

Incorrect limits are ignored, and a warning message is displayed when the cluster is reconfigured or restarted. A warning message is also logged to the mbatchd log file when LSF is started.

If no limit is specified at job submission, then the following apply to all jobs submitted to the queue:

If ...                                                                        Then ...
The default limit is incorrect                                                The default is ignored and the maximum limit is enforced
Both default and maximum limits are specified, and the maximum is incorrect   The maximum is ignored and the resource has no maximum limit, only a default limit
Both default and maximum limits are incorrect                                 The default and maximum are ignored and no limit is enforced

Resource usage limits specified at job submission must not exceed the maximum specified in lsb.queues. The job submission is rejected if the user-specified limit is greater than the queue-level maximum, and the following message is issued:

Cannot exceed queue's hard limit(s). Job not submitted.
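The two tables above can be summarized in a short Python sketch. It is an illustration only, not LSF code: the function names are hypothetical, None stands for "not specified", and an "incorrect" limit is assumed to mean any value that is not a positive integer, or a default that is greater than the maximum.

```python
def is_valid(value):
    """Assumption: a correct limit is a positive integer."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0

def resolve_queue_limits(default=None, maximum=None):
    """Drop incorrect queue limits; return (default, maximum), None meaning no limit."""
    if default is not None and not is_valid(default):
        default = None
    if maximum is not None and not is_valid(maximum):
        maximum = None
    # A default greater than the maximum is ignored.
    if default is not None and maximum is not None and default > maximum:
        default = None
    return default, maximum

def enforced_limit(default=None, maximum=None):
    """Limit enforced on a job submitted without a job-level limit."""
    default, maximum = resolve_queue_limits(default, maximum)
    return default if default is not None else maximum
```

For example, enforced_limit(10, 15) returns 10 (the default is enforced), while enforced_limit(-1, 15) returns 15 because the incorrect default is ignored.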

Enforcing limits on chunk jobs

By default, resource usage limits are not enforced for chunk jobs because chunk jobs are typically too short to allow LSF to collect resource usage.

To enforce resource limits for chunk jobs, define LSB_CHUNK_RUSAGE=Y in lsf.conf. Limits may not be enforced for chunk jobs that take less than a minute to run.


Specifying Resource Usage Limits

Queues can enforce resource usage limits on running jobs. LSF supports most of the limits that the underlying operating system supports, and also supports a few limits that the operating system does not.

Specify queue-level resource usage limits using parameters in lsb.queues.

Specifying queue-level resource usage limits

Limits configured in lsb.queues apply to all jobs submitted to the queue. Job-level resource usage limits specified at job submission override the queue definitions.

Maximum value only

Specify only a maximum value for the resource.

For example, to specify a maximum run limit, use one value for the RUNLIMIT parameter in lsb.queues:

RUNLIMIT = 10

The maximum run limit for the queue is 10 minutes. Jobs cannot run for more than 10 minutes. Jobs in the RUN state for longer than 10 minutes are killed by LSF.

If only one run limit is specified, jobs that are submitted with bsub -W with a run limit that exceeds the maximum run limit will not be allowed to run. Jobs submitted without bsub -W will be allowed to run but will be killed when they are in the RUN state for longer than the specified maximum run limit.

Default and maximum values

If you specify two limits, the first one is the default (soft) limit for jobs in the queue and the second one is the maximum (hard) limit. Both the default and the maximum limits must be positive integers. The default limit must be less than the maximum limit. The default limit is ignored if it is greater than the maximum limit.

Use the default limit to avoid having to specify resource usage limits in the bsub command.

For example, to specify a default and a maximum run limit, use two values for the RUNLIMIT parameter in lsb.queues:

RUNLIMIT = 10 15

You can specify both default and maximum values for the following resource usage limits in lsb.queues: CPULIMIT, DATALIMIT, MEMLIMIT, PROCESSLIMIT, RUNLIMIT, and THREADLIMIT.

Host specification with two limits

If default and maximum limits are specified for CPU time limits or run time limits, only one host specification is permitted. For example, the following CPU limits are correct (and have an identical effect):

The following CPU limit is incorrect:

CPULIMIT = 400/hostA 600/hostB

The following run limits are correct (and have an identical effect):

The following run limit is incorrect:

RUNLIMIT = 10/hostA 15/hostB

Default run limits for backfill scheduling

Default run limits are used for backfill scheduling of parallel jobs. Automatically assigning a default run limit to all jobs in the queue means that backfill scheduling works efficiently.

For example, in lsb.queues, you enter:

RUNLIMIT = 10 15

The first number is the default run limit applied to all jobs in the queue that are submitted without a job-specific run limit. The second number is the maximum run limit.

If you submit a job to the queue without the -W option, the default run limit is used:

% bsub myjob

The job myjob cannot run for more than 10 minutes as specified with the default run limit.

If you submit a job to the queue with the -W option, the specified run limit is checked against the maximum run limit:

% bsub -W 12 myjob

The job myjob is allowed to run on the queue because the specified run limit (12) is less than the maximum run limit for the queue (15).

% bsub -W 20 myjob

The job myjob is rejected from the queue because the specified run limit (20) is more than the maximum run limit for the queue (15).
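The three submissions above follow one decision rule, sketched here in Python. This is an illustration under stated assumptions rather than LSF's implementation: run_limit_for_job and SubmissionRejected are hypothetical names, limits are plain minutes without a host specification, and None means that no value was given.

```python
class SubmissionRejected(Exception):
    """Raised when a job-level limit exceeds the queue maximum."""

def run_limit_for_job(job_limit, queue_default, queue_maximum):
    """Return the run limit in minutes applied to a job, or raise if it is rejected."""
    if job_limit is None:
        # No -W option: the default applies if defined, otherwise the maximum.
        return queue_default if queue_default is not None else queue_maximum
    if queue_maximum is not None and job_limit > queue_maximum:
        raise SubmissionRejected(
            "Cannot exceed queue's hard limit(s). Job not submitted.")
    return job_limit
```

With RUNLIMIT = 10 15, run_limit_for_job(None, 10, 15) gives 10, run_limit_for_job(12, 10, 15) gives 12, and run_limit_for_job(20, 10, 15) raises SubmissionRejected.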

Specifying job-level resource usage limits

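To specify resource usage limits at the job level, use one of the following bsub options: -C core_limit, -c cpu_limit, -D data_limit, -F file_limit, -M mem_limit, -p process_limit, -W run_limit, -S stack_limit, -T thread_limit, or -v swap_limit.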

Job-level resource usage limits specified at job submission override the queue definitions.


Supported Resource Usage Limits and Syntax

Core file size limit

Job syntax (bsub)   Queue syntax (lsb.queues)   Format/Units
-C core_limit       CORELIMIT=limit             integer KB

Sets a per-process (soft) core file size limit in KB for each process that belongs to this batch job. On some systems, no core file is produced if the image for the process is larger than the core limit. On other systems only the first core_limit KB of the image are dumped. The default is no soft limit.

CPU time limit

Job syntax (bsub)   Queue syntax (lsb.queues)    Format/Units
-c cpu_limit        CPULIMIT=[default] maximum   [hours:]minutes[/host_name | /host_model]

Sets the soft CPU time limit to cpu_limit for this batch job. The default is no limit. This option is useful for avoiding runaway jobs that use up too many resources. LSF keeps track of the CPU time used by all processes of the job.

When the job accumulates the specified amount of CPU time, a SIGXCPU signal is sent to all processes belonging to the job. If the job has no signal handler for SIGXCPU, the job is killed immediately. If the SIGXCPU signal is handled, blocked, or ignored by the application, then after the grace period expires, LSF sends SIGINT, SIGTERM, and SIGKILL to the job to kill it.
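A job that wants to use the grace period has to handle SIGXCPU itself. The following Python sketch shows one way an application might do that on UNIX. The handler only records that the limit was reached so that the main loop can save its state and exit. The names install_cpu_limit_handler and cpu_limit_was_reached are hypothetical, and nothing here is specific to LSF.

```python
import signal

_state = {"cpu_limit_reached": False}

def _on_sigxcpu(signum, frame):
    """Record that the CPU limit was hit; keep the handler itself minimal."""
    _state["cpu_limit_reached"] = True

def install_cpu_limit_handler():
    """Install the SIGXCPU handler and return the previously installed one."""
    return signal.signal(signal.SIGXCPU, _on_sigxcpu)

def cpu_limit_was_reached():
    """Poll this from the main loop, then checkpoint and exit promptly."""
    return _state["cpu_limit_reached"]
```

Because the follow-up signals end with SIGKILL, which cannot be caught, the application should finish its cleanup well within the grace period.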

You can define whether the CPU limit is a per-process limit enforced by the OS or a per-job limit enforced by LSF with LSB_JOB_CPULIMIT in lsf.conf.

Jobs submitted to a chunk job queue are not chunked if the CPU limit is greater than 30 minutes.

Format

cpu_limit is in the form [hour:]minute, where minute can be greater than 59. For example, 3.5 hours can be specified either as 3:30 or as 210.
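To make the format concrete, here is a small Python sketch that converts a limit string into minutes and an optional host specification. The function parse_time_limit is hypothetical and only mirrors the syntax [hours:]minutes[/host_name | /host_model] given in this section; it does not check that the host exists.

```python
def parse_time_limit(spec):
    """Parse '[hours:]minutes[/host_spec]' into (total_minutes, host_spec or None)."""
    time_part, slash, host_spec = spec.strip().partition("/")
    if slash and not host_spec:
        raise ValueError("empty host specification")
    hours_text, colon, minutes_text = time_part.rpartition(":")
    hours = int(hours_text) if colon else 0
    minutes = int(minutes_text)  # may be greater than 59
    if hours < 0 or minutes < 0:
        raise ValueError("time values must not be negative")
    return hours * 60 + minutes, (host_spec or None)
```

Both parse_time_limit("3:30") and parse_time_limit("210") yield 210 minutes with no host specification, and parse_time_limit("10/DEC3000") yields 10 minutes normalized to DEC3000.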

Normalized CPU time

The CPU time limit is normalized according to the CPU factor of the submission host and execution host. The CPU limit is scaled so that the job does approximately the same amount of processing for a given CPU limit, even if it is sent to a host with a faster or slower CPU.

For example, if a job is submitted from a host with a CPU factor of 2 and executed on a host with a CPU factor of 3, the CPU time limit is multiplied by 2/3 because the execution host can do the same amount of work as the submission host in 2/3 of the time.

If the optional host name or host model is not given, the CPU limit is scaled based on the DEFAULT_HOST_SPEC specified in the lsb.params file. (If DEFAULT_HOST_SPEC is not defined, the fastest batch host in the cluster is used as the default.) If host or host model is given, its CPU scaling factor is used to adjust the actual CPU time limit at the execution host.

The following example specifies that myjob can run for 10 minutes on a DEC3000 host, or the corresponding time on any other host:

% bsub -c 10/DEC3000 myjob

See CPU Time and Run Time Normalization for more information.

Data segment size limit

Job syntax (bsub)   Queue syntax (lsb.queues)     Format/Units
-D data_limit       DATALIMIT=[default] maximum   integer KB

Sets a per-process (soft) data segment size limit in KB for each process that belongs to this batch job. An sbrk() or malloc() call to extend the data segment beyond the data limit returns an error. The default is no soft limit.

File size limit

Job syntax (bsub)   Queue syntax (lsb.queues)   Format/Units
-F file_limit       FILELIMIT=limit             integer KB

Sets a per-process (soft) file size limit in KB for each process that belongs to this batch job. If a process of this job attempts to write to a file such that the file size would increase beyond the file limit, the kernel sends that process a SIGXFSZ signal. This condition normally terminates the process, but may be caught. The default is no soft limit.

Memory limit

Job syntax (bsub)   Queue syntax (lsb.queues)    Format/Units
-M mem_limit        MEMLIMIT=[default] maximum   integer KB

Sets the memory limit, in KB.

If LSB_MEMLIMIT_ENFORCE or LSB_JOB_MEMLIMIT in lsf.conf is set to y, LSF kills the job when it exceeds the memory limit. Otherwise, LSF passes the memory limit to the operating system. Some operating systems apply the memory limit to each process, and some do not enforce the memory limit at all.

LSF memory limit enforcement

To enable LSF memory limit enforcement, set LSB_MEMLIMIT_ENFORCE in lsf.conf to y. LSF memory limit enforcement explicitly sends a signal to kill a running process once it has allocated memory past mem_limit.

You can also enable LSF memory limit enforcement by setting LSB_JOB_MEMLIMIT in lsf.conf to y. The difference between LSB_JOB_MEMLIMIT set to y and LSB_MEMLIMIT_ENFORCE set to y is that with LSB_JOB_MEMLIMIT, only the per-job memory limit enforced by LSF is enabled. The per-process memory limit enforced by the OS is disabled. With LSB_MEMLIMIT_ENFORCE set to y, both the per-job memory limit enforced by LSF and the per-process memory limit enforced by the OS are enabled.

LSB_JOB_MEMLIMIT disables per-process memory limit enforced by the OS and enables per-job memory limit enforced by LSF. When the total memory allocated to all processes in the job exceeds the memory limit, LSF sends the following signals to kill the job: SIGINT first, then SIGTERM, then SIGKILL.
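The difference between a per-process and a per-job memory limit can be shown with a few lines of Python. This sketch only contrasts the two scopes. It does not describe how LSF or the operating system actually measures memory, and both function names are hypothetical.

```python
def exceeds_per_process_limit(process_mem_kb, mem_limit_kb):
    """Per-process scope: true if any single process is over the limit."""
    return any(mem > mem_limit_kb for mem in process_mem_kb)

def exceeds_per_job_limit(process_mem_kb, mem_limit_kb):
    """Per-job scope: true if the processes together are over the limit."""
    return sum(process_mem_kb) > mem_limit_kb
```

A job with three processes using 300, 400, and 500 KB under a 1000 KB limit passes the per-process check but exceeds the per-job limit, because the total is 1200 KB.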

On UNIX, the time interval between SIGINT, SIGTERM, and SIGKILL can be configured with the parameter JOB_TERMINATE_INTERVAL in lsb.params.

OS memory limit enforcement

OS enforcement usually allows the process to eventually run to completion. LSF passes mem_limit to the OS which uses it as a guide for the system scheduler and memory allocator. The system may allocate more memory to a process if there is a surplus. When memory is low, the system takes memory from and lowers the scheduling priority (re-nice) of a process that has exceeded its declared mem_limit.

OS memory limit enforcement is only available on systems that support RLIMIT_RSS for setrlimit().

The following operating systems do not support the memory limit at the OS level:

Process limit

Job syntax (bsub)   Queue syntax (lsb.queues)        Format/Units
-p process_limit    PROCESSLIMIT=[default] maximum   integer

Limits the number of concurrent processes that can be part of a job to process_limit. The limit applies to the whole job. The default is no limit. Exceeding the limit causes the job to terminate.

If a default process limit is specified, jobs submitted to the queue without a job-level process limit are killed when the default process limit is reached.

If you specify only one limit, it is the maximum, or hard, process limit. If you specify two limits, the first one is the default, or soft, process limit, and the second one is the maximum process limit.

Run time limit

Job syntax (bsub)   Queue syntax (lsb.queues)    Format/Units
-W run_limit        RUNLIMIT=[default] maximum   [hours:]minutes[/host_name | /host_model]

A run time limit is the maximum amount of time a job can run before it is terminated. The default is no limit. If the accumulated time the job has spent in the RUN state exceeds this limit, the job is sent a SIGUSR2 signal. If the job does not terminate within 10 minutes after being sent this signal, it is killed.

With deadline constraint scheduling configured, a run limit also specifies the amount of time a job is expected to take, and the minimum amount of time that must be available before a job can be started.

Jobs submitted to a chunk job queue are not chunked if the run limit is greater than 30 minutes.

Format

run_limit is in the form [hour:]minute, where minute can be greater than 59. For example, 3.5 hours can be specified either as 3:30 or as 210.

Normalized run time

The run time limit is normalized according to the CPU factor of the submission host and execution host. The run limit is scaled so that the job has approximately the same run time for a given run limit, even if it is sent to a host with a faster or slower CPU.

For example, if a job is submitted from a host with a CPU factor of 2 and executed on a host with a CPU factor of 3, the run limit is multiplied by 2/3 because the execution host can do the same amount of work as the submission host in 2/3 of the time.

If the optional host name or host model is not given, the run limit is scaled based on the DEFAULT_HOST_SPEC specified in the lsb.params file. (If DEFAULT_HOST_SPEC is not defined, the fastest batch host in the cluster is used as the default.) If host or host model is given, its CPU scaling factor is used to adjust the actual run limit at the execution host.

The following example specifies that myjob can run for 10 minutes on a DEC3000 host, or the corresponding time on any other host:

% bsub -W 10/DEC3000 myjob

If ABS_RUNLIMIT=Y is defined in lsb.params, the run time limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted with a run limit.

See CPU Time and Run Time Normalization for more information.

Platform MultiCluster

For MultiCluster jobs, if no other CPU time normalization host is defined and information about the submission host is not available, LSF uses the host with the largest CPU factor (the fastest host in the cluster). The ABS_RUNLIMIT parameter in lsb.params is not supported in either MultiCluster model; the run time limit is normalized by the CPU factor of the execution host.

Thread limit

Job syntax (bsub)   Queue syntax (lsb.queues)       Format/Units
-T thread_limit     THREADLIMIT=[default] maximum   integer

Sets the limit of the number of concurrent threads to thread_limit for the whole job. The default is no limit.

Exceeding the limit causes the job to terminate. The system sends the following signals in sequence to all processes belonging to the job: SIGINT, SIGTERM, and SIGKILL.

If a default thread limit is specified, jobs submitted to the queue without a job-level thread limit are killed when the default thread limit is reached.

If you specify only one limit, it is the maximum, or hard, thread limit. If you specify two limits, the first one is the default, or soft, thread limit, and the second one is the maximum thread limit.

Stack segment size limit

Job syntax (bsub)   Queue syntax (lsb.queues)   Format/Units
-S stack_limit      STACKLIMIT=limit            integer KB

Sets a per-process (soft) stack segment size limit in KB for each process that belongs to this batch job. An sbrk() call to extend the stack segment beyond the stack limit causes the process to be terminated. The default is no soft limit.

Virtual memory (swap) limit

Job syntax (bsub)   Queue syntax (lsb.queues)   Format/Units
-v swap_limit       SWAPLIMIT=limit             integer KB

Sets the total process virtual memory limit to swap_limit in KB for the whole job. The default is no limit. Exceeding the limit causes the job to terminate.

This limit applies to the whole job, no matter how many processes the job may contain.

Examples

Queue-level limits

Job-level limits


CPU Time and Run Time Normalization

To set the CPU time limit and run time limit for jobs in a platform-independent way, LSF scales the limits by the CPU factor of the hosts involved. When a job is dispatched to a host for execution, the limits are then normalized according to the CPU factor of the execution host.

Whenever a normalized CPU time or run time is given, the actual time on the execution host is the specified time multiplied by the CPU factor of the normalization host then divided by the CPU factor of the execution host.

If ABS_RUNLIMIT=Y is defined in lsb.params, the run time limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted with a run limit.

Normalization host

If no host or host model is given with the CPU time or run time, LSF uses the default CPU time normalization host defined at the queue level (DEFAULT_HOST_SPEC in lsb.queues) if it has been configured, otherwise uses the default CPU time normalization host defined at the cluster level (DEFAULT_HOST_SPEC in lsb.params) if it has been configured, otherwise uses the submission host.

Example

CPULIMIT=10/hostA

If hostA has a CPU factor of 2, and hostB has a CPU factor of 1 (hostB is slower than hostA), this specifies an actual time limit of 10 minutes on hostA, or on any other host that has a CPU factor of 2. However, if hostB is the execution host, the actual time limit on hostB is 20 minutes (10 * 2 / 1).
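The arithmetic in this example can be written as a one-line function. The Python sketch below applies the rule stated earlier in this section: the specified time is multiplied by the CPU factor of the normalization host and divided by the CPU factor of the execution host. The function name and the absolute flag are hypothetical; the flag stands in for ABS_RUNLIMIT=Y, which applies to run limits only.

```python
def actual_time_limit(limit_minutes, normalization_cpu_factor,
                      execution_cpu_factor, absolute=False):
    """Return the time limit in minutes that applies on the execution host."""
    if absolute:
        # Mirrors ABS_RUNLIMIT=Y: wall-clock time, no CPU factor scaling.
        return limit_minutes
    if normalization_cpu_factor <= 0 or execution_cpu_factor <= 0:
        raise ValueError("CPU factors must be positive")
    return limit_minutes * normalization_cpu_factor / execution_cpu_factor
```

With CPULIMIT=10/hostA, hostA at CPU factor 2, and hostB at CPU factor 1, actual_time_limit(10, 2, 1) gives 20 minutes on hostB and actual_time_limit(10, 2, 2) gives 10 minutes on hostA.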

Normalization hosts for default CPU and run time limits

The first valid CPU factor encountered is used for both CPU limit and run time limit. To be valid, a host specification must be a valid host name that is a member of the LSF cluster. The CPU factor is used even if the specified limit is not valid.

If the CPU and run limit have different host specifications, the CPU limit host specification is enforced.

If no host or host model is given with the CPU or run time limits, LSF determines the default normalization host according to the following priority:

  1. DEFAULT_HOST_SPEC is configured in lsb.queues
  2. DEFAULT_HOST_SPEC is configured in lsb.params
  3. If DEFAULT_HOST_SPEC is not configured in lsb.queues or lsb.params, the host with the largest CPU factor is used.
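The priority order in this list can be sketched in Python as follows. The function and its arguments are hypothetical: queue_spec and params_spec stand for the DEFAULT_HOST_SPEC values read from lsb.queues and lsb.params (None if not configured), and cpu_factors is an assumed mapping from host name to CPU factor.

```python
def default_normalization_host(queue_spec, params_spec, cpu_factors):
    """Pick the default normalization host following the priority list above."""
    if queue_spec is not None:        # 1. DEFAULT_HOST_SPEC in lsb.queues
        return queue_spec
    if params_spec is not None:       # 2. DEFAULT_HOST_SPEC in lsb.params
        return params_spec
    if not cpu_factors:
        raise ValueError("no hosts to choose from")
    # 3. Otherwise, the host with the largest CPU factor.
    return max(cpu_factors, key=cpu_factors.get)
```

For a cluster described as {"hostA": 2.0, "hostB": 1.0} with neither parameter configured, the function returns hostA.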

CPU time display (bacct, bhist, bqueues)

Normalized CPU time is displayed in the output of bqueues. CPU time is not normalized in the output of bacct and bhist.


      Date Modified: January 12, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.