[ Platform Documentation ] [ Title ] [ Contents ] [ Previous ] [ Next ] [ Index ]
- About Configured Resources
- Adding New Resources to Your Cluster
- Static Shared Resource Reservation
- External Load Indices and ELIM
- Modifying a Built-In Load Index
About Configured Resources
LSF schedules jobs based on available resources. There are many resources built into LSF, but you can also add your own resources, and then use them the same way as built-in resources.
For maximum flexibility, you should characterize your resources clearly enough so that users have satisfactory choices. For example, if some of your machines are connected to both Ethernet and FDDI, while others are only connected to Ethernet, then you probably want to define a resource called fddi and associate the fddi resource with machines connected to FDDI. This way, users can specify the resource fddi if they want their jobs to run on machines connected to FDDI.
Adding New Resources to Your Cluster
To add host resources to your cluster, use the following steps:
- Log in to any host in the cluster as the LSF administrator.
- Define new resources in the Resource section of lsf.shared. Specify at least a name and a brief description, which will be displayed to a user by lsinfo.
- For static Boolean resources, for all hosts that have the new resources, add the resource name to the RESOURCES column in the Host section of lsf.cluster.cluster_name.
- For shared resources, for all hosts that have the new resources, associate the resources with the hosts (you might also have a reason to configure non-shared resources in this section).
See Configuring lsf.cluster.cluster_name ResourceMap Section.
- Reconfigure your cluster.
Configuring lsf.shared Resource Section
Configured resources are defined in the Resource section of lsf.shared. There is no distinction between shared and non-shared resources.
You must specify at least a name and description for the resource, using the keywords RESOURCENAME and DESCRIPTION.
- A resource name cannot begin with a number.
- A resource name cannot contain any of the following characters:
    : . ( ) [ + - * / ! & | < > @ =
- A resource name cannot be any of the following reserved names:
    cpu cpuf io login ls idle maxmem maxswp maxtmp type model status it mem ncpus ndisks pg r15m r15s r1m swap swp tmp ut
- Resource names are case sensitive.
- Resource names can be up to 29 characters in length.
You can also specify:
- The resource type (TYPE = Boolean | String | Numeric)
The default is Boolean.
- For dynamic resources, the update interval (INTERVAL, in seconds)
- For numeric resources, whether a higher value indicates greater load (INCREASING = Y)
- For numeric shared resources, whether LSF releases the resource when a job using the resource is suspended (RELEASE = Y)
When the optional attributes are not specified, the resource is treated as static and Boolean.
    Begin Resource
    RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
    mips          Boolean  ()        ()          (MIPS architecture)
    dec           Boolean  ()        ()          (DECStation system)
    scratch       Numeric  30        N           (Shared scratch space on server)
    synopsys      Numeric  30        N           (Floating licenses for Synopsys)
    verilog       Numeric  30        N           (Floating licenses for Verilog)
    console       String   30        N           (User Logged in on console)
    End Resource
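The naming rules above can be checked before a name is added to lsf.shared. The following is a minimal sketch in sh, not an LSF tool: the function name valid_resource_name and the variable RESERVED_NAMES are hypothetical, and the checks only mirror the rules listed in this section.

```shell
#!/bin/sh
# Hypothetical helper (not an LSF command): checks a candidate resource
# name against the naming rules listed in this section.

RESERVED_NAMES="cpu cpuf io login ls idle maxmem maxswp maxtmp type model status it mem ncpus ndisks pg r15m r15s r1m swap swp tmp ut"

# valid_resource_name NAME: succeeds (status 0) if NAME passes every rule.
valid_resource_name() {
    name=$1
    # Must be non-empty and at most 29 characters long.
    [ -n "$name" ] || return 1
    [ "${#name}" -le 29 ] || return 1
    case $name in
        # Cannot begin with a number.
        [0-9]*)
            return 1 ;;
        # Cannot contain any of  : . ( ) [ + - * / ! & | < > @ =
        *:* | *.* | *\(* | *\)* | *\[* | *+* | *-* | *\** | */* | *!* | \
        *\&* | *\|* | *\<* | *\>* | *@* | *=*)
            return 1 ;;
    esac
    # Cannot be a reserved name (the comparison is case sensitive).
    for reserved in $RESERVED_NAMES; do
        [ "$name" = "$reserved" ] && return 1
    done
    return 0
}
```

For example, valid_resource_name verilog_lic succeeds, while valid_resource_name tmp and valid_resource_name my-res both fail.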
Configuring lsf.cluster.cluster_name ResourceMap Section
Resources are associated with the hosts for which they are defined in the ResourceMap section of lsf.cluster.cluster_name.
For each resource, you must specify the name and the hosts that have it.
If the ResourceMap section is not defined, then any dynamic resources specified in lsf.shared are not tied to specific hosts, but are shared across all hosts in the cluster.
A cluster consists of hosts host1, host2, and host3.
    Begin ResourceMap
    RESOURCENAME  LOCATION
    verilog       (5@[all ~host1 ~host2])
    synopsys      (2@[host1 host2] 2@[others])
    console       (1@[host1] 1@[host2] 1@[host3])
    xyz           (1@[default])
    End ResourceMap
In this example:
- 5 units of the verilog resource are defined on host3 only (all hosts except host1 and host2).
- 2 units of the synopsys resource are shared between host1 and host2. 2 more units of the synopsys resource are defined on host3 (shared among all the remaining hosts in the cluster).
- 1 unit of the console resource is defined on each host in the cluster (assigned explicitly). 1 unit of the xyz resource is defined on each host in the cluster (assigned with the keyword default).
RESOURCENAME
The name of the resource, as defined in lsf.shared.
LOCATION
Defines the hosts that share the resource. For a static resource, you must define an initial value here as well. Do not define a value for a dynamic resource.
Possible states of a resource:
- Each host in the cluster has the resource
- The resource is shared by all hosts in the cluster
- There are multiple instances of a resource within the cluster, and each instance is shared by a unique subset of hosts.
([resource_value@][host_name... | all [~host_name]... | others | default]...)
- For static resources, you must include the resource value, which indicates the quantity of the resource. Do not specify the resource value for dynamic resources because information about dynamic resources is updated by ELIM.
- Type square brackets around the list of hosts, as shown. You can omit the parentheses if you only specify one set of hosts.
- Each set of hosts within square brackets specifies an instance of the resource. The same host cannot be in more than one instance of a resource. All hosts within the instance share the quantity of the resource indicated by its value.
- The keyword all refers to all the server hosts in the cluster, collectively. Use the not operator (~) to exclude hosts or host groups.
- The keyword others refers to all hosts not otherwise listed in the instance.
- The keyword default refers to each host in the cluster, individually.
Non-batch configuration
The following items should be taken into consideration when configuring resources under LSF Base.
- In lsf.cluster.cluster_name, the Host section must precede the ResourceMap section, since the ResourceMap section uses the host names defined in the Host section.
- The RESOURCES column in the Host section of the lsf.cluster.cluster_name file should be used to associate static Boolean resources with particular hosts.
- Almost all resources specified in the ResourceMap section are interpreted by LSF commands as shared resources, which are displayed using lsload -s or lshosts -s. The exceptions are:
Static Shared Resource Reservation
You must use resource reservation to prevent over-committing static shared resources when scheduling.
The usual situation is that you configure single-user application licenses as static shared resources, and make that resource one of the job requirements. You should also reserve the resource for the duration of the job. Otherwise, LSF updates resource information, assumes that all the static shared resources can be used, and places another job that requires that license. The additional job cannot actually run if the license is already taken by a running job.
If every job that requests a license also reserves it, LSF updates the number of licenses at the start of each new dispatch turn, subtracts the number of licenses that are reserved, and only dispatches additional jobs if there are licenses available that are not already in use.
Reserving a static shared resource
To indicate that a shared resource is to be reserved while a job is running, specify the resource name in the rusage section of the resource requirement string.
You configured licenses for the Verilog application as a resource called verilog_lic. To submit a job that will run on a host when there is a license available:
    % bsub -R "select[defined(verilog_lic)] rusage[verilog_lic=1]" myjob
If the job can be placed, the license it uses will be reserved until the job completes.
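If many jobs need the same select-and-reserve pattern, the requirement string can be generated rather than typed each time. This is a small sketch in sh. The function name reserve_req is hypothetical, and it only formats the string shown in the example above without calling LSF itself.

```shell
#!/bin/sh
# Hypothetical helper: builds a resource requirement string that both
# selects hosts where the resource is defined and reserves it via rusage.
# Usage: reserve_req RESOURCE_NAME AMOUNT
reserve_req() {
    printf 'select[defined(%s)] rusage[%s=%s]' "$1" "$1" "$2"
}

# Example submission (needs a running LSF cluster, so it is left as a comment):
#   bsub -R "$(reserve_req verilog_lic 1)" myjob
```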
External Load Indices and ELIM
The LSF Load Information Manager (LIM) collects built-in load indices that reflect the load situations of CPU, memory, disk space, I/O, and interactive activities on individual hosts.
While built-in load indices might be sufficient for most jobs, you might have special workload or resource dependencies that require custom external load indices defined and configured by the LSF administrator. Load and shared resource information from external load indices is used the same way as built-in load indices for job scheduling and host selection.
You can write an External Load Information Manager (ELIM) program that collects the values of configured external load indices and updates LIM when new values are received.
An ELIM can be as simple as a small script, or as complicated as a sophisticated C program. A well-defined protocol allows the ELIM to talk to LIM.
The ELIM executable must be located in LSF_SERVERDIR.
- How LSF supports multiple ELIMs
- Configuring your application-specific SELIM
- How LSF uses ELIM for external resource collection
- Writing an ELIM
- Debugging an ELIM
How LSF supports multiple ELIMs
To increase LIM reliability, LSF Version 6.0 supports the configuration of multiple ELIM executables.
A master ELIM (melim) is installed in LSF_SERVERDIR. melim manages multiple site-defined sub-ELIMs (SELIMs) and reports external load information to LIM. melim does the following:
- Starts and stops SELIMs
- Checks syntax of load information reporting on behalf of LIM
- Collects load information reported from SELIMs
- Merges latest valid load reports from each SELIM and sends merged load information back to LIM
Multiple SELIMs managed by a master ELIM increase reliability by protecting LIM:
- ELIM output is buffered
- Incorrect resource format or values are checked by ELIM
- SELIMs are independent of each other; one SELIM hanging while waiting for load information does not affect the other SELIMs
melim logs its own activities and data into the log file LSF_LOGDIR/melim.log.host_name.
Configuring your application-specific SELIM
The master ELIM is installed as LSF_SERVERDIR/melim. After installation:
- Define the external resources you need.
- Write your application-specific SELIM to track these resources, as described in Writing an ELIM.
- Put your ELIM in LSF_SERVERDIR.
Use the following naming conventions:
- On UNIX, LSF_SERVERDIR/elim.application
For example, elim.license
- On Windows, LSF_SERVERDIR\elim.application.[exe |bat]
For example, elim.license.exe
Your existing ELIMs do not need to follow this convention and do not need to be renamed. However, since melim invokes any ELIM that follows this convention, you should move any backup copies of your ELIM out of LSF_SERVERDIR or choose a name that does not follow the convention (for example, use elim_bak instead of elim.bak).
The name elim.user is reserved for backward compatibility. Do not use the name elim.user for your application-specific elim.
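To see which files in LSF_SERVERDIR would be affected by this convention, a quick check of file names can help. The sketch below is in sh and covers the UNIX naming convention only. The function name classify_elim_name and its three output words are hypothetical and are not part of LSF.

```shell
#!/bin/sh
# Hypothetical helper: classifies a file name against the UNIX naming
# convention elim.application described in this section.
#   invoked  - follows the convention, so melim would invoke it
#   reserved - elim.user, kept for backward compatibility
#   ignored  - does not follow the convention
classify_elim_name() {
    base=${1##*/}              # strip any leading directory
    case $base in
        elim.user) echo reserved ;;
        elim.?*)   echo invoked ;;
        *)         echo ignored ;;
    esac
}
```

Note that a backup copy named elim.bak is classified as invoked, which is why such copies should be renamed (for example, to elim_bak) or moved out of LSF_SERVERDIR.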
How LSF uses ELIM for external resource collection
The values of static external resources are specified through the lsf.cluster.cluster_name configuration file. The values of all dynamic resources, regardless of whether they are shared or host-based, are collected through an ELIM.
An ELIM is started in the following situations:
- On every host, if any dynamic resource is configured as host-based. For example, if the LOCATION field in the ResourceMap section of lsf.cluster.cluster_name is ([default]), then every host will start an ELIM.
- On the master host, for any cluster-wide resources. For example, if the LOCATION field in the ResourceMap section of lsf.cluster.cluster_name is ([all]), then an ELIM is started on the master host.
- On the first host specified for each instance, if multiple instances of the resource exist within the cluster. For example, if the LOCATION field in the ResourceMap section of lsf.cluster.cluster_name is ([hostA hostB hostC] [hostD hostE hostF]), then an ELIM will be started on hostA and hostD to report the value of that resource for that set of hosts.
If the host reporting the value for an instance goes down, then an ELIM is started on the next available host in the instance. In the above example, if hostA became unavailable, an ELIM is started on hostB. If hostA becomes available again, then the ELIM on hostB is shut down and the one on hostA is started.
There is only one ELIM on each host, regardless of the number of resources on which it reports. If only cluster-wide resources are to be collected, then an ELIM will only be started on the master host.
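The failover rule for a resource instance amounts to "the first available host in the listed order reports the value". The sketch below models that rule in sh purely as an illustration. The function first_available is hypothetical and does not query LSF; the list of unavailable hosts is supplied by the caller.

```shell
#!/bin/sh
# Illustration only: models which host in one resource instance would run
# the ELIM, given the hosts in their configured order and the hosts that
# are currently unavailable.
# Usage: first_available "hostA hostB hostC" "hostA"
first_available() {
    for host in $1; do
        case " $2 " in
            *" $host "*) ;;                  # host is down, try the next one
            *) echo "$host"; return 0 ;;     # first host that is up
        esac
    done
    return 1                                 # no host in the instance is up
}
```

With the instance [hostA hostB hostC], the result is hostA while hostA is up and hostB while hostA is down, which matches the behavior described above.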
When LIM starts, the following environment variables are set for ELIM:
- LSF_MASTER: This variable is defined if the ELIM is being invoked on the master host. It is undefined otherwise. This can be used to test whether the ELIM should report on cluster-wide resources that only need to be collected on the master host.
- LSF_RESOURCES: This variable contains a list of resource names (separated by spaces) on which the ELIM is expected to report. A resource name is only put in the list if the host on which the ELIM is running shares an instance of that resource.
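An ELIM can use these two variables to decide what to collect on the host where it runs. The following sh sketch shows one way to test them. The function names on_master and reports_resource are hypothetical, and only the variable names LSF_MASTER and LSF_RESOURCES come from this section.

```shell
#!/bin/sh
# on_master: succeeds if LSF_MASTER is defined (with any value), which
# indicates that the ELIM was invoked on the master host.
on_master() {
    [ -n "${LSF_MASTER+set}" ]
}

# reports_resource NAME: succeeds if NAME appears in the space-separated
# list in LSF_RESOURCES, meaning this ELIM is expected to report on it.
reports_resource() {
    case " ${LSF_RESOURCES:-} " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example use inside an ELIM (left as a comment so nothing runs here):
#   if reports_resource myrsc; then ...collect the value of myrsc...; fi
#   if on_master; then ...collect cluster-wide resources...; fi
```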
Writing an ELIM
The ELIM must be an executable program, either an interpreted script or compiled code.
The ELIM communicates with the LIM by periodically writing a load update string to its standard output. The load update string contains the number of indices followed by a list of name-value pairs in the following format:
    number_indices [index_name index_value]...
For example,
    3 tmp2 47.5 nio 344.0 licenses 5
This string reports three indices: tmp2, nio, and licenses, with values 47.5, 344.0, and 5 respectively. Index values must be numbers between -INFINIT_LOAD and INFINIT_LOAD as defined in the lsf.h header file.
If the ELIM is implemented as a C program, as part of initialization it should use setbuf(3) to establish unbuffered output to stdout.
The ELIM should ensure that the entire load update string is written successfully to stdout. This can be done by checking the return value of printf(3s) if the ELIM is implemented as a C program or the return code of /bin/echo(1) from a shell script. The ELIM should exit if it fails to write the load information.
Each LIM sends updated load information to the master every 15 seconds. Depending on how quickly your external load indices change, the ELIM should write the load update string at most once every 15 seconds. If the external load indices rarely change, the ELIM can write the new values only when a change is detected. The LIM continues to use the old values until new values are received.
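The load update format and the write check can be sketched in sh as follows. The function name build_load_update is hypothetical; it counts the name-value pairs it is given, so the leading index count always matches the list, and it returns the status of the write so that the caller can exit on failure.

```shell
#!/bin/sh
# build_load_update NAME VALUE [NAME VALUE]...
# Writes one load update string, "number_indices name value ...", to
# standard output. Fails if the arguments are not complete pairs or if
# the write itself fails.
build_load_update() {
    # Arguments must come in name-value pairs.
    [ $(( $# % 2 )) -eq 0 ] || return 1
    report=$(( $# / 2 ))               # number of indices comes first
    while [ $# -gt 0 ]; do
        report="$report $1 $2"
        shift 2
    done
    printf '%s\n' "$report"            # status of printf is returned
}

# Example reporting loop (left as a comment so nothing runs here):
#   while :; do
#       build_load_update tmp2 47.5 nio 344.0 licenses 5 || exit 1
#       sleep 15
#   done
```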
The executable for the ELIM must be in LSF_SERVERDIR.
Use the following naming conventions:
- On UNIX, LSF_SERVERDIR/elim.application
For example, elim.license
- On Windows, LSF_SERVERDIR\elim.application.[exe |bat]
For example, elim.license.exe
If LIM expects some resources to be collected by an ELIM according to configuration, it invokes the ELIM automatically on startup. The ELIM runs with the same user ID and file access permission as the LIM.
The LIM restarts the ELIM if it exits; to prevent problems in case of a fatal error in the ELIM, it is restarted at most once every 90 seconds. When the LIM terminates, it sends a SIGTERM signal to the ELIM. The ELIM must exit upon receiving this signal.
- Write an ELIM.
The following sample ELIM (LSF_SERVERDIR/elim.myrsc) sets the value of the myrsc resource to 2. In a real ELIM, you would have a command to retrieve whatever value you want to retrieve and set the value.
    #!/bin/sh
    while :
    do
        # set the value for resource "myrsc"
        val=2
        # create an output string in the format:
        # number_indices index1_name index1_value...
        reportStr="1 myrsc $val"
        echo "$reportStr"
        # wait for 30 seconds before reporting again
        sleep 30
    done
- Test this ELIM by running it from the command line.
    % ./elim.myrsc
It should give you the output:
    1 myrsc 2
- Copy the ELIM to LSF_SERVERDIR and make sure it has the name elim.myrsc.
- Define the myrsc resource in lsf.shared.
In this case, we are defining the resource as Numeric because we want it to accept numbers. The value does not increase with load.
    Begin Resource
    RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
    myrsc         Numeric  30        N           (custom resource to trigger elim to start up)
    End Resource
- Map the myrsc resource to hosts in lsf.cluster.cluster_name. In this case, we want this resource to reside only on hostA.
    Begin ResourceMap
    RESOURCENAME  LOCATION
    myrsc         [hostA]
    End ResourceMap
- Reconfigure LSF with the commands:
- Display the resource with the command lsload -l. You should be able to see the new resource and value:
    HOST_NAME  status  r15s  r1m  r15m  ut  pg   io  ls  it  tmp  swp  mem  myrsc
    hostA      ok      0.4   0.4  0.4   0%  0.0  0   22  0   24M  26M  6M   2
Example code for an ELIM is included in the LSF_MISC/examples directory. The elim.c file is an ELIM written in C. You can modify this example to collect the load indices you want.
Debugging an ELIM
Set the parameter LSF_ELIM_DEBUG=y in the Parameters section of lsf.cluster.cluster_name to log all load information received by LIM from the ELIM in the LIM log file.
Set the parameter LSF_ELIM_BLOCKTIME=seconds in the Parameters section of lsf.cluster.cluster_name to configure how long LIM waits before restarting the ELIM.
Use the parameter LSF_ELIM_RESTARTS=integer in the Parameters section of lsf.cluster.cluster_name to limit the number of times an ELIM can be restarted.
See the Platform LSF Reference for more details on these parameters.
Modifying a Built-In Load Index
The ELIM can return values for the built-in load indices. In this case the value produced by the ELIM overrides the value produced by the LIM.
Considerations
- The ELIM must ensure that the semantics of any index it supplies are the same as that of the corresponding index returned by the lsinfo(1) command.
- The name of an external load index must not be one of the resource name aliases: cpu, idle, login, or swap. To override one of these indices, use its formal name: r1m, it, ls, or swp as the ELIM output.
- You must configure an external load index in lsf.shared even if you are overriding a built-in load index.
Steps
For example, some sites prefer to use /usr/tmp for temporary files.
To override the tmp load index:
- Write a program that periodically measures the space in the /usr/tmp file system and writes the value to standard output. For details on format, see Writing an ELIM.
For example, the program writes to its standard output:
    1 tmp 47.5
- Name the program elim and store it in the LSF_SERVERDIR directory.
All default load indices are local resources, so the elim must run locally on every machine.
- Define the resource.
Since the names of built-in load indices are not allowed in lsf.shared, define a custom resource to trigger the elim.
For example:
    Begin Resource
    RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
    my_tmp        Numeric  30        N           (custom resource to trigger elim to start up)
    End Resource
- Map the resource to hosts in lsf.cluster.cluster_name.
- To override the tmp load index on every host, specify the keyword default:
    Begin ResourceMap
    RESOURCENAME  LOCATION
    my_tmp        [default]
    End ResourceMap
- To override the tmp load index only on specific hosts, specify the host names:
    Begin ResourceMap
    RESOURCENAME  LOCATION
    my_tmp        ([host1] [host2] [host3])
    End ResourceMap
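The measuring program from the first step could look like the following sh sketch. It assumes that the tmp index is reported in megabytes (as the 24M in the earlier lsload output suggests) and that df supports the POSIX -k and -P options. The function names are hypothetical, and the reporting loop is left as a comment.

```shell
#!/bin/sh
# parse_avail_kb: reads "df -kP" output on standard input and prints the
# available space in KB (fourth column of the first data line).
parse_avail_kb() {
    awk 'NR == 2 { print $4 }'
}

# kb_to_mb KB: converts kilobytes to megabytes with one decimal place.
kb_to_mb() {
    awk -v kb="$1" 'BEGIN { printf "%.1f", kb / 1024 }'
}

# tmp_report DIR: writes a load update string that overrides the tmp index
# with the available space of the file system that holds DIR.
tmp_report() {
    kb=$(df -kP "$1" | parse_avail_kb)
    [ -n "$kb" ] || return 1           # df failed or gave no data line
    echo "1 tmp $(kb_to_mb "$kb")"
}

# Example reporting loop (left as a comment so nothing runs here):
#   while :; do
#       tmp_report /usr/tmp || exit 1
#       sleep 30
#   done
```

Keeping the df parsing in its own function makes it possible to check the arithmetic against sample df output without touching a real file system.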
Date Modified: January 12, 2004
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.