- Host States
- Viewing Host Information
- Controlling Hosts
- Adding a Host
- Removing a Host
- Adding and Removing Hosts Dynamically
- Adding Host Types and Host Models to lsf.shared
- Registering Service Ports
- Host Naming
- Hosts with Multiple Addresses
- Host Groups
- Tuning CPU Factors
- Handling Host-level Job Exceptions
Host States
Host states describe the ability of a host to accept and run batch jobs in terms of daemon states, load levels, and administrative controls. The bhosts and lsload commands display host states.
bhosts
Displays the current state of the host. With the -l option, bhosts also displays the closed reasons. A closed host will not accept new batch jobs:
    % bhosts
    HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
    hostA      ok      -     55   2      2    0      0      0
    hostB      closed  -     20   16     16   0      0      0
    ...
    % bhosts -l hostB
    HOST  hostB
    STATUS      CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
    closed_Adm  23.10  -     55   2      2    0      0      0    -
    CURRENT LOAD USED FOR SCHEDULING:
              r15s  r1m   r15m  ut  pg   io   ls  it  tmp    swp   mem
    Total     1.0   -0.0  -0.0  4%  9.4  148  2   3   4231M  698M  233M
    Reserved  0.0   0.0   0.0   0%  0.0  0    0   0   0M     0M    0M
    LOAD THRESHOLD USED FOR SCHEDULING:
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -
lsload
Displays the current state of the host:
    $ lsload
    HOST_NAME  status  r15s  r1m  r15m  ut  pg   ls  it    tmp    swp   mem
    hostA      ok      0.0   0.0  0.0   4%  0.4  0   4316  10G    302M  252M
    hostB      ok      1.0   0.0  0.0   4%  8.2  2   14    4231M  698M  232M
    ...
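When many hosts are listed, it can help to filter the bhosts summary by status in a script. The following is a minimal sketch, assuming the whitespace-separated column layout shown in the sample above; the function names parse_bhosts and closed_hosts are hypothetical helpers, not part of LSF.

```python
def parse_bhosts(text):
    """Turn the summary table printed by `bhosts` into a list of dicts."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    # Keep only rows with one value per column; this skips "..." filler lines.
    return [dict(zip(header, fields)) for fields in body if len(fields) == len(header)]


def closed_hosts(hosts):
    """Names of hosts whose STATUS begins with "closed" (closed, closed_Adm, ...)."""
    return [h["HOST_NAME"] for h in hosts if h["STATUS"].startswith("closed")]


# Sample copied from the bhosts output shown in this section.
SAMPLE = """\
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
hostA ok - 55 2 2 0 0 0
hostB closed - 20 16 16 0 0 0
...
"""

print(closed_hosts(parse_bhosts(SAMPLE)))  # ['hostB']
```

In practice the text would come from running bhosts, for example through subprocess; the sketch keeps the sample inline so it runs without a cluster.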
Viewing Host Information
LSF uses some or all of the hosts in a cluster as execution hosts. The host list is configured by the LSF administrator. Use the bhosts command to view host information. Use the lsload command to view host load information.
Viewing all hosts in the cluster and their status
Run bhosts to display information about all hosts and their status. For example:
    % bhosts
    HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
    hostA      ok      2     2    0      0    0      0      0
    hostD      ok      2     4    2      1    0      0      1
    hostB      ok      1     2    2      1    0      1      0
Viewing detailed server host information
Run bhosts -l host_name and lshosts -l host_name to display all information about each server host such as the CPU factor and the load thresholds to start, suspend, and resume jobs. For example:
    % bhosts -l hostB
    HOST  hostB
    STATUS  CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOWS
    ok      20.20  -     -    0      0    0      0      0    -
    CURRENT LOAD USED FOR SCHEDULING:
              r15s  r1m  r15m  ut  pg   io  ls  it  tmp   swp   mem
    Total     0.1   0.1  0.1   9%  0.7  24  17  0   394M  396M  12M
    Reserved  0.0   0.0  0.0   0%  0.0  0   0   0   0M    0M    0M
    LOAD THRESHOLD USED FOR SCHEDULING:
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -
    % lshosts -l hostB
    HOST_NAME: hostB
    type  model   cpuf  ncpus  ndisks  maxmem  maxswp  maxtmp  rexpri  server
    Sun4  Ultra2  20.2  2      1       256M    710M    466M    0       Yes
    RESOURCES: Not defined
    RUN_WINDOWS: (always open)
    LICENSES_ENABLED: (LSF_Base LSF_Manager LSF_MultiCluster LSF_Make LSF_Parallel)
    LOAD_THRESHOLDS:
    r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    -     1.0  -     -   -   -   -   -   -    -    4M
Viewing host load by host
The lsload command reports the current status and load levels of hosts in a cluster. The lshosts -l command shows the load thresholds.
The lsmon command provides a dynamic display of the load information. The LSF administrator can find unavailable or overloaded hosts with these tools.
Run lsload to see load levels for each host. For example:
    % lsload
    HOST_NAME  status  r15s  r1m   r15m  ut   pg   ls  it  tmp  swp   mem
    hostD      ok      1.3   1.2   0.9   92%  0.0  2   20  5M   148M  88M
    hostB      -ok     0.1   0.3   0.7   0%   0.0  1   67  45M  25M   34M
    hostA      busy    8.0   *7.0  4.9   84%  4.6  6   17  1M   81M   27M
The first line lists the load index names, and each following line gives the load levels for one host.
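The layout just described (one header line of load index names, then one line per host) maps naturally onto a dictionary keyed by host name. The sketch below assumes that layout and treats a leading asterisk on a value, as on r1m for hostA in the sample, as a marker worth reporting; the helper names are hypothetical.

```python
def parse_lsload(text):
    """Map each host name to a dict of its lsload columns."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    hosts = {}
    for fields in body:
        if len(fields) != len(header):
            continue  # skip filler or malformed lines
        record = dict(zip(header, fields))
        hosts[record["HOST_NAME"]] = record
    return hosts


def starred_indices(record):
    """Load index names whose value carries a leading '*' in the output."""
    return [name for name, value in record.items() if value.startswith("*")]


# Sample copied from the lsload output shown in this section.
SAMPLE = """\
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
hostD ok 1.3 1.2 0.9 92% 0.0 2 20 5M 148M 88M
hostB -ok 0.1 0.3 0.7 0% 0.0 1 67 45M 25M 34M
hostA busy 8.0 *7.0 4.9 84% 4.6 6 17 1M 81M 27M
"""

hosts = parse_lsload(SAMPLE)
not_ok = [name for name, rec in hosts.items() if rec["status"] != "ok"]
print(not_ok)                           # ['hostB', 'hostA']
print(starred_indices(hosts["hostA"]))  # ['r1m']
```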
Viewing host architecture information
An LSF cluster may consist of hosts of differing architectures and speeds. The lshosts command displays configuration information about hosts. All these parameters are defined by the LSF administrator in the LSF configuration files, or determined by the LIM directly from the system.
Host types represent binary compatible hosts; all hosts of the same type can run the same executable. Host models give the relative CPU performance of different processors. For example:
    % lshosts
    HOST_NAME  type    model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
    hostD      SUNSOL  SunSparc  6.0   1      64M     112M    Yes     (solaris cserver)
    hostB      ALPHA   DEC3000   10.0  1      94M     168M    Yes     (alpha cserver)
    hostM      RS6K    IBM350    7.0   1      64M     124M    Yes     (cserver aix)
    hostC      SGI6    R10K      14.0  16     1024M   1896M   Yes     (irix cserver)
    hostA      HPPA    HP715     6.0   1      98M     200M    Yes     (hpux fserver)
In the above example, the host type SUNSOL represents Sun SPARC systems running Solaris, and SGI6 represents an SGI server running IRIX 6. The lshosts command also displays the resources available on each host.
The type column shows the host CPU architecture. Hosts that can run the same binary programs should have the same type.
An UNKNOWN type or model indicates the host is down, or LIM on the host is down. See UNKNOWN host type or model for instructions on measures to take.
When automatic detection of host type or model fails, the type or model is set to DEFAULT. LSF will work on the host. A DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.
Viewing host history
Run badmin hhist to view the history of a host such as when it is opened or closed. For example:
    % badmin hhist hostB
    Wed Nov 20 14:41:58: Host <hostB> closed by administrator <lsf>.
    Wed Nov 20 15:23:39: Host <hostB> opened by administrator <lsf>.
Viewing host model and type information
Run lsinfo -m to display information about host models that exist in the cluster:
    % lsinfo -m
    MODEL_NAME      CPU_FACTOR  ARCHITECTURE
    PC1133          23.10       x6_1189_PentiumIIICoppermine
    HP9K735         4.50        HP9000735_125
    HP9K778         5.50        HP9000778
    Ultra5S         10.30       SUNWUltra510_270_sparcv9
    Ultra2          20.20       SUNWUltra2_300_sparc
    Enterprise3000  20.00       SUNWUltraEnterprise_167_sparc
Run lsinfo -M to display all host models defined in lsf.shared:
    % lsinfo -M
    MODEL_NAME           CPU_FACTOR  ARCHITECTURE
    UNKNOWN_AUTO_DETECT  1.00        UNKNOWN_AUTO_DETECT
    DEFAULT              1.00
    LINUX133             2.50        x586_53_Pentium75
    PC200                4.50        i86pc_200
    Intel_IA64           12.00       ia64
    Ultra5S              10.30       SUNWUltra5_270_sparcv9
    PowerPC_G4           12.00       x7400G4
    HP300                1.00
    SunSparc             12.00
Run lim -t to display the model of the current host. You must be the LSF administrator to use this command:
    % lim -t
    Host Type            : SOL732
    Host Architecture    : SUNWUltra2_200_sparcv9
    Matched Type         : SOL732
    Matched Architecture : SUNWUltra2_300_sparc
    Matched Model        : Ultra2
    CPU Factor           : 20.2
Viewing job exit rate and load for hosts
Use bhosts -l to display the exception threshold for job exit rate and the current load value for hosts. For example, EXIT_RATE for hostA is configured as 4 jobs per minute. hostA does not currently exceed this rate:
    % bhosts -l hostA
    HOST  hostA
    STATUS  CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
    ok      18.60  -     1    0      0    0      0      0    -
    CURRENT LOAD USED FOR SCHEDULING:
              r15s  r1m  r15m  ut  pg   io  ls  it  tmp   swp   mem
    Total     0.0   0.0  0.0   0%  0.0  0   1   2   646M  648M  115M
    Reserved  0.0   0.0  0.0   0%  0.0  0   0   0   0M    0M    0M
              share_rsrc  host_rsrc
    Total     3.0         2.0
    Reserved  0.0         0.0
    LOAD THRESHOLD USED FOR SCHEDULING:
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -
    THRESHOLD AND LOAD USED FOR EXCEPTIONS:
               JOB_EXIT_RATE
    Threshold  4.00
    Load       0.00
Use bhosts -x to see hosts whose job exit rate has exceeded the threshold for longer than JOB_EXIT_RATE_DURATION and is still above it. By default, these hosts will be closed the next time LSF checks host exceptions and invokes eadmin.
If no hosts exceed the job exit rate, bhosts -x displays:
    There is no exceptional host found
Controlling Hosts
Hosts are opened and closed by an LSF Administrator or root issuing a command or through configured dispatch windows.
Closing a host
Run badmin hclose:
    % badmin hclose hostB
    Close <hostB> ...... done
If the command fails, it may be because the host is unreachable through network problems, or because the daemons on the host are not running.
Opening a host
Run badmin hopen:
    % badmin hopen hostB
    Open <hostB> ...... done
Dispatch Windows
A dispatch window specifies one or more time periods during which a host will receive new jobs. The host will not receive jobs outside of the configured windows. Dispatch windows do not affect job submission and running jobs (they are allowed to run until completion). By default, dispatch windows are not configured.
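To make the rule concrete, the following sketch decides whether a host would receive new jobs at a given time of day. It is an illustration only: it assumes windows written as hour:minute-hour:minute within a single day, like the (4:30-12:00) value used in this section, treats the end time as exclusive, and ignores any richer window syntax LSF may accept. The function names are hypothetical.

```python
from datetime import time


def parse_clock(text):
    """Parse "hour" or "hour:minute" into a datetime.time."""
    parts = text.split(":")
    return time(int(parts[0]), int(parts[1]) if len(parts) > 1 else 0)


def parse_windows(spec):
    """Parse a value such as "(4:30-12:00)" into (start, end) pairs."""
    spec = spec.strip().strip("()").strip()
    if spec in ("", "-"):
        return []  # no dispatch window configured
    windows = []
    for item in spec.split():
        start, end = item.split("-")
        windows.append((parse_clock(start), parse_clock(end)))
    return windows


def accepts_new_jobs(spec, now):
    """True if `now` falls inside a window, or if no window is configured."""
    windows = parse_windows(spec)
    if not windows:
        return True  # by default, dispatch windows are not configured
    return any(start <= now < end for start, end in windows)


print(accepts_new_jobs("(4:30-12:00)", time(5, 0)))   # True
print(accepts_new_jobs("(4:30-12:00)", time(13, 0)))  # False
```

Jobs already running are unaffected by this check, which matches the statement that dispatch windows only govern when a host receives new jobs.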
To configure dispatch windows:
- Edit lsb.hosts.
- Specify one or more time windows in the DISPATCH_WINDOW column. For example:
    Begin Host
    HOST_NAME  r1m      pg   ls     tmp  DISPATCH_WINDOW
    ...
    hostB      3.5/4.5  15/  12/15  0    (4:30-12:00)
    ...
    End Host
- Reconfigure the cluster.
- Run bhosts -l to display the dispatch windows.
Logging a comment when closing or opening a host
Use the -C option of badmin hclose and badmin hopen to log an administrator comment in lsb.events. For example,
    % badmin hclose -C "Weekly backup" hostB
The comment text Weekly backup is recorded in lsb.events. If you close or open a host group, each host group member displays with the same comment string.
A new event record is recorded for each host open or host close event. For example:
    % badmin hclose -C "backup" hostA
followed by
    % badmin hclose -C "Weekly backup" hostA
will generate records in lsb.events:
    "HOST_CTRL" "6.0 1050082346 1 "hostA" 32185 "lsfadmin" "backup"
    "HOST_CTRL" "6.0 1050082373 1 "hostA" 32185 "lsfadmin" "Weekly backup"
Use badmin hist or badmin hhist to display administrator comments for closing and opening hosts. For example:
    % badmin hhist
    Fri Apr 4 10:35:31: Host <hostB> closed by administrator <lsfadmin> Weekly backup.
bhosts -l also displays the comment text:
    % bhosts -l
    HOST  hostA
    STATUS      CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
    closed_Adm  1.00  -     -    0      0    0      0      0    -
    CURRENT LOAD USED FOR SCHEDULING:
              r15s  r1m  r15m  ut  pg   io  ls  it  tmp    swp   mem
    Total     0.0   0.0  0.0   2%  0.0  64  2   11  7117M  512M  432M
    Reserved  0.0   0.0  0.0   0%  0.0  0   0   0   0M     0M    0M
    LOAD THRESHOLD USED FOR SCHEDULING:
               r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched  -     -    -     -   -   -   -   -   -    -    -
    loadStop   -     -    -     -   -   -   -   -   -    -    -
    THRESHOLD AND LOAD USED FOR EXCEPTIONS:
               JOB_EXIT_RATE
    Threshold  2.00
    Load       0.00
    ADMIN ACTION COMMENT: "Weekly backup"
How events are displayed and recorded in MultiCluster lease model
In the MultiCluster resource lease model, host control administrator comments are recorded only in the lsb.events file on the local cluster. badmin hist and badmin hhist display only events that are recorded locally. Host control messages are not passed between clusters in the MultiCluster lease model. For example, if you close an exported host in both the consumer and the provider cluster, the host close events are recorded separately in their local lsb.events.
Adding a Host
Use lsfinstall to add a host to an LSF cluster.
See the Platform LSF Reference for more information about lsfinstall.
Adding a host of an existing type using lsfinstall
lsfinstall is not compatible with clusters installed with lsfsetup. To add a host to a cluster originally installed with lsfsetup, you must upgrade your cluster.
- Verify that the host type already exists in your cluster:
  - Log on to any host in the cluster. You do not need to be root.
  - List the contents of the LSF_TOP/6.0 directory. The default is /usr/share/lsf/6.0. If the host type currently exists, there will be a subdirectory with the name of the host type. If it does not exist, go to Adding a host of a new type using lsfinstall.
- Add the host information to lsf.cluster.cluster_name:
  - Log on to the LSF master host as root.
  - Edit LSF_CONFDIR/lsf.cluster.cluster_name, and specify the following in the Host section:
    - The name of the host.
    - The model and type, or specify ! to automatically detect the type or model.
    - Specify 1 for LSF server or 0 for LSF client. For example:
        Begin Host
        HOSTNAME  model  type      server  r1m  mem  RESOURCES  REXPRI
        hosta     !      SUNSOL6   1       1.0  4    ()         0
        hostb     !      SUNSOL6   0       1.0  4    ()         0
        hostc     !      HPPA1132  1       1.0  4    ()         0
        hostd     !      HPPA1164  1       1.0  4    ()         0
        End Host
  - Save your changes.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- Run hostsetup to set up the new host and configure the daemons to start automatically at boot from /usr/share/lsf/6.0/install:
    # ./hostsetup --top="/usr/share/lsf" --boot="y"
- Start LSF on the new host:
    # lsadmin limstartup
    # lsadmin resstartup
    # badmin hstartup
- Run bhosts and lshosts to verify your changes.
  - If any host type or host model is UNKNOWN, follow the steps in UNKNOWN host type or model to fix the problem.
  - If any host type or host model is DEFAULT, follow the steps in DEFAULT host type or model to fix the problem.
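The final verification step can be scripted. The sketch below scans lshosts output for hosts whose type or model is reported as UNKNOWN or DEFAULT. It assumes the column order HOST_NAME, type, model shown in the lshosts examples in this document; the sample rows for hostX and hostY are invented for illustration, and the function name is hypothetical.

```python
FLAGGED = ("UNKNOWN", "DEFAULT")


def hosts_needing_attention(lshosts_text):
    """Map host name -> (type, model) for hosts reporting UNKNOWN or DEFAULT."""
    problems = {}
    lines = lshosts_text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        host, host_type, model = fields[0], fields[1], fields[2]
        if host_type in FLAGGED or model in FLAGGED:
            problems[host] = (host_type, model)
    return problems


# hostD is taken from the lshosts example; hostX and hostY are hypothetical rows.
SAMPLE = """\
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
hostD SUNSOL SunSparc 6.0 1 64M 112M Yes (solaris cserver)
hostX UNKNOWN UNKNOWN 1.0 - - - Yes ()
hostY SUNSOL DEFAULT 1.0 1 64M 112M Yes ()
"""

print(hosts_needing_attention(SAMPLE))
# {'hostX': ('UNKNOWN', 'UNKNOWN'), 'hostY': ('SUNSOL', 'DEFAULT')}
```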
Adding a host of a new type using lsfinstall
lsfinstall is not compatible with clusters installed with lsfsetup. To add a host to a cluster originally installed with lsfsetup, you must upgrade your cluster.
- Verify that the host type does not already exist in your cluster:
  - Log on to any host in the cluster. You do not need to be root.
  - List the contents of the LSF_TOP/6.0 directory. The default is /usr/share/lsf/6.0. If the host type currently exists, there will be a subdirectory with the name of the host type. If the host type already exists, go to Adding a host of an existing type using lsfinstall.
- Get the LSF distribution tar file for the host type you want to add.
- Log on as root to any host that can access the LSF install directory.
- Change to the LSF install directory. The default is /usr/share/lsf/6.0/install
- Edit install.config:
  - For LSF_TARDIR, specify the path to the tar file. For example:
      LSF_TARDIR="/usr/share/lsf_distrib/6.0"
  - For LSF_ADD_SERVERS, list the new host names enclosed in quotes and separated by spaces. For example:
      LSF_ADD_SERVERS="hosta hostb"
- Run ./lsfinstall -f install.config. This automatically creates the host information in lsf.cluster.cluster_name.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- Run hostsetup to set up the new host and configure the daemons to start automatically at boot from /usr/share/lsf/6.0/install:
    # ./hostsetup --top="/usr/share/lsf" --boot="y"
- Start LSF on the new host:
    # lsadmin limstartup
    # lsadmin resstartup
    # badmin hstartup
- Run bhosts and lshosts to verify your changes.
  - If any host type or host model is UNKNOWN, follow the steps in UNKNOWN host type or model to fix the problem.
  - If any host type or host model is DEFAULT, follow the steps in DEFAULT host type or model to fix the problem.
Removing a Host
Removing a host from LSF involves preventing any additional jobs from running on the host, removing the host from LSF, and removing the host from the cluster.
Never remove the master host from LSF. If you want to remove your current default master from LSF, change
lsf.cluster.cluster_name to assign a different default master host. Then remove the host that was once the master host.
- Log on to the LSF host as root.
- Run badmin hclose to close the host. This prevents jobs from being dispatched to the host and allows running jobs to finish.
- When all dispatched jobs are finished, run lsfshutdown to stop the LSF daemons.
- Remove any references to the host in the Host section of LSF_CONFDIR/lsf.cluster.cluster_name.
- Remove any other references to the host, if applicable, from the following LSF configuration files:
- Log off the host to be removed, and log on as root or the primary LSF administrator to any other host in the cluster.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- If you configured LSF daemons to start automatically at system startup, remove the LSF section from the host's system startup files.
- If any users of the host use lstcsh as their login shell, change their login shell to tcsh or csh. Remove lstcsh from the /etc/shells file.
Adding and Removing Hosts Dynamically
By default, all configuration changes made to LSF are static. You must manually change the configuration and restart the cluster (or at least all master candidates). Dynamic host configuration allows you to add hosts to the cluster or remove them without manually changing the configuration.
When the dynamic host configuration is enabled, any host will be able to join the cluster. You can limit which hosts can be LSF hosts with the parameter LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name.
How dynamic host configuration works
For dynamic host configuration, the master LIM:
- Receives request to add hosts
- Informs other master candidates to refresh themselves when a host is added or removed
- Detects host unavailability and, if LSF_DYNAMIC_HOST_TIMEOUT is defined, removes unavailable hosts that are not master candidates
Master candidate LIMs (LSF_MASTER_LIST)
To enable dynamic host configuration, you must define LSF_MASTER_LIST in lsf.conf. Specify a list of hosts that are candidates to become the master host for the cluster.
This set of hosts reads the LSF configuration files when a new host is added to the cluster; other hosts (slave hosts) only receive the host configuration from master LIM. LSF_MASTER_LIST also identifies the hosts that need to be reconfigured after configuration change.
Master candidate hosts are informed when a new host is added. When a master candidate becomes master host, its LIM receives requests from dynamic hosts to add them to the cluster.
Master candidate hosts should share LSF configuration and binaries.
Dynamically added LSF hosts that will not be master candidates are slave hosts. Each dynamic slave host has its own LSF binaries and local lsf.conf and shell environment scripts (cshrc.lsf and profile.lsf). You must install LSF on each slave host.
If LSF_STRICT_CHECKING is defined in lsf.conf to protect your cluster in untrusted environments, and your cluster has slave hosts that are dynamically added, LSF_STRICT_CHECKING must be configured in the local lsf.conf on all slave hosts.
Slave LIMs report their availability to the master LIM when they start. When each slave host starts, it first contacts the master LIM to add itself to the cluster. The master host adds the host if it is not in its host table, or returns ok if the host has already been added.
Use LSF_LOCAL_RESOURCES in a localized lsf.conf to define instances of local resources residing on the slave host:
- For numeric resources, define name-value pairs:
    [resourcemap value*resource_name]
- For Boolean resources, the value will be the resource name in the form:
    [resource resource_name]
When the slave host calls the master host to add itself, it also reports its local resources. The local resources to be added must be defined in lsf.shared.
If the same resource is already defined in lsf.cluster.cluster_name as default or all, it cannot be added as a local resource. The shared resource overrides the local one.
Resources must already be mapped to hosts in the ResourceMap section of lsf.cluster.cluster_name. If the ResourceMap section does not exist, local resources are not added.
LSF_LOCAL_RESOURCES is usually set in the slave.config file during installation. If LSF_LOCAL_RESOURCES are already defined in a local lsf.conf on the slave host, lsfinstall does not add resources you define in LSF_LOCAL_RESOURCES in slave.config. You should not have duplicate LSF_LOCAL_RESOURCES entries in lsf.conf. If local resources are defined more than once, only the last definition is valid.
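As an illustration of the two bracketed forms, the following sketch splits an LSF_LOCAL_RESOURCES value into numeric and Boolean resources. It assumes each entry is either [resourcemap value*resource_name] or [resource resource_name] and nothing else; the parser is a hypothetical helper, not LSF code, and the sample value is the one used for slave.config in this document.

```python
import re


def parse_local_resources(value):
    """Split an LSF_LOCAL_RESOURCES value into (numeric, boolean) resources.

    numeric maps resource name -> value (kept as text); boolean is a list of names.
    """
    numeric = {}
    boolean = []
    for body in re.findall(r"\[([^\]]*)\]", value):
        parts = body.split()
        if len(parts) != 2:
            raise ValueError(f"unexpected entry: [{body}]")
        keyword, argument = parts
        if keyword == "resourcemap":
            amount, name = argument.split("*", 1)
            numeric[name] = amount
        elif keyword == "resource":
            boolean.append(argument)
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return numeric, boolean


print(parse_local_resources("[resourcemap 1*verilog] [resource linux]"))
# ({'verilog': '1'}, ['linux'])
```

A check of this kind can catch a malformed entry before the slave host reports its resources to the master LIM.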
The lsadmin command is now able to run on a non-LSF host. Use lsadmin limstartup to start LIM on a newly added dynamic host.
If the master LIM dies, the next master candidate will have the same knowledge as the master LIM about dynamically added hosts in the cluster.
mbatchd gets host information from master LIM; when it detects that a host has been added or removed dynamically, mbatchd automatically reconfigures itself.
After adding a batch host dynamically, you may have to wait a few moments for mbatchd to detect the host and reconfigure. Depending on system load, mbatchd may wait up to a maximum of 10 minutes before reconfiguring.
Host configuration in lsb.hosts and lsb.queues
For host configuration in lsb.hosts and lsb.queues to apply to dynamically added hosts, use default or all, as appropriate, to enable configuration to apply to all hosts in the cluster.
Adding dynamic hosts in a shared file system
If the new dynamically added hosts share the same set of configuration and binary files with normal hosts, you only need to start the LSF daemons on that host and the host is recognized by the master as an LSF host.
- Specify the installation options in install.config.
  The following parameters are required:
- Use lsfinstall -f install.config to install LSF.
- On the master host, configure the following parameters:
  lsf.conf:
  - LSF_MASTER_LIST="host_name [host_name ...]"
    List the hosts that are candidates to become the master host for the cluster.
  - LSF_DYNAMIC_HOST_TIMEOUT=timeout[m | M] (optional)
    Set an optional timeout value in hours or minutes. If the dynamic host is unavailable for longer than the time specified, it is removed from the cluster. To specify a value in minutes, append "m" or "M" to the timeout value.
  lsf.cluster.cluster_name (optional)
- Log on as root to each host you want to join a cluster.
- Use one of the following to set the LSF environment:
- Optionally, run hostsetup on each LSF server host.
  You only need to run hostsetup if you want LSF to automatically start when the host is rebooted. For example:
    # cd /usr/share/lsf/5.1/install
    # ./hostsetup --top="/usr/share/lsf" --boot="y"
  For complete hostsetup usage, enter hostsetup -h.
- Use the following commands to start LSF:
    # lsadmin limstartup
    # lsadmin resstartup
    # badmin hstartup
Adding dynamic hosts in a non-shared file system (slave hosts)
If each dynamic slave host has its own LSF binaries and local lsf.conf and shell environment scripts (cshrc.lsf and profile.lsf), you must install LSF on each slave host.
- Specify installation options in the slave.config file.
  The following parameters are required:
  The following parameters are optional:
  - LSF_LIM_PORT=port_number
    If the master host does not use the default LSF_LIM_PORT, you must specify the same LSF_LIM_PORT defined in lsf.conf on the master host.
  - LSF_LOCAL_RESOURCES=resource ...
    Defines the local resources for a dynamic host.
    - For numeric resources, define name-value pairs:
        [resourcemap value*resource_name]
    - For Boolean resources, the value will be the resource name in the form:
        [resource resource_name]
    For example:
        LSF_LOCAL_RESOURCES=[resourcemap 1*verilog] [resource linux]
    If LSF_LOCAL_RESOURCES are already defined in a local lsf.conf on the slave host, lsfinstall does not add resources you define in LSF_LOCAL_RESOURCES in slave.config.
- Use lsfinstall -s -f slave.config to install a dynamic slave host.
  lsfinstall creates a local lsf.conf for the slave host, which sets the following parameters:
- Use one of the following to set the LSF environment:
- Optionally, run hostsetup on each LSF server host.
  You only need to run hostsetup if you want LSF to automatically start when the host is rebooted. For example:
    # cd /usr/local/lsf/5.1/install
    # ./hostsetup --top="/usr/local/lsf" --boot="y"
  For complete hostsetup usage, enter hostsetup -h.
- Use the following commands to start LSF:
    # lsadmin limstartup
    # lsadmin resstartup
    # badmin hstartup
The first time a non-shared slave host joins the cluster, daemons on the new host can only be started on the local host. For example, the LSF administrator cannot start daemons on hostB from hostA by using lsadmin limstartup hostB. Instead, the first time the host joins the cluster, use:
    # rsh hostB lsadmin limstartup
Allowing only certain hosts to join the cluster
By default, any host can be dynamically added to the cluster. To avoid having unauthorized hosts join the cluster, you can optionally use LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name to specify a range of IP addresses that identifies the hosts allowed to be dynamically added as LSF hosts.
LSF_HOST_ADDR_RANGE (lsf.cluster.cluster_name)
If a value is defined for LSF_HOST_ADDR_RANGE, security for dynamically adding and removing hosts is enabled, and only hosts with IP addresses within the specified range can be added to or removed from a cluster dynamically.
Automatic removal of dynamically added hosts
By default, dynamically added hosts remain in the cluster permanently. Optionally, you can use LSF_DYNAMIC_HOST_TIMEOUT in lsf.conf to set a timeout value in hours or minutes.
LSF_DYNAMIC_HOST_TIMEOUT (lsf.conf)
If LSF_DYNAMIC_HOST_TIMEOUT is defined and a host is not a master candidate, when the host is unavailable for longer than the value specified, it is removed from the cluster.
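The timeout syntax and the removal rule can be summarized in a few lines. This is a sketch under the rules stated in this document only: a bare number means hours, a trailing m or M means minutes, and master candidates are never removed. The function names are hypothetical and do not correspond to any LSF interface.

```python
from datetime import timedelta


def parse_dynamic_host_timeout(value):
    """Convert an LSF_DYNAMIC_HOST_TIMEOUT value such as "2" or "90m" to a timedelta."""
    text = value.strip()
    if text.endswith(("m", "M")):
        return timedelta(minutes=int(text[:-1]))  # minutes when suffixed with m or M
    return timedelta(hours=int(text))  # hours otherwise


def should_remove(is_master_candidate, unavailable_for, timeout):
    """Apply the removal rule: only non-candidates that exceed the timeout are removed."""
    return (not is_master_candidate) and unavailable_for > timeout


timeout = parse_dynamic_host_timeout("90m")
print(timeout)                                              # 1:30:00
print(should_remove(False, timedelta(hours=2), timeout))    # True
print(should_remove(True, timedelta(hours=2), timeout))     # False
```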
Adding Host Types and Host Models to lsf.shared
The lsf.shared file contains a list of host type and host model names for most operating systems. You can add to this list or customize the host type and host model names. A host type and host model name can be any alphanumeric string up to 29 characters long.
Adding a custom host type or model
- Log on as the LSF administrator on any host in the cluster.
- Edit lsf.shared:
  - For a new host type, modify the HostType section:
      Begin HostType
      TYPENAME # Keyword
      DEFAULT
      CRAYJ
      CRAYC
      CRAYT
      DigitalUNIX
      HPPA
      IBMAIX4
      SGI6
      SUNSOL
      SONY
      WIN95
      End HostType
  - For a new host model, modify the HostModel section:
    Add the new model and its CPU speed factor relative to other models. For more details on tuning CPU factors, see Tuning CPU Factors.
      Begin HostModel
      MODELNAME  CPUFACTOR  ARCHITECTURE # keyword
      # x86 (Solaris, NT, Linux): approximate values, based on SpecBench results
      # for Intel processors (Sparc/NT) and BogoMIPS results (Linux).
      PC75       1.5        (i86pc_75 i586_75 x586_30)
      PC90       1.7        (i86pc_90 i586_90 x586_34 x586_35 x586_36)
      HP9K715    4.2        (HP9000715_100)
      SunSparc   12.0       ()
      CRAYJ90    18.0       ()
      IBM350     18.0       ()
      End HostModel
- Save the changes to lsf.shared.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
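Before reconfiguring, it can be useful to check an edited HostModel section mechanically. The sketch below reads a section in the layout shown in the example (model name, CPU factor, parenthesized architecture list, with # starting a comment) and returns the CPU factor and architectures per model. It is a hypothetical helper that assumes this layout and is not how LIM itself reads lsf.shared.

```python
import re

ROW = re.compile(r"(\S+)\s+(\S+)\s+\((.*)\)$")


def parse_host_models(text):
    """Map model name -> (cpu_factor, [architecture strings]) from a HostModel section."""
    models = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        if line == "Begin HostModel":
            in_section = True
            continue
        if line == "End HostModel":
            break
        if not in_section:
            continue
        match = ROW.match(line)
        if match is None:
            continue  # the MODELNAME/CPUFACTOR header row has no parentheses
        name, factor, architectures = match.groups()
        models[name] = (float(factor), architectures.split())
    return models


# Rows copied from the HostModel example in this section.
SAMPLE = """\
Begin HostModel
MODELNAME CPUFACTOR ARCHITECTURE # keyword
# x86 (Solaris, NT, Linux): approximate values
PC75 1.5 (i86pc_75 i586_75 x586_30)
HP9K715 4.2 (HP9000715_100)
SunSparc 12.0 ()
End HostModel
"""

models = parse_host_models(SAMPLE)
print(models["PC75"])      # (1.5, ['i86pc_75', 'i586_75', 'x586_30'])
print(models["SunSparc"])  # (12.0, [])
```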
Registering Service Ports
LSF uses dedicated UDP and TCP ports for communication. All hosts in the cluster must use the same port numbers to communicate with each other.
The service port numbers can be any numbers ranging from 1024 to 65535 that are not already used by other services. To make sure that the port numbers you supply are not already used by applications registered in your service database, check /etc/services or use the command ypcat services.
By default, port numbers for LSF services are defined in the lsf.conf file. You can also configure ports by modifying /etc/services or the NIS or NIS+ database. If you define port numbers in lsf.conf, port numbers defined in the service database are ignored.
lsf.conf
- Log on to any host as root.
- Edit lsf.conf and add the following lines:
    LSF_LIM_PORT=3879
    LSF_RES_PORT=3878
    LSB_MBD_PORT=3881
    LSB_SBD_PORT=3882
- Add the same entries to lsf.conf on every host.
- Save lsf.conf.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin mbdrestart to restart mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
/etc/services
During installation, use the hostsetup --boot="y" option to set up the LSF port numbers in the service database.
Use the file LSF_TOP/version/install/instlib/example.services as a guide for adding LSF entries to the services database.
If any other service listed in your services database has the same port number as one of the LSF services, you must change the port number for the LSF service. You must use the same port numbers on every LSF host.
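A conflict of this kind can be detected by scanning the services database text before adding the LSF entries. The following sketch assumes the usual name port/protocol layout of /etc/services and the example port numbers used in this document; it compares port numbers only, regardless of protocol, to match the rule stated above. The function names and the othersvc entry are hypothetical.

```python
# Example LSF port assignments used in this document.
LSF_PORTS = {"res": 3878, "lim": 3879, "mbatchd": 3881, "sbatchd": 3882}


def parse_services(text):
    """Yield (service_name, port) pairs from /etc/services-style text."""
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        port_text = fields[1].split("/", 1)[0]
        if port_text.isdigit():
            yield fields[0], int(port_text)


def find_conflicts(services_text, lsf_ports=LSF_PORTS):
    """List (lsf_service, other_service, port) where another service uses an LSF port."""
    conflicts = []
    for name, port in parse_services(services_text):
        for lsf_name, lsf_port in lsf_ports.items():
            if name != lsf_name and port == lsf_port:
                conflicts.append((lsf_name, name, port))
    return conflicts


def in_allowed_range(port):
    """Service ports may be any number from 1024 to 65535."""
    return 1024 <= port <= 65535


SAMPLE = """\
res 3878/tcp # remote execution server
lim 3879/udp # load information manager
othersvc 3881/udp # hypothetical entry that clashes with mbatchd
ident 113/tcp auth tap # identd
"""

print(find_conflicts(SAMPLE))  # [('mbatchd', 'othersvc', 3881)]
```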
- Log on to any host as root.
- Edit the /etc/services file by adding the contents of the LSF_TOP/version/install/instlib/example.services file:
    # /etc/services entries for LSF daemons
    #
    res      3878/tcp          # remote execution server
    lim      3879/udp          # load information manager
    mbatchd  3881/tcp          # master lsbatch daemon
    sbatchd  3882/tcp          # slave lsbatch daemon
    #
    # Add this if ident is not already defined
    # in your /etc/services file
    ident    113/tcp auth tap  # identd
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
NIS or NIS+ database
If you are running NIS, you only need to modify the services database once per NIS master. On some hosts the NIS database and commands are in the /var/yp directory; on others, NIS is found in /etc/yp.
- Log on to any host as root.
- Run lsfshutdown to shut down all the daemons in the cluster.
- To find the name of the NIS master host, use the command:
    % ypwhich -m services
- Log on to the NIS master host as root.
- Edit the /var/yp/src/services or /etc/yp/src/services file on the NIS master host, adding the contents of the LSF_TOP/version/install/instlib/example.services file:
    # /etc/services entries for LSF daemons.
    #
    res      3878/tcp          # remote execution server
    lim      3879/udp          # load information manager
    mbatchd  3881/tcp          # master lsbatch daemon
    sbatchd  3882/tcp          # slave lsbatch daemon
    #
    # Add this if ident is not already defined
    # in your /etc/services file
    ident    113/tcp auth tap  # identd
  Make sure that all the lines you add either contain valid service entries or begin with a comment character (#). Blank lines are not allowed.
- Change the directory to /var/yp or /etc/yp.
- Use the following command:
    % ypmake services
  On some hosts the master copy of the services database is stored in a different location.
  On systems running NIS+ the procedure is similar. Refer to your system documentation for more information.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
- Run lsfstartup to restart all daemons in the cluster.
Host Naming
LSF needs to match host names with the corresponding Internet host addresses.
LSF looks up host names and addresses in the following ways:
- In the /etc/hosts file
- Sun Network Information Service/Yellow Pages (NIS or YP)
- Internet Domain Name Service (DNS).
  DNS is also known as the Berkeley Internet Name Domain (BIND) or named, which is the name of the BIND daemon.
Each host is configured to use one or more of these mechanisms.
Network addresses
Each host has one or more network addresses; usually one for each network to which the host is directly connected. Each host can also have more than one name.
The first name configured for each address is called the official name.
Other names for the same host are called aliases.
LSF uses the configured host naming system on each host to look up the official host name for any alias or host address. This means that you can use aliases as input to LSF, but LSF always displays the official name.
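The relationship between addresses, official names, and aliases can be shown with a small lookup over hosts-file text. This is a sketch of the concept only, assuming the standard address, official name, aliases layout of /etc/hosts; it does not reproduce how LSF or the system resolver performs lookups, and the addresses and names in the sample are invented.

```python
def parse_hosts(text):
    """Return a list of (address, official_name, [aliases]) from hosts-file text."""
    table = []
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or address without a name
        table.append((fields[0], fields[1], fields[2:]))
    return table


def official_name(table, name_or_address):
    """Resolve an address, official name, or alias to the official name."""
    for address, official, aliases in table:
        if name_or_address in (address, official, *aliases):
            return official
    return None


# Hypothetical entries; the first name on each line is the official name.
SAMPLE = """\
192.0.2.10 hostA.example.com hostA a
192.0.2.11 hostB.example.com hostB
"""

table = parse_hosts(SAMPLE)
print(official_name(table, "a"))           # hostA.example.com
print(official_name(table, "192.0.2.11"))  # hostB.example.com
```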
Host name services
On Digital Unix systems, the /etc/svc.conf file controls which host name service is used.
On Solaris systems, the /etc/nsswitch.conf file controls the name service.
On other UNIX platforms, the following rules apply:
- If your host has an /etc/resolv.conf file, your host is using DNS for name lookups
- If the command ypcat hosts prints out a list of host addresses and names, your system is looking up names in NIS
- Otherwise, host names are looked up in the /etc/hosts file
For more information
The man pages for the gethostbyname function, the ypbind and named daemons, the resolver functions, and the hosts, svc.conf, nsswitch.conf, and resolv.conf files explain host name lookups in more detail.
Hosts with Multiple Addresses
Hosts that have more than one network interface usually have one Internet address for each interface. Such hosts are called multi-homed hosts. LSF identifies hosts by name, so it needs to match each of these addresses with a single host name. To do this, the host name information must be configured so that all of the Internet addresses for a host resolve to the same name.
There are two ways to do it:
- Modify the system hosts file (/etc/hosts) and the changes will affect the whole system
- Create an LSF hosts file (LSF_CONFDIR/hosts) and LSF will be the only application that resolves the addresses to the same host
Multiple network interfaces
Some system manufacturers recommend that each network interface, and therefore, each Internet address, be assigned a different host name. Each interface can then be directly accessed by name. This setup is often used to make sure NFS requests go to the nearest network interface on the file server, rather than going through a router to some other interface. Configuring this way can confuse LSF, because there is no way to determine that the two different names (or addresses) mean the same host. LSF provides a workaround for this problem.
All host naming systems can be configured so that host address lookups always return the same name, while still allowing access to network interfaces by different names. Each host has an official name and a number of aliases, which are other names for the same host. By configuring all interfaces with the same official name but different aliases, you can refer to each interface by a different alias name while still providing a single official name for the host.
Configuring the LSF hosts file
If your LSF clusters include hosts that have more than one interface and are configured with more than one official host name, you must either modify the host name configuration, or create a private hosts file for LSF to use.
The LSF hosts file is stored in LSF_CONFDIR. The format of LSF_CONFDIR/hosts is the same as for /etc/hosts.
In the LSF hosts file, duplicate the system hosts database information, except make all entries for the host use the same official name. Configure all the other names for the host as aliases so that people can still refer to the host by any name.
For example, if your /etc/hosts file contains:
    AA.AA.AA.AA  host-AA host       # first interface
    BB.BB.BB.BB  host-BB            # second interface
then the LSF_CONFDIR/hosts file should contain:
    AA.AA.AA.AA  host host-AA       # first interface
    BB.BB.BB.BB  host host-BB       # second interface
Example /etc/hosts entries
The following example is for a host with two interfaces, where the host does not have a unique official name.
    # Address      Official name    Aliases
    # Interface on network A
    AA.AA.AA.AA    host-AA.domain   host.domain host-AA host
    # Interface on network B
    BB.BB.BB.BB    host-BB.domain   host-BB host
Looking up the address AA.AA.AA.AA finds the official name host-AA.domain. Looking up address BB.BB.BB.BB finds the name host-BB.domain. No information connects the two names, so there is no way for LSF to determine that both names, and both addresses, refer to the same host.
To resolve this case, you must configure these addresses using a unique host name. If you cannot make this change to the system file, you must create an LSF hosts file and configure these addresses using a unique host name in that file.
Here is the same example, with both addresses configured for the same official name.
    # Address      Official name    Aliases
    # Interface on network A
    AA.AA.AA.AA    host.domain      host-AA.domain host-AA host
    # Interface on network B
    BB.BB.BB.BB    host.domain      host-BB.domain host-BB host
With this configuration, looking up either address returns host.domain as the official name for the host. LSF (and all other applications) can determine that all the addresses and host names refer to the same host. Individual interfaces can still be specified by using the host-AA and host-BB aliases.
Sun's NIS uses the /etc/hosts file on the NIS master host as input, so the format for NIS entries is the same as for the /etc/hosts file.
Since LSF can resolve this case, you do not need to create an LSF hosts file.
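As a rough illustration of the problem described above, the following Python sketch parses hosts-file text and reports any name that appears under more than one official name. The function names are hypothetical and are not part of LSF; the sketch assumes the usual hosts layout of address, official name, then aliases on each line.

```python
def parse_hosts(text):
    """Return a list of (address, official_name, aliases) from hosts-file text."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # an address without a name is not a usable entry
        entries.append((fields[0], fields[1], fields[2:]))
    return entries


def conflicting_official_names(entries):
    """Return {name: set of official names} for every name that is listed
    under more than one official name."""
    seen = {}
    for _address, official, aliases in entries:
        for name in [official, *aliases]:
            seen.setdefault(name, set()).add(official)
    return {name: officials for name, officials in seen.items() if len(officials) > 1}


# Hypothetical usage with the first example from this section.
example = """
AA.AA.AA.AA host-AA.domain host.domain host-AA host
BB.BB.BB.BB host-BB.domain host-BB host
"""
conflicts = conflicting_official_names(parse_hosts(example))
for name, officials in conflicts.items():
    print(name, "is listed under", sorted(officials))
```

For the first example the sketch reports that host is listed under both host-AA.domain and host-BB.domain. For the corrected example, where both entries use host.domain as the official name, it reports nothing.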
DNS configuration
The configuration format is different for DNS. The same result can be produced by configuring two address (A) records for each Internet address. Following the previous example:
    # name            class  type  address
    host.domain       IN     A     AA.AA.AA.AA
    host.domain       IN     A     BB.BB.BB.BB
    host-AA.domain    IN     A     AA.AA.AA.AA
    host-BB.domain    IN     A     BB.BB.BB.BB
Looking up the official host name can return either address. Looking up the interface-specific names returns the correct address for each interface.
Address-to-name lookups in DNS are handled using PTR records. The PTR records for both addresses should be configured to return the official name:
    # address                   class  type  name
    AA.AA.AA.AA.in-addr.arpa    IN     PTR   host.domain
    BB.BB.BB.BB.in-addr.arpa    IN     PTR   host.domain
If it is not possible to change the system host name database, create the hosts file local to the LSF system, and configure entries for the multi-homed hosts only. Host names and addresses not found in the hosts file are looked up in the standard name system on your host.
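The record layout above can be generated mechanically. The following Python sketch builds the A and PTR lines for one multi-homed host. The function name and the two addresses (taken from the reserved documentation ranges) are assumptions for illustration only. Note that with real addresses the owner name of a PTR record lists the octets in reverse order, which the standard ipaddress module handles through reverse_pointer.

```python
import ipaddress


def dns_records(official, interfaces):
    """Return (A record lines, PTR record lines) for one multi-homed host.

    interfaces maps each interface-specific name to its IPv4 address.
    """
    # The official name gets one A record per interface address.
    a_records = [f"{official} IN A {addr}" for addr in interfaces.values()]
    # Each interface-specific name resolves only to its own address.
    a_records += [f"{name} IN A {addr}" for name, addr in interfaces.items()]
    # Every address maps back to the single official name.
    ptr_records = [
        f"{ipaddress.ip_address(addr).reverse_pointer} IN PTR {official}"
        for addr in interfaces.values()
    ]
    return a_records, ptr_records


# Placeholder addresses from the documentation ranges (an assumption).
a, ptr = dns_records(
    "host.domain",
    {"host-AA.domain": "192.0.2.10", "host-BB.domain": "198.51.100.10"},
)
print("\n".join(a + ptr))
```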
Host Groups
You can define a host group within LSF or use an external executable to retrieve host group members.
Use bhosts to view a list of existing hosts. Use bmgroup to view host group membership.
Where to use host groups
LSF host groups can be used in defining the following parameters in LSF configuration files:
- HOSTS in lsb.queues for authorized hosts for the queue
- HOSTS in lsb.hosts in the HostPartition section to list host groups that are members of the host partition
Configuring host groups
- Log in as the LSF administrator to any host in the cluster.
- Open lsb.hosts.
- Add the HostGroup section if it does not exist.
    Begin HostGroup
    GROUP_NAME    GROUP_MEMBER
    groupA        (all)
    groupB        (groupA ~hostA ~hostB)
    groupC        (hostX hostY hostZ)
    groupD        (groupC ~hostX)
    groupE        (all ~groupC ~hostB)
    groupF        (hostF groupC hostK)
    desk_tops     (hostD hostE hostF hostG)
    Big_servers   (!)
    End HostGroup
- Enter a group name under the GROUP_NAME column.
  External host groups must be defined in the egroup executable.
- Specify hosts in the GROUP_MEMBER column.
  (Optional) To tell LSF that the group members should be retrieved using egroup, put an exclamation mark (!) in the GROUP_MEMBER column.
- Save your changes.
- Run badmin ckconfig to check the group definition. If any errors are reported, fix the problem and check the configuration again.
- Do one of the following:
External host group requirements (egroup)
An external host group is a host group for which membership is not statically configured, but is instead retrieved by running an external executable with the name egroup. The egroup executable must be in the directory specified by LSF_SERVERDIR.
This feature allows a site to maintain group definitions outside LSF and import them into LSF configuration at initialization time.
The egroup executable is an executable you create yourself that lists group names and hosts that belong to the group.
This executable must have the name egroup. When mbatchd is restarted, it invokes the egroup executable and retrieves groups and group members. The external executable egroup runs under the same account as mbatchd.
The egroup executable must write host names for the host groups to its standard output, each name separated by white space.
The egroup executable must recognize the following command, since mbatchd invokes external host groups with this command:
    egroup -m host_group_name
where host_group_name is the name of the host group defined in the executable egroup along with its members, and the host group is specified in lsb.hosts.
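The requirements above could be met by a script along the following lines. This is a minimal sketch, not a supplied implementation: the membership table and its host names are hypothetical, and a real site would typically read them from its own inventory source. Only the -m host_group_name invocation and the white-space separated output come from the description above.

```python
#!/usr/bin/env python3
"""Sketch of an egroup executable: prints the members of one host group."""
import sys

# Hypothetical membership table. Big_servers matches the group marked
# with (!) in the HostGroup example; the host names are made up.
GROUPS = {
    "Big_servers": ["hostX", "hostY", "hostZ"],
}


def members(argv):
    """Return the host list for 'egroup -m host_group_name', else an empty list."""
    if len(argv) == 2 and argv[0] == "-m":
        return GROUPS.get(argv[1], [])
    return []


if __name__ == "__main__":
    # Host names go to standard output, separated by white space.
    print(" ".join(members(sys.argv[1:])))
```

To be picked up, the script would be saved under the name egroup in the LSF_SERVERDIR directory and made executable by the account that runs mbatchd.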
Tuning CPU Factors
CPU factors are used to differentiate the relative speed of different machines. LSF runs jobs on the best possible machines so that response time is minimized.
To achieve this, it is important that you define correct CPU factors for each machine model in your cluster.
How CPU factors affect performance
Incorrect CPU factors can reduce performance in the following ways:
- If the CPU factor for a host is too low, that host may not be selected for job placement when a slower host is available. This means that jobs would not always run on the fastest available host.
- If the CPU factor is too high, jobs are run on the fast host even when they would finish sooner on a slower but lightly loaded host. This causes the faster host to be overused while the slower hosts are underused.
Both of these conditions are somewhat self-correcting. If the CPU factor for a host is too high, jobs are sent to that host until the CPU load threshold is reached. LSF then marks that host as busy, and no further jobs will be sent there. If the CPU factor is too low, jobs may be sent to slower hosts. This increases the load on the slower hosts, making LSF more likely to schedule future jobs on the faster host.
Guidelines for setting CPU factors
CPU factors should be set based on a benchmark that reflects your workload. If there is no such benchmark, CPU factors can be set based on raw CPU power.
Set the CPU factor of the slowest hosts to 1, and set the CPU factor of each faster host in proportion to its speed relative to the slowest.
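This rule amounts to dividing the slowest benchmark time by each host's own time. The short Python sketch below shows the arithmetic; the function name is hypothetical, and it assumes you already have benchmark run times in seconds for each host.

```python
def cpu_factors(benchmark_seconds):
    """Return a CPU factor per host, with the slowest host fixed at 1.

    benchmark_seconds maps each host name to the time it took to run
    the same benchmark; a shorter time means a faster host.
    """
    slowest = max(benchmark_seconds.values())
    return {host: slowest / seconds for host, seconds in benchmark_seconds.items()}


# A host that needs 30 seconds and one that needs 15 seconds.
print(cpu_factors({"hostA": 30, "hostB": 15}))
```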
Consider a cluster with two hosts: hostA and hostB. In this cluster, hostA takes 30 seconds to run a benchmark and hostB takes 15 seconds to run the same test. The CPU factor for hostA should be 1, and the CPU factor of hostB should be 2 because it is twice as fast as hostA.
Viewing normalized ratings
Run lsload -N to display normalized ratings. LSF uses a normalized CPU performance rating to decide which host has the most available CPU power. Hosts in your cluster are displayed in order from best to worst. Normalized CPU run queue length values are based on an estimate of the time it would take each host to run one additional unit of work, given that an unloaded host with CPU factor 1 runs one unit of work in one unit of time.
Tuning CPU factors
- Log in as the LSF administrator on any host in the cluster.
- Edit lsf.shared, and change the HostModel section:
    Begin HostModel
    MODELNAME   CPUFACTOR   ARCHITECTURE # keyword
    #HPUX (HPPA)
    HP9K712S    2.5         (HP9000712_60)
    HP9K712M    2.5         (HP9000712_80)
    HP9K712F    4.0         (HP9000712_100)
  See the Platform LSF Reference for information about the lsf.shared file.
- Save the changes to lsf.shared.
- Run lsadmin reconfig to reconfigure LIM.
- Run badmin reconfig to reconfigure mbatchd.
Handling Host-level Job Exceptions
You can configure hosts so that LSF detects exceptional conditions while jobs are running, and takes appropriate action automatically. You can customize which exceptions are detected, and the corresponding actions. By default, LSF does not detect any exceptions.
eadmin script
When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host. You can customize eadmin to suit the requirements of your site. For example, eadmin could find out the owner of the problem jobs and use bstop -u to stop all jobs that belong to the user.
Host exceptions LSF can detect
If you configure exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically, jobs dispatched to such "black hole" or "job-eating" hosts exit abnormally. LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).
By default, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 10 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.
Default eadmin actions
LSF closes the host and sends email to the LSF administrator. The email contains the host name, job exit rate for the host, and other host information. The message eadmin: JOB EXIT THRESHOLD EXCEEDED is attached to the closed host event in lsb.events, and displayed by badmin hist and badmin hhist. Only one email is sent for host exceptions.
Configuring host exception handling (lsb.hosts)
EXIT_RATE specifies a threshold for exited jobs. If the job exit rate is exceeded for 10 minutes or the period specified by JOB_EXIT_RATE_DURATION, LSF invokes eadmin to trigger a host exception.
The following Host section defines a job exit rate of 20 jobs per minute for all hosts:
    Begin Host
    HOST_NAME   MXJ   EXIT_RATE   # Keywords
    Default     !     20
    End Host
Configuring thresholds for exception handling
JOB_EXIT_RATE_DURATION (lsb.params)
By default, LSF checks the number of exited jobs every 10 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change this default.
Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms, while longer values may not trigger exceptions frequently enough.
In the diagram, the job exit rate of hostA exceeds the configured threshold. LSF monitors hostA from time t1 to time t2 (t2 = t1 + JOB_EXIT_RATE_DURATION in lsb.params). At t2, the exit rate is still high, and a host exception is detected. At t3 (EADMIN_TRIGGER_DURATION in lsb.params), LSF invokes eadmin and the host exception is handled. By default, LSF closes hostA and sends email to the LSF administrator. Since hostA is closed and cannot accept any new jobs, the exit rate drops quickly.
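The t1 to t2 part of this timeline can be modelled in a few lines. The Python sketch below is a simplified illustration only and does not reproduce LSF's internal logic: it assumes the exit rate is sampled at known times and reports the first sample at which the rate has stayed above the threshold for the whole duration. The function name and the sample data are hypothetical.

```python
def detect_host_exception(samples, threshold, duration):
    """Return the time at which a host exception would be detected, or None.

    samples is a list of (time, exit_rate) pairs in increasing time order.
    An exception is detected once the rate has been above threshold
    continuously for at least duration time units.
    """
    above_since = None  # corresponds to t1 in the description above
    for time, rate in samples:
        if rate > threshold:
            if above_since is None:
                above_since = time
            if time - above_since >= duration:
                return time  # corresponds to t2
        else:
            above_since = None  # the rate dropped, so monitoring starts over
    return None


# Times in minutes, threshold of 20 exited jobs per minute, 10 minute duration.
samples = [(0, 5), (1, 25), (6, 30), (11, 28)]
print(detect_host_exception(samples, threshold=20, duration=10))
```

In this model a short dip below the threshold resets the clock, which shows why a shorter duration is more prone to false alarms and a longer one is slower to react.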
Date Modified: January 12, 2004
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.