


Adding Resources




About Configured Resources

LSF schedules jobs based on available resources. There are many resources built into LSF, but you can also add your own resources, and then use them the same way as built-in resources.

For maximum flexibility, you should characterize your resources clearly enough so that users have satisfactory choices. For example, if some of your machines are connected to both Ethernet and FDDI, while others are only connected to Ethernet, then you probably want to define a resource called fddi and associate the fddi resource with machines connected to FDDI. This way, users can specify resource fddi if they want their jobs to run on machines connected to FDDI.



Adding New Resources to Your Cluster

To add host resources to your cluster, use the following steps:

  1. Log in to any host in the cluster as the LSF administrator.
  2. Define new resources in the Resource section of lsf.shared. Specify at least a name and a brief description, which will be displayed to a user by lsinfo.

    See Configuring lsf.shared Resource Section.

  3. For static Boolean resources, for all hosts that have the new resources, add the resource name to the RESOURCES column in the Host section of lsf.cluster.cluster_name.
  4. For shared resources, for all hosts that have the new resources, associate the resources with the hosts (you might also have a reason to configure non-shared resources in this section).

    See Configuring lsf.cluster.cluster_name ResourceMap Section.

  5. Reconfigure your cluster.



Configuring lsf.shared Resource Section

Configured resources are defined in the Resource section of lsf.shared. There is no distinction between shared and non-shared resources.

You must specify at least a name and description for the resource, using the keywords RESOURCENAME and DESCRIPTION.

You can also specify optional attributes, such as the TYPE, INTERVAL, and INCREASING columns used in the example below.

When the optional attributes are not specified, the resource is treated as static and Boolean.

Example


Begin Resource
RESOURCENAME TYPE  INTERVAL INCREASING DESCRIPTION
mips         Boolean  ()    ()        (MIPS architecture)
dec          Boolean  ()    ()        (DECStation system)
scratch      Numeric  30    N         (Shared scratch space on server)
synopsys     Numeric  30    N         (Floating licenses for Synopsys)
verilog      Numeric  30    N         (Floating licenses for Verilog)
console      String   30    N         (User Logged in on console)
End Resource



Configuring lsf.cluster.cluster_name ResourceMap Section

Resources are associated with the hosts for which they are defined in the ResourceMap section of lsf.cluster.cluster_name.

For each resource, you must specify the name and the hosts that have it.

If the ResourceMap section is not defined, then any dynamic resources specified in lsf.shared are not tied to specific hosts, but are shared across all hosts in the cluster.

Example

A cluster consists of hosts host1, host2, and host3.

Begin ResourceMap
RESOURCENAME   LOCATION
verilog        (5@[all ~host1 ~host2])
synopsys       (2@[host1 host2] 2@[others])
console        (1@[host1] 1@[host2] 1@[host3])
xyz            (1@[default])
End ResourceMap

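In this example:

  • 5 units of the verilog resource are defined on host3 only (all hosts except host1 and host2).
  • 2 units of the synopsys resource are shared between host1 and host2, and 2 more units are defined on host3 (the only remaining host, matched by others).
  • 1 unit of the console resource is defined on each host in the cluster (assigned explicitly).
  • 1 unit of the xyz resource is defined on each host in the cluster (assigned with the keyword default).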

RESOURCENAME

The name of the resource, as defined in lsf.shared.

LOCATION

Defines the hosts that share the resource. For a static resource, you must define an initial value here as well. Do not define a value for a dynamic resource.

Possible states of a resource:

Syntax

([resource_value@][host_name... | all [~host_name]... | others | default] ...)

Non-batch configuration

The following items should be taken into consideration when configuring resources under LSF Base.



Static Shared Resource Reservation

You must use resource reservation to prevent over-committing static shared resources when scheduling.

The usual situation is that you configure single-user application licenses as static shared resources, and make that resource one of the job requirements. You should also reserve the resource for the duration of the job. Otherwise, LSF updates resource information, assumes that all the static shared resources can be used, and places another job that requires that license. The additional job cannot actually run if the license is already taken by a running job.

If every job that requests a license also reserves it, LSF updates the number of licenses at the start of each new dispatch turn, subtracts the number of licenses that are reserved, and only dispatches additional jobs if there are licenses available that are not already in use.

Reserving a static shared resource

To indicate that a shared resource is to be reserved while a job is running, specify the resource name in the rusage section of the resource requirement string.

Example

You configured licenses for the Verilog application as a resource called verilog_lic. To submit a job that will run on a host when there is a license available:

% bsub -R "select[defined(verilog_lic)] rusage[verilog_lic=1]" myjob

If the job can be placed, the license it uses will be reserved until the job completes.
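
When many jobs reserve the same kind of resource, it can help to build the resource requirement string in one place. The following is a minimal sketch only: reserve_req is a hypothetical helper, not an LSF command, and it prints the bsub command line instead of submitting a job.

```shell
#!/bin/sh
# reserve_req RESOURCE [AMOUNT]
# Prints a resource requirement string that selects hosts where RESOURCE
# is defined and reserves AMOUNT units (default 1) in the rusage section.
# reserve_req is a hypothetical helper name, not an LSF command.
reserve_req() {
    printf 'select[defined(%s)] rusage[%s=%s]\n' "$1" "$1" "${2:-1}"
}

# Show the bsub command line from the example without submitting anything.
echo "bsub -R \"$(reserve_req verilog_lic 1)\" myjob"
```

The last line prints the same command that appears in the example above.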



External Load Indices and ELIM

The LSF Load Information Manager (LIM) collects built-in load indices that reflect the load situations of CPU, memory, disk space, I/O, and interactive activities on individual hosts.

While built-in load indices might be sufficient for most jobs, you might have special workload or resource dependencies that require custom external load indices defined and configured by the LSF administrator. Load and shared resource information from external load indices is used in the same way as built-in load indices for job scheduling and host selection.

You can write an External Load Information Manager (ELIM) program that collects the values of configured external load indices and updates LIM when new values are received.

An ELIM can be as simple as a small script, or as complicated as a sophisticated C program. A well-defined protocol allows the ELIM to talk to LIM.

The ELIM executable must be located in LSF_SERVERDIR.

How LSF supports multiple ELIMs

To increase LIM reliability, LSF Version 6.0 supports the configuration of multiple ELIM executables.

Master ELIM (melim)

A master ELIM (melim) is installed in LSF_SERVERDIR.

melim manages multiple site-defined sub-ELIMs (SELIMs) and reports external load information to LIM. melim does the following:

ELIM failure

Multiple SELIMs managed by a master ELIM increase reliability by protecting LIM:

Error logging

melim logs its own activities and data into the log file LSF_LOGDIR/melim.log.host_name.

Configuring your application-specific SELIM

The master ELIM is installed as LSF_SERVERDIR/melim. After installation:

  1. Define the external resources you need.
  2. Write your application-specific SELIM to track these resources, as described in Writing an ELIM.
  3. Put your ELIM in LSF_SERVERDIR.

Naming your ELIM

Use the following naming conventions:

Existing ELIMs


Your existing ELIMs do not need to follow this convention and do not need to be renamed. However, since melim invokes any ELIM that follows this convention, you should move any backup copies of your ELIM out of LSF_SERVERDIR or choose a name that does not follow the convention (for example, use elim_bak instead of elim.bak).
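
Before relying on melim, you may want to see which files in LSF_SERVERDIR it could pick up. The sketch below assumes, based on the elim.bak and elim_bak example above, that any file named elim.* counts as following the convention. list_elim_candidates is a hypothetical helper, not an LSF command.

```shell
#!/bin/sh
# list_elim_candidates DIR
# Prints the names of regular files in DIR that match elim.*, the pattern
# this sketch assumes melim treats as an ELIM name.
# list_elim_candidates is a hypothetical helper name, not an LSF command.
list_elim_candidates() {
    for f in "$1"/elim.*
    do
        # Skip the unexpanded pattern (no match) and anything that is not a file.
        [ -f "$f" ] || continue
        basename "$f"
    done
}

# Check LSF_SERVERDIR when it is set; otherwise print nothing.
if [ -n "${LSF_SERVERDIR:-}" ]; then
    list_elim_candidates "$LSF_SERVERDIR"
fi
```

A backup copy named elim.bak would appear in the output, while elim_bak would not.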

elim.user is reserved

The name elim.user is reserved for backward compatibility. Do not use the name elim.user for your application-specific elim.

How LSF uses ELIM for external resource collection

The values of static external resources are specified through the lsf.cluster.cluster_name configuration file. The values of all dynamic resources, regardless of whether they are shared or host-based, are collected through an ELIM.

When an ELIM is started

An ELIM is started in the following situations:

There is only one ELIM on each host, regardless of the number of resources on which it reports. If only cluster-wide resources are to be collected, then an ELIM will only be started on the master host.

Environment variables

When LIM starts, the following environment variables are set for ELIM:

Writing an ELIM

The ELIM must be an executable program, either an interpreted script or compiled code.

ELIM output

The ELIM communicates with the LIM by periodically writing a load update string to its standard output. The load update string contains the number of indices followed by a list of name-value pairs in the following format:

number_indices [index_name index_value]...

For example,

3 tmp2 47.5 nio 344.0 licenses 5

This string reports three indices: tmp2, nio, and licenses, with values 47.5, 344.0, and 5 respectively. Index values must be numbers between -INFINIT_LOAD and INFINIT_LOAD as defined in the lsf.h header file.
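
As a minimal sketch of the format above, the following shell function assembles a load update string from name-value pairs and derives number_indices from the argument count, so the count cannot drift from the list. build_report is a hypothetical helper name, not part of LSF.

```shell
#!/bin/sh
# build_report NAME VALUE [NAME VALUE]...
# Prints one load update string: number_indices followed by the pairs.
# build_report is a hypothetical helper name, not an LSF command.
build_report() {
    # An odd argument count means a name is missing its value.
    if [ $(( $# % 2 )) -ne 0 ]; then
        echo "build_report: expected name-value pairs" >&2
        return 1
    fi
    report=$(( $# / 2 ))
    while [ $# -gt 0 ]
    do
        report="$report $1 $2"
        shift 2
    done
    echo "$report"
}

# Reproduce the example string from the text.
build_report tmp2 47.5 nio 344.0 licenses 5
```

The final call prints the same string as the example: 3 tmp2 47.5 nio 344.0 licenses 5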

If the ELIM is implemented as a C program, as part of initialization it should use setbuf(3) to establish unbuffered output to stdout.

The ELIM should ensure that the entire load update string is written successfully to stdout. This can be done by checking the return value of printf(3s) if the ELIM is implemented as a C program, or the return code of /bin/echo(1) if it is a shell script. The ELIM should exit if it fails to write the load information.
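
For a shell ELIM, the write check can be sketched as follows. This example tests the return code of the shell's printf rather than /bin/echo, which is the same idea applied to a different command. emit_report is a hypothetical name.

```shell
#!/bin/sh
# emit_report STRING
# Writes one load update string to stdout and exits the ELIM if the write
# fails. emit_report is a hypothetical helper name, not an LSF command.
emit_report() {
    if ! printf '%s\n' "$1"; then
        echo "elim: failed to write load update string" >&2
        exit 1
    fi
}

emit_report "1 myrsc 2"
```

Because the function calls exit and not return, a failed write ends the whole script, which matches the advice above.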

Each LIM sends updated load information to the master every 15 seconds. Depending on how quickly your external load indices change, the ELIM should write the load update string at most once every 15 seconds. If the external load indices rarely change, the ELIM can write the new values only when a change is detected. The LIM continues to use the old values until new values are received.
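
If your indices rarely change, one way to report only on change is to remember the last string written. The following is a sketch under that assumption. report_if_changed and last_report are hypothetical names.

```shell
#!/bin/sh
# report_if_changed STRING
# Prints STRING only when it differs from the last string printed.
# report_if_changed and last_report are hypothetical names.
last_report=
report_if_changed() {
    if [ "$1" != "$last_report" ]; then
        echo "$1" || exit 1
        last_report=$1
    fi
}

report_if_changed "1 myrsc 2"   # printed: first value
report_if_changed "1 myrsc 2"   # suppressed: value unchanged
report_if_changed "1 myrsc 3"   # printed: value changed
```

In a real ELIM, you would call report_if_changed from the reporting loop, with a sleep of at least 15 seconds between calls.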

ELIM location

The executable for the ELIM must be in LSF_SERVERDIR.

Use the following naming conventions:

If LIM expects some resources to be collected by an ELIM according to configuration, it invokes the ELIM automatically on startup. The ELIM runs with the same user ID and file access permission as the LIM.

ELIM restart

The LIM restarts the ELIM if it exits; to prevent problems in case of a fatal error in the ELIM, it is restarted at most once every 90 seconds. When the LIM terminates, it sends a SIGTERM signal to the ELIM. The ELIM must exit upon receiving this signal.
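
A shell ELIM that sleeps between reports can be written to exit promptly on SIGTERM. The sketch below is one possible approach, not the method used by LSF: it sleeps in the background and waits on that sleep, so the TERM trap can run without waiting for the interval to elapse. run_elim and sleep_pid are hypothetical names, and myrsc with the value 2 is the sample resource used in the example that follows.

```shell
#!/bin/sh
# run_elim [INTERVAL]
# Reports the sample resource every INTERVAL seconds (default 30) and
# exits cleanly when it receives SIGTERM.
# run_elim and sleep_pid are hypothetical names.
run_elim() {
    interval=${1:-30}
    sleep_pid=
    # On SIGTERM, stop the pending sleep (if any) and exit.
    trap 'kill "$sleep_pid" 2>/dev/null; exit 0' TERM
    while :
    do
        echo "1 myrsc 2" || exit 1
        # Sleep in the background so that wait can be interrupted by the trap.
        sleep "$interval" &
        sleep_pid=$!
        wait "$sleep_pid"
    done
}
```

To use the sketch as an ELIM, end the script with a call such as run_elim 30.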

Example

  1. Write an ELIM.

    The following sample ELIM (LSF_SERVERDIR/elim.myrsc) sets the value of the myrsc resource to 2. In a real ELIM, you would run a command to retrieve the value you want to report, and set the value from its output.

    #!/bin/sh
    while :
    do
       # set the value for resource "myrsc"
       val=2
    
       # create an output string in the format:
       # number_indices index1_name index1_value...
    
       reportStr="1 myrsc $val"
       echo "$reportStr"
    
       # wait for 30 seconds before reporting again
       sleep 30
    done
    
  2. Test this ELIM by running it from the command line.
    % ./elim.myrsc
    

    It should give you the output:

    1 myrsc 2
    
  3. Copy the ELIM to LSF_SERVERDIR and make sure it has the name elim.myrsc.
  4. Define the myrsc resource in lsf.shared.

    In this case, we are defining the resource as Numeric because we want it to accept numbers. The value does not increase with load.

    Begin Resource
    RESOURCENAME   TYPE   INTERVAL INCREASING DESCRIPTION
    myrsc         Numeric  30        N     (custom resource to trigger elim to start up)
    End Resource
    
  5. Map the myrsc resource to hosts in lsf.cluster.cluster_name. In this case, we want this resource to reside only on hostA.
    Begin ResourceMap
    RESOURCENAME         LOCATION
    myrsc               [hostA]
    End ResourceMap
    
  6. Reconfigure LSF with the commands:
    • lsadmin reconfig
    • badmin mbdrestart
  7. Display the resource with the command lsload -l. You should be able to see the new resource and value:
    HOST_NAME    status  r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem  myrsc
    hostA         ok      0.4   0.4  0.4   0%  0.0  0   22  0   24M  26M  6M   2
    

Additional examples

Example code for an ELIM is included in the LSF_MISC/examples directory. The elim.c file is an ELIM written in C. You can modify this example to collect the load indices you want.

Debugging an ELIM

Set the parameter LSF_ELIM_DEBUG=y in the Parameters section of lsf.cluster.cluster_name to log all load information received by LIM from the ELIM in the LIM log file.

Set the parameter LSF_ELIM_BLOCKTIME=seconds in the Parameters section of lsf.cluster.cluster_name to configure how long LIM waits before restarting the ELIM.

Use the parameter LSF_ELIM_RESTARTS=integer in the Parameters section of lsf.cluster.cluster_name to limit the number of times an ELIM can be restarted.

See the Platform LSF Reference for more details on these parameters.



Modifying a Built-In Load Index

The ELIM can return values for the built-in load indices. In this case the value produced by the ELIM overrides the value produced by the LIM.

Considerations

Steps

For example, some sites prefer to use /usr/tmp for temporary files.

To override the tmp load index:

  1. Write a program that periodically measures the space in the /usr/tmp file system and writes the value to standard output. For details on format, see Writing an ELIM.

    For example, the program writes to its standard output:

    1 tmp 47.5
    
  2. Name the program elim and store it in the LSF_SERVERDIR directory.

    All default load indices are local resources, so the elim must run locally on every machine.

  3. Define the resource.

    Since the names of built-in load indices are not allowed in lsf.shared, define a custom resource to trigger the elim.

    For example:

    Begin Resource
    RESOURCENAME   TYPE   INTERVAL INCREASING DESCRIPTION
    my_tmp         Numeric  30        N     (custom resource to trigger elim to start up)
    End Resource
    
  4. Map the resource to hosts in lsf.cluster.cluster_name.
    • To override the tmp load index on every host, specify the keyword default:
      Begin ResourceMap
      RESOURCENAME         LOCATION
      my_tmp               [default]
      End ResourceMap
      
    • To override the tmp load index only on specific hosts, specify the host names:
      Begin ResourceMap
      RESOURCENAME         LOCATION
      my_tmp               ([host1][host2][host3])
      End ResourceMap
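
The program in step 1 could be sketched in shell as follows. This is an illustration under stated assumptions, not a supplied LSF script: it assumes that the tmp index is expressed in MB and that df supports the POSIX options -P and -k. free_mb, tmp_report, and run_tmp_elim are hypothetical names.

```shell
#!/bin/sh
# free_mb DIR
# Prints the space available in the file system that holds DIR, in MB,
# using the POSIX output format of df (-P) with 1024-byte blocks (-k).
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%.1f\n", $4 / 1024 }'
}

# tmp_report DIR
# Prints a load update string that overrides the tmp load index.
# free_mb, tmp_report, and run_tmp_elim are hypothetical names.
tmp_report() {
    echo "1 tmp $(free_mb "$1")"
}

# run_tmp_elim DIR
# Reports every 30 seconds and exits if a write fails.
run_tmp_elim() {
    while :
    do
        tmp_report "$1" || exit 1
        sleep 30
    done
}
```

To install the sketch as the elim from step 2, end the script with the line run_tmp_elim /usr/tmp.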
      





      Date Modified: January 12, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.