



Working with Your Cluster




Viewing Cluster Information

LSF provides commands for users to get information about the cluster. Cluster information includes the cluster master host, cluster name, cluster resource definitions, cluster administrator, and so on.

To view the ...           Run ...
Version of LSF            lsid
Cluster name              lsid
Current master host       lsid
Cluster administrators    lsclusters
Configuration parameters  bparams

Viewing LSF version, cluster name, and current master host

Use the lsid command to display the version of LSF, the name of your cluster, and the current master host:

% lsid
Platform LSF 6.0, Oct 31 2003
Copyright 1992-2004 Platform Computing Corporation

My cluster name is cluster1
My master name is hostA
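
If you need these values in a script, the lsid output can be picked apart with a few lines of Python. The following is a minimal sketch, assuming the output has exactly the layout shown above; the function name parse_lsid and the embedded sample text are illustrative and not part of LSF.

```python
import re

# Sample text copied from the lsid output shown above.
SAMPLE_LSID_OUTPUT = """\
Platform LSF 6.0, Oct 31 2003
Copyright 1992-2004 Platform Computing Corporation

My cluster name is cluster1
My master name is hostA
"""


def parse_lsid(text):
    """Return the version banner, cluster name, and master host as a dict."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    info = {"version": lines[0] if lines else None, "cluster": None, "master": None}
    for line in lines:
        # Matches "My cluster name is X" and "My master name is Y".
        match = re.match(r"My (cluster|master) name is (\S+)$", line)
        if match:
            info[match.group(1)] = match.group(2)
    return info


print(parse_lsid(SAMPLE_LSID_OUTPUT))
```

In practice you would pass the captured output of lsid to parse_lsid instead of the sample string.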

Viewing cluster administrators

Use the lsclusters command to find out who your cluster administrator is and see a summary of your cluster:

% lsclusters
CLUSTER_NAME   STATUS   MASTER_HOST    ADMIN        HOSTS     SERVERS
cluster1       ok       hostA          lsfadmin     6         6

If you are using the LSF MultiCluster product, you will see one line for each of the clusters that your local cluster is connected to in the output of lsclusters.
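
Because lsclusters prints one row per cluster under a fixed header, the output maps naturally onto a list of records. The sketch below assumes that no field contains whitespace, which holds for the sample shown above but is not guaranteed by anything in this section; parse_lsclusters is a hypothetical helper name.

```python
# Sample text modeled on the lsclusters output shown above.
SAMPLE_LSCLUSTERS_OUTPUT = """\
CLUSTER_NAME   STATUS   MASTER_HOST    ADMIN        HOSTS     SERVERS
cluster1       ok       hostA          lsfadmin     6         6
"""


def parse_lsclusters(text):
    """Return one dict per cluster, keyed by the column headers."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in records]


for cluster in parse_lsclusters(SAMPLE_LSCLUSTERS_OUTPUT):
    print(cluster["CLUSTER_NAME"], cluster["ADMIN"], cluster["MASTER_HOST"])
```

With LSF MultiCluster, the same loop would visit one record for each connected cluster.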

Viewing configuration parameters

Use the bparams command to display the generic configuration parameters of LSF. These include the default queues, the default host or host model for CPU speed scaling, and the job dispatch, job checking, and job accepting intervals.

% bparams
Default Queues:  normal idle
Default Host Specification:  DECAXP
Job Dispatch Interval:  20 seconds
Job Checking Interval:  15 seconds
Job Accepting Interval:  20 seconds

Use the -l option of bparams to display the information in long format, which gives a brief description of each parameter as well as the name of the parameter as it appears in lsb.params.

% bparams -l

System default queues for automatic queue selection:
    DEFAULT_QUEUE = normal idle

The interval for dispatching jobs by master batch daemon:
    MBD_SLEEP_TIME = 20 (seconds)

The interval for checking jobs by slave batch daemon:
    SBD_SLEEP_TIME = 15 (seconds)

The interval for a host to accept two batch jobs subsequently:
    JOB_ACCEPT_INTERVAL = 1 (* MBD_SLEEP_TIME)

The idle time of a host for resuming pg suspended jobs:
    PG_SUSP_IT = 180 (seconds)

The amount of time during which finished jobs are kept in core:
    CLEAN_PERIOD = 3600 (seconds)

The maximum number of finished jobs that are logged in current event file:
    MAX_JOB_NUM = 2000

The maximum number of retries for reaching a slave batch daemon:
    MAX_SBD_FAIL = 3

The number of hours of resource consumption history:
    HIST_HOURS = 5

The default project assigned to jobs.
    DEFAULT_PROJECT = default
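
The long format is regular enough to read mechanically: each parameter appears on an indented NAME = value line. The sketch below, which assumes that layout and uses hypothetical helper names, collects the parameters and shows how JOB_ACCEPT_INTERVAL, expressed as a multiple of MBD_SLEEP_TIME, corresponds to the interval in seconds that the short format reports.

```python
import re

# Excerpt copied from the bparams -l output shown above.
SAMPLE_BPARAMS_LONG = """\
System default queues for automatic queue selection:
    DEFAULT_QUEUE = normal idle

The interval for dispatching jobs by master batch daemon:
    MBD_SLEEP_TIME = 20 (seconds)

The interval for a host to accept two batch jobs subsequently:
    JOB_ACCEPT_INTERVAL = 1 (* MBD_SLEEP_TIME)
"""


def parse_bparams_long(text):
    """Collect indented 'NAME = value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        match = re.match(r"\s+([A-Z][A-Z0-9_]*)\s*=\s*(.*)$", line)
        if match:
            params[match.group(1)] = match.group(2).strip()
    return params


def job_accept_seconds(params):
    """JOB_ACCEPT_INTERVAL is a multiple of MBD_SLEEP_TIME; return seconds."""
    sleep_time = int(params["MBD_SLEEP_TIME"].split()[0])
    factor = int(params["JOB_ACCEPT_INTERVAL"].split()[0])
    return factor * sleep_time


params = parse_bparams_long(SAMPLE_BPARAMS_LONG)
print(params["DEFAULT_QUEUE"].split(), job_accept_seconds(params))
```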



Default Directory Structures

UNIX

The following diagram shows a typical directory structure for a new UNIX installation. Depending on which products you have installed and platforms you have selected, your directory structure may vary.

Pre-4.2 UNIX installation directory structure

The following diagram shows a cluster installed with lsfsetup. It uses the pre-4.2 directory structure.

Windows

The following diagram shows the directory structure for a default Windows installation.



Cluster Administrators

Primary cluster administrator

Required. The first cluster administrator, specified during installation. The primary LSF administrator account owns the configuration and log files. The primary LSF administrator has permission to perform clusterwide operations, change configuration files, reconfigure the cluster, and control jobs submitted by all users.

Cluster administrators

Optional. May be configured during or after installation.

Cluster administrators can perform administrative operations on all jobs and queues in the cluster. Cluster administrators have the same cluster-wide operational privileges as the primary LSF administrator except that they do not have permission to change LSF configuration files.

Adding cluster administrators

  1. In the ClusterAdmins section of lsf.cluster.cluster_name, specify the list of cluster administrators following ADMINISTRATORS, separated by spaces. The first administrator in the list is the primary LSF administrator. All others are cluster administrators. You can specify user names and group names. For example:
    Begin ClusterAdmins
    ADMINISTRATORS = lsfadmin admin1 admin2
    End ClusterAdmins
    
  2. Save your changes.
  3. Run lsadmin reconfig to reconfigure LIM.
  4. Run badmin mbdrestart to restart mbatchd.
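
To check which account a given ClusterAdmins section makes the primary administrator, you can read the ADMINISTRATORS line the same way the steps above describe it: first name is the primary administrator, the rest are cluster administrators. This is a minimal sketch with a hypothetical function name; it assumes the section looks like the example above and does not handle comments or continuation lines.

```python
# Sample text copied from the ClusterAdmins example shown above.
SAMPLE_CLUSTER_FILE = """\
Begin ClusterAdmins
ADMINISTRATORS = lsfadmin admin1 admin2
End ClusterAdmins
"""


def cluster_admins(text):
    """Return (primary_admin, other_admins) from a ClusterAdmins section."""
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Begin ClusterAdmins":
            in_section = True
        elif line == "End ClusterAdmins":
            in_section = False
        elif in_section and line.startswith("ADMINISTRATORS"):
            _, _, value = line.partition("=")
            names = value.split()
            if names:
                # The first name in the list is the primary LSF administrator.
                return names[0], names[1:]
    return None, []


primary, others = cluster_admins(SAMPLE_CLUSTER_FILE)
print("primary:", primary, "others:", others)
```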



Controlling Daemons

Prerequisites

To control all daemons in the cluster, you must:

Daemon commands

The following is an overview of commands you use to control LSF daemons.
Daemon           Action                  Command                                    Permissions
All in cluster   Start                   lsfstartup                                 Must be root or a user listed in lsf.sudoers for all these commands
                 Shut down               lsfshutdown
sbatchd          Start                   badmin hstartup [host_name ...|all]        Must be root or a user listed in lsf.sudoers for the startup command
                 Restart                 badmin hrestart [host_name ...|all]        Must be root or the LSF administrator for other commands
                 Shut down               badmin hshutdown [host_name ...|all]
mbatchd, mbschd  Restart                 badmin mbdrestart                          Must be root or the LSF administrator for these commands
                 Shut down               badmin hshutdown, then badmin mbdrestart
                 Reconfigure             badmin reconfig
RES              Start                   lsadmin resstartup [host_name ...|all]     Must be root or a user listed in lsf.sudoers for the startup command
                 Shut down               lsadmin resshutdown [host_name ...|all]    Must be the LSF administrator for other commands
                 Restart                 lsadmin resrestart [host_name ...|all]
LIM              Start                   lsadmin limstartup [host_name ...|all]     Must be root or a user listed in lsf.sudoers for the startup command
                 Shut down               lsadmin limshutdown [host_name ...|all]    Must be the LSF administrator for other commands
                 Restart                 lsadmin limrestart [host_name ...|all]
                 Restart all in cluster  lsadmin reconfig

sbatchd

Restarting sbatchd on a host does not affect jobs that are running on that host.

If sbatchd is shut down, the host is not available to run new jobs. Existing jobs running on that host continue, but the results are not sent to the user until sbatchd is restarted.

LIM and RES

Jobs running on the host are not affected by restarting the daemons.

If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case you must kill and restart the daemon manually.

If the LIM on the current master host is shut down, another host automatically takes over as master.

If the RES is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted.



Controlling mbatchd

When you reconfigure the cluster with the command badmin reconfig, mbatchd is not restarted. Only configuration files are reloaded.

If you add a host to a host group, or a host to a queue, the new host is not recognized by jobs that were submitted before you reconfigured. If you want the new host to be recognized, you must restart mbatchd.

Restarting mbatchd

Run badmin mbdrestart. LSF checks configuration files for errors and prints the results to stderr. If no errors are found, the following occurs:

Whenever mbatchd is restarted, it is unavailable to service requests. In large clusters where there are many events in lsb.events, restarting mbatchd can take some time. To avoid replaying events in lsb.events, use the command badmin reconfig.

Logging a comment when restarting mbatchd

Use the -C option of badmin mbdrestart to log an administrator comment in lsb.events. For example:

% badmin mbdrestart -C "Configuration change"

The comment text Configuration change is recorded in lsb.events.

Use badmin hist or badmin mbdhist to display administrator comments for mbatchd restart.

Shutting down mbatchd

  1. Run badmin hshutdown to shut down sbatchd on the master host. For example:
    % badmin hshutdown hostD
    Shut down slave batch daemon on <hostD> .... done
    
  2. Run badmin mbdrestart:
    % badmin mbdrestart
    Checking configuration files ...
    No errors found.
    

    This causes mbatchd and mbschd to exit. mbatchd cannot be restarted, because sbatchd is shut down. All LSF services are temporarily unavailable, but existing jobs are not affected. When mbatchd is later started by sbatchd, its previous status is restored from the event log file and job scheduling continues.
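
The order of the two steps matters: sbatchd on the master host must be down before badmin mbdrestart is run, otherwise mbatchd simply restarts. As an illustration only, the sketch below builds the two commands in that order for a given master host and prints them instead of running them; the function name is hypothetical, and a real wrapper would also have to answer any confirmation prompt that badmin issues.

```python
def mbatchd_shutdown_commands(master_host):
    """Return the shutdown commands, in order, as argument lists."""
    return [
        # Step 1: shut down sbatchd on the master host.
        ["badmin", "hshutdown", master_host],
        # Step 2: mbatchd and mbschd exit and cannot be restarted by sbatchd.
        ["badmin", "mbdrestart"],
    ]


for argv in mbatchd_shutdown_commands("hostD"):
    print(" ".join(argv))
```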



Reconfiguring Your Cluster

After changing LSF configuration files, you must tell LSF to reread the files to update the configuration. The commands you can use to reconfigure a cluster are lsadmin reconfig, badmin reconfig, and badmin mbdrestart.

The reconfiguration commands you use depend on which files you change in LSF. The following table is a quick reference.

After making changes to ...  Use ...                                 Which ...
hosts                        badmin reconfig                         reloads configuration files
lsb.hosts                    badmin reconfig                         reloads configuration files
lsb.modules                  badmin reconfig                         reloads configuration files
lsb.nqsmaps                  badmin reconfig                         reloads configuration files
lsb.params                   badmin reconfig                         reloads configuration files
lsb.queues                   badmin reconfig                         reloads configuration files
lsb.resources                badmin reconfig                         reloads configuration files
lsb.users                    badmin reconfig                         reloads configuration files
lsf.cluster.cluster_name     lsadmin reconfig AND badmin mbdrestart  reconfigures LIM, reloads configuration files, and restarts mbatchd
lsf.conf                     lsadmin reconfig AND badmin mbdrestart  reconfigures LIM, reloads configuration files, and restarts mbatchd
lsf.shared                   lsadmin reconfig AND badmin mbdrestart  reconfigures LIM, reloads configuration files, and restarts mbatchd
lsf.sudoers                  badmin reconfig                         reloads configuration files
lsf.task                     lsadmin reconfig AND badmin reconfig    reconfigures LIM and reloads configuration files
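
When several files change at once, the quick reference can be applied file by file and the resulting commands merged. The sketch below encodes the table as a Python dictionary and returns the distinct commands in a fixed order (LIM first, then mbatchd). The helper name and the ordering are assumptions made for illustration; the sketch does not try to decide whether badmin mbdrestart makes a separate badmin reconfig unnecessary.

```python
RELOAD = ("badmin reconfig",)
LIM_AND_RESTART = ("lsadmin reconfig", "badmin mbdrestart")
LIM_AND_RELOAD = ("lsadmin reconfig", "badmin reconfig")

# One entry per row of the quick reference table.
RECONFIG_TABLE = {
    "hosts": RELOAD,
    "lsb.hosts": RELOAD,
    "lsb.modules": RELOAD,
    "lsb.nqsmaps": RELOAD,
    "lsb.params": RELOAD,
    "lsb.queues": RELOAD,
    "lsb.resources": RELOAD,
    "lsb.users": RELOAD,
    "lsf.cluster": LIM_AND_RESTART,  # stands for lsf.cluster.cluster_name
    "lsf.conf": LIM_AND_RESTART,
    "lsf.shared": LIM_AND_RESTART,
    "lsf.sudoers": RELOAD,
    "lsf.task": LIM_AND_RELOAD,
}

# Assumed ordering: reconfigure LIM before touching mbatchd.
COMMAND_ORDER = ("lsadmin reconfig", "badmin reconfig", "badmin mbdrestart")


def commands_for(changed_files):
    """Return the distinct commands needed for the changed files, in order."""
    needed = set()
    for name in changed_files:
        key = "lsf.cluster" if name.startswith("lsf.cluster.") else name
        needed.update(RECONFIG_TABLE[key])  # KeyError for files not in the table
    return [command for command in COMMAND_ORDER if command in needed]


print(commands_for(["lsb.queues", "lsb.users"]))
print(commands_for(["lsf.cluster.cluster1"]))
```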

Reconfiguring the cluster with lsadmin and badmin

  1. Log on to the host as root or the LSF administrator.
  2. Run lsadmin reconfig to reconfigure LIM:
    % lsadmin reconfig
    Checking configuration files ...
    No errors found.
    Do you really want to restart LIMs on all hosts? [y/n] y
    Restart LIM on <hosta> ...... done
    Restart LIM on <hostc> ...... done
    Restart LIM on <hostd> ...... done
    

    The lsadmin reconfig command checks for configuration errors.

    If no errors are found, you are asked to confirm that you want to restart LIM on all hosts. After you confirm, LIM is reconfigured. If fatal errors are found, reconfiguration is aborted.

  3. Run badmin reconfig to reconfigure mbatchd:
    % badmin reconfig
    Checking configuration files ...
    No errors found.
    Do you want to reconfigure? [y/n] y
    Reconfiguration initiated
    

    The badmin reconfig command checks for configuration errors.

    If no fatal errors are found, you are asked to confirm reconfiguration. If fatal errors are found, reconfiguration is aborted.

Reconfiguring the cluster by restarting mbatchd

Run badmin mbdrestart to restart mbatchd:

% badmin mbdrestart
Checking configuration files ...
No errors found.
Do you want to restart? [y/n] y
MBD restart initiated

The badmin mbdrestart command checks for configuration errors.

If no fatal errors are found, you are asked to confirm mbatchd restart. If fatal errors are found, the command exits without taking any action.


If the lsb.events file is large, or many jobs are running, restarting mbatchd can take some time. In addition, mbatchd is not available to service requests while it is restarted.

Viewing configuration errors

You can view configuration errors by using the following commands:

This reports all errors to your terminal.

How reconfiguring the cluster affects licenses

If the license server goes down, LSF can continue to operate for a period of time until it attempts to renew licenses.

Reconfiguring causes LSF to renew licenses. If no license server is available, LSF will not reconfigure the system because the system would lose all its licenses and stop working.

If you have multiple license servers, reconfiguration will proceed as long as LSF can contact at least one license server. In this case, LSF will still lose the licenses on servers that are down, so LSF may have fewer licenses available after reconfiguration.
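
The rule in the preceding paragraphs can be written down as a small decision function. This is a toy model for illustration, not how LSF tracks licenses internally: the server names, the license counts, and the function name are all hypothetical.

```python
def reconfigure_licenses(servers):
    """Model the licensing rule for reconfiguration.

    servers maps a license server name to (reachable, license_count).
    Returns (proceeds, licenses_after). licenses_after is None when
    reconfiguration does not proceed, because nothing changes.
    """
    reachable = {name: count for name, (up, count) in servers.items() if up}
    if not reachable:
        # No license server can be contacted: LSF does not reconfigure.
        return False, None
    # Licenses held on servers that are down are lost after reconfiguration.
    return True, sum(reachable.values())


print(reconfigure_licenses({"serverA": (True, 10), "serverB": (False, 5)}))
print(reconfigure_licenses({"serverA": (False, 10), "serverB": (False, 5)}))
```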





      Date Modified: January 12, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.