- About This Guide
- What's New in the Platform LSF Version 6.0
- Upgrade and Compatibility Notes
- Learning About Platform Products
- Technical Support
About This Guide
Purpose of this guide
This guide describes how to manage and configure Platform LSF® software ("LSF"). In it, you will find information to do the following:
- Configure and maintain your cluster
- Configure and manage queues, hosts, and users
- Run jobs and control job execution
- Understand and work with resources
- Understand and configure scheduling policies
- Manage job scheduling and dispatch
Who should use this guide
This guide is intended for Platform LSF cluster administrators who need to implement business policies in LSF. Users who want more in-depth understanding of advanced details of LSF operation should also read this guide. Users who simply want to run and monitor their jobs should read Running Jobs with Platform LSF.
What you should already know
This guide assumes:
- You have knowledge of system administration tasks such as creating user accounts, sharing and mounting Network File System (NFS) partitions, and backing up the system
- You are familiar with basic LSF concepts and basic LSF operations
Typographical conventions
Command notation
What's New in the Platform LSF Version 6.0
Platform LSF Version 6.0 introduces the following new features:
- Policy management
- Job group support
- High Performance Computing
- Administration and diagnosis
- Run-time enhancements
Policy management
Goal-oriented SLA-driven scheduling policies help you configure your workload so that your jobs are completed on time and reduce the risk of missed deadlines:
- They enable you to focus on the "what and when" of your projects, not the low-level details of "how" resources need to be allocated to satisfy various workloads.
- They define a "just-in-time" service-level agreement between LSF administrators and LSF users.
You implement your SLA scheduling policies in service classes associated with your projects and users. Each service class defines how many jobs should be run to meet different kinds of goals:
- Deadline goals--A specified number of jobs should be completed within a specified time window. For example, run all jobs submitted over a weekend.
- Velocity goals--Expressed as concurrently running jobs. For example: maintain 10 running jobs between 9:00 a.m. and 5:00 p.m. Velocity goals are well suited for short jobs (run time less than one hour). Such jobs leave the system quickly, and configuring a velocity goal ensures a steady flow of jobs through the system.
- Throughput goals--Expressed as number of finished jobs per hour. For example: finish 15 jobs per hour between the hours of 6:00 p.m. and 7:00 a.m. Throughput goals are suitable for medium to long running jobs. These jobs stay longer in the system, so you typically want to control their rate of completion rather than their flow.
You use the bsla command to track the progress of your projects and see whether they are meeting the goals of your policy.
See Goal-Oriented SLA-Driven Scheduling for more information.
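The three goal types can be compared with a small model. The following Python sketch is only an illustration of the arithmetic behind each goal. It is not how LSF evaluates service classes, and the Snapshot fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical view of one service class at a single moment."""
    running_jobs: int          # jobs currently running
    finished_last_hour: int    # jobs that finished during the past hour
    remaining_jobs: int        # jobs that still have to finish
    hours_to_deadline: float   # time left in the deadline window


def velocity_met(s: Snapshot, target_running: int) -> bool:
    """Velocity goal: keep at least target_running jobs running."""
    return s.running_jobs >= target_running


def throughput_met(s: Snapshot, target_per_hour: int) -> bool:
    """Throughput goal: finish at least target_per_hour jobs each hour."""
    return s.finished_last_hour >= target_per_hour


def deadline_on_track(s: Snapshot) -> bool:
    """Deadline goal: the current completion rate must cover the remaining jobs."""
    if s.remaining_jobs == 0:
        return True
    if s.hours_to_deadline <= 0:
        return False
    required_rate = s.remaining_jobs / s.hours_to_deadline
    return s.finished_last_hour >= required_rate
```

For example, with 30 jobs remaining and 2 hours left, the deadline check requires a completion rate of at least 15 jobs per hour.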
Platform LSF License Scheduler ensures that higher priority work never has to wait for a license. Prioritized sharing of application licenses allows you to make policies that control the way software licenses are shared among different users in your organization.
You configure your software license distribution policy and LSF intelligently allocates licenses to improve quality of service to your end users while increasing throughput of high-priority work and reducing license costs.
It has the following features:
- Applies license distribution policies fairly among multiple projects cluster-wide
- Easily configurable distribution policies; instead of assigning equal share of licenses to everyone, you can give more licenses to larger or more important projects
- Guaranteed access to a minimum portion of licenses, no matter how heavily loaded the system is
- Controls the distribution of licenses among jobs and tasks it manages and still allows users to check out licenses directly
- Preempts lower priority jobs and releases their licenses to allow higher priority jobs to get the license and run.
- Provides visibility of license usage with the blusers command
See Using Platform LSF License Scheduler for installation and configuration instructions.
Platform LSF license-aware scheduling is available as separately installable add-on packages located in /license_scheduler/ on the Platform FTP site (ftp.platform.com/).
Configure hosts and queues so that LSF takes appropriate action automatically when it detects exceptional conditions while jobs are running. Customize what exceptions are detected, and their corresponding actions.
LSF detects:
- Job exceptions:
- Host exceptions:
- LSF detects "black hole" or "job-eating" hosts. LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure.
- A host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically jobs dispatched to such problem hosts exit abnormally.
See Working with Hosts for more information.
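The "black hole" host check can be pictured as a threshold that must stay exceeded for some time. The Python sketch below is a simplified model under stated assumptions: the sample format, the function name, and the "at least duration_minutes" rule are all hypothetical, and LSF's own detection logic may differ.

```python
def should_close_host(samples, exit_rate_threshold, duration_minutes):
    """Decide whether a host looks like a job-eating host.

    samples is a chronological list of (minute, exited_jobs_per_minute)
    pairs. The host qualifies when the exit rate has stayed above the
    threshold, without interruption, for at least duration_minutes up to
    the most recent sample.
    """
    exceeded_since = None
    for minute, rate in samples:
        if rate > exit_rate_threshold:
            if exceeded_since is None:
                exceeded_since = minute   # start of the current high streak
        else:
            exceeded_since = None         # streak broken, start over
    if exceeded_since is None:
        return False
    latest_minute = samples[-1][0]
    return latest_minute - exceeded_since >= duration_minutes
```

A single dip below the threshold resets the streak, so a host with one brief burst of failed jobs is not closed.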
Prevents starvation of low-priority work and ensures high-priority jobs get the resources they require by sharing resources among queues. Queue-based fairshare extends your existing user- and project-based fairshare policies by enabling flexible slot allocation per queue based on slot share units you configure.
See Fairshare Scheduling for more information.
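As a rough illustration of slot share units, the Python sketch below divides a pool of job slots among queues by percentage. It assumes shares are percentages of one pool and that fractional slots are rounded down. Both are assumptions made for this example, and the function name is hypothetical. It does not reproduce the LSF scheduler.

```python
def split_pool(pool_slots, slot_share):
    """Divide pool_slots among queues by percentage share.

    slot_share maps a queue name to its share of the pool in percent.
    Fractional slots are rounded down, so some slots may stay unassigned.
    """
    total_pct = sum(slot_share.values())
    if total_pct > 100:
        raise ValueError("shares exceed 100 percent of the pool")
    return {queue: pool_slots * pct // 100 for queue, pct in slot_share.items()}
```

With a 200-slot pool and shares of 50, 30, and 20 percent, the three queues would receive 100, 60, and 40 slots in this model.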
Improves control of user-based fairshare by taking queue priority into account for dispatching jobs from different queues against the same user fairshare policy. Within the queue, dispatch order is based on share quota.
See Fairshare Scheduling for more information.
Job group support
Use LSF job groups to organize and control a collection of individual jobs in higher level work units for easy management. A job group is a container for jobs in much the same way that a directory in a file system is a container for files. For example, you can organize jobs around groups that are meaningful to your business: a payroll application may have one group of jobs that calculates weekly payments, another job group for calculating monthly salaries, and a third job group that handles the salaries of part-time or contract employees.
Job groups increase end-user productivity by reducing complexity:
- Submit, view, and control jobs according to their groups rather than looking at individual jobs
- Create job group hierarchies
- Move jobs in and out of job groups as needed
- Kill, stop, resume, and send job control actions to entire job groups
- View job status by group
See Managing Jobs for more information.
High Performance Computing
Parallel jobs now have a flexible choice of the number of CPUs in the different kinds of hosts in a heterogeneous cluster.
Improves the performance and throughput of parallel jobs by setting multiple ptile values in a span string according to the CPU configuration of the host type or model.
You can specify various ptile values in the queue (RES_REQ in lsb.queues, or at job submission with bsub -R):
- Default ptile value, specified by n processors. For example:
span[ptile=4]
LSF allocates 4 processors on each available host, regardless of how many processors the host has.
- Predefined ptile value, specified by '!'. For example:
span[ptile='!']
LSF uses the predefined maximum job slot limit in lsb.hosts (MXJ per host type/model) as its value.
- Predefined ptile value with optional multiple ptile values, per host type or host model. For example:
span[ptile='!',HP:8,SGI:8,LINUX:2] same[type]
The job requests 8 processors on a host of type HP or SGI, 2 processors on a host of type LINUX, and the predefined maximum job slot limit in lsb.hosts (MXJ) for other host types.
See Specifying Resource Requirements for more information.
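The three ptile forms above follow one pattern, which the Python sketch below makes explicit. The helper names build_span and processors_for are hypothetical. The code only assembles the span string and mirrors the selection rule described above, and it does not call LSF.

```python
def build_span(default, per_type=None):
    """Build a span string from a default ptile and optional per-type values.

    default is either an integer or the string "!" (use the predefined
    MXJ limit). per_type maps a host type or model name to a processor count.
    """
    parts = ["'!'"] if default == "!" else [str(int(default))]
    for name, count in (per_type or {}).items():
        parts.append(f"{name}:{int(count)}")
    return "span[ptile=" + ",".join(parts) + "]"


def processors_for(host_type, default, per_type, mxj):
    """Return the processors requested on one host of the given type.

    An explicit per-type value wins. Otherwise "!" falls back to the
    host's MXJ limit, and an integer default is used as is.
    """
    if host_type in per_type:
        return per_type[host_type]
    return mxj if default == "!" else default


overrides = {"HP": 8, "SGI": 8, "LINUX": 2}
res_req = build_span("!", overrides) + " same[type]"
```

Here res_req holds the same string as the third example above, ready to be passed to bsub -R or set as RES_REQ in lsb.queues.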
You no longer need to specify a host list manually for your advance reservations. Specify a resource requirement string with the -R option of brsvadd instead of or in addition to a list of hosts. This makes advance reservation specification more flexible by reserving host slots based on your specific resource requirements. Only hosts that satisfy the resource requirement expression are reserved.
See Advance Reservation for more information.
Administration and diagnosis
Enables dynamic debugging of the LSF scheduler daemon (mbschd) without reconfiguring the cluster. Administrators no longer need to run badmin mbdrestart to debug the LSF scheduler:
badmin schddebug [-c class_name] [-l debug_level] [-f logfile_name] [-o]
badmin schdtime [-l timing_level] [-f logfile_name] [-o]
See Troubleshooting and Error Messages for more information.
Improves communication of LSF status to users. Users know the reason for the administrator actions, and administrators can easily communicate actions to users.
Administrators can attach a message to mbatchd restart, and host and queue operations:
- Use the -C option of badmin mbdrestart to log an administrator comment in lsb.events. For example,
% badmin mbdrestart -C "Configuration change"
The comment text Configuration change is recorded in lsb.events.
- Use the -C option of badmin hclose and badmin hopen to log an administrator comment in lsb.events. For example,
% badmin hclose -C "Weekly backup" hostB
The comment text Weekly backup is recorded in lsb.events. If you close or open a host group, each host group member displays the same comment string.
- Use the -C option of badmin queue commands qclose, qopen, qact, and qinact to log an administrator comment in lsb.events. For example,
% badmin qclose -C "change configuration" normal
The comment text change configuration is recorded in lsb.events.
To see administrator comments, users run badmin hist, badmin mbdhist, badmin hhist, or badmin qhist.
See Working with Your Cluster, Working with Hosts, and Working with Queues for more information.
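If you script these operations, the comment needs careful quoting because it usually contains spaces. The Python sketch below builds the command line for the subcommands named above without running it. The helper name badmin_with_comment is hypothetical, and the sketch uses only the -C option and subcommands listed in this section.

```python
import shlex

# Subcommands that this section says accept an administrator comment.
COMMENT_SUBCOMMANDS = {"mbdrestart", "hclose", "hopen",
                       "qclose", "qopen", "qact", "qinact"}


def badmin_with_comment(subcommand, comment, target=None):
    """Return the argument list for a badmin call with a -C comment.

    target is the optional host, host group, or queue name. Nothing is
    executed. Pass the list to subprocess.run to run it on an LSF host.
    """
    if subcommand not in COMMENT_SUBCOMMANDS:
        raise ValueError(f"{subcommand} is not listed as accepting -C")
    argv = ["badmin", subcommand, "-C", comment]
    if target:
        argv.append(target)
    return argv


# shlex.join quotes the comment so it can be pasted into a shell safely.
preview = shlex.join(badmin_with_comment("hclose", "Weekly backup", "hostB"))
```

Keeping the arguments as a list avoids shell quoting problems when the command is eventually executed.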
Understand cluster operations better, so that you can improve performance and troubleshoot configuration problems.
Platform LSF Reports provides a lightweight reporting package for single LSF clusters. It provides simple two-week reporting for smaller LSF clusters (about 100 hosts, 1,000 jobs/day) and shows trends for basic cluster metrics by user, project, host, resource and queue.
LSF Reports provides the following historical information about a cluster:
- Cluster load
Trends the LSF internal load indices: status, r15s, r1m, r15m, ut, pg, ls, it, swp, mem, tmp, and io.
- Cluster service level
Shows the average cluster service level using the following metrics: CPU time, memory and swap consumption, job runtime, job pending time, and job turnaround time
- Cluster throughput
Shows the amount of work pushed through the cluster, using both accounting information (total number of submitted, completed, and exited jobs) and sampled information (the minimum, maximum, and average number of running and pending jobs, by state and type).
- Shared resource usage
Shows the total, free, and used shared resources for the cluster.
- Reserved resource usage
Shows the actual usage of reserved resources.
- License usage
Shows peak, average, minimum, and maximum license usage by feature.
- License consumption
Shows license minutes consumed by user, feature, vendor, and server.
See Platform LSF Reports Reference for installation and configuration instructions.
Platform LSF Reports is available as separately installable add-on packages located in /lsf_reports/ on the Platform FTP site (ftp.platform.com/).
Run-time enhancements
Control job thread limit like other limits. Use bsub -T to set the limit of the number of concurrent threads for the whole job. The default is no limit. In the queue, set THREADLIMIT to limit the number of concurrent threads that can be part of a job. Exceeding the limit causes the job to terminate.
See Runtime Resource Usage Limits for more information.
Presents consistent job run time limits no matter which host runs the job. With non-normalized job run limit configured, job run time is not normalized by CPU factor.
If ABS_RUNLIMIT=Y is defined in lsb.params, the run time limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted with a run limit.
See Runtime Resource Usage Limits for more information.
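The effect of ABS_RUNLIMIT can be shown with simple arithmetic. The Python sketch below assumes that a normalized limit is scaled by the ratio of a normalization host's CPU factor to the execution host's CPU factor. That formula, the function name, and its parameters are assumptions made for illustration, so check Runtime Resource Usage Limits for the exact rule.

```python
def effective_run_limit(run_limit_min, exec_cpu_factor,
                        norm_cpu_factor=1.0, abs_runlimit=False):
    """Return the wall-clock run limit, in minutes, applied on a host.

    With abs_runlimit set (modelling ABS_RUNLIMIT=Y) the configured limit
    is used unchanged. Otherwise the limit is scaled by
    norm_cpu_factor / exec_cpu_factor, which is an assumed formula.
    """
    if exec_cpu_factor <= 0 or norm_cpu_factor <= 0:
        raise ValueError("CPU factors must be positive")
    if abs_runlimit:
        return float(run_limit_min)
    return run_limit_min * norm_cpu_factor / exec_cpu_factor
```

Under this model a 60-minute limit becomes 30 minutes on a host with CPU factor 2.0, but stays at 60 minutes on every host when the absolute run limit is enabled.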
Improves visibility to resource allocation limits. If your job is pending because some configured resource allocation limit has been reached, you can find out what limits may be blocking your job.
Use the blimits command to show the dynamic counters of each resource allocation limit configured in lsb.resources.
See Resource Allocation Limits for more information.
Upgrade and Compatibility Notes
UPGRADE document
To upgrade to LSF Version 6.0, follow the steps in upgrade.html.
API Compatibility between LSF 5.x and Version 6.0
Full backward compatibility: your applications will run under LSF Version 6.0 without changing any code.
The Platform LSF Version 6.0 API is fully compatible with the LSF Version 5.x and Version 4.x API. An application linked with the LSF Version 5.x and Version 4.x library will run under LSF Version 6.0 without relinking.
To take full advantage of new Platform LSF Version 6.0 features, you should recompile your existing LSF applications with LSF Version 6.0.
Server host compatibility
Platform LSF
You must upgrade the LSF master hosts in your cluster to Version 6.0.
LSF 5.x servers are compatible with Version 6.0 master hosts. All LSF 5.x features are supported by 6.0 master hosts except:
To use new features introduced in Platform LSF Version 6.0, you must upgrade all hosts in your cluster to 6.0.
Platform LSF MultiCluster
You must upgrade the LSF master hosts in all clusters to Version 6.0.
New configuration parameters and environment variables
The following new parameters and environment variables have been added for LSF Version 6.0:
- EXIT_RATE specifies a threshold in minutes for exited jobs
- EADMIN_TRIGGER_DURATION defines how often LSF_SERVERDIR/eadmin is invoked once a job exception is detected.
- JOB_EXIT_RATE_DURATION defines how long LSF waits before checking the job exit rate for a host.
- ABS_RUNLIMIT--if set, the run time limit specified by the -W option of bsub, or the RUNLIMIT queue parameter in lsb.queues, is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted with a run limit.
- DISPATCH_ORDER defines an ordered cross-queue fairshare set
- JOB_IDLE specifies a threshold for idle job exception handling
- JOB_OVERRUN specifies a threshold for job overrun exception handling
- JOB_UNDERRUN specifies a threshold for job underrun exception handling
- RES_REQ accepts multiple ptile specifications in the span section for dynamic ptile enforcement
- SLOT_POOL is the name of the pool of job slots the queue belongs to for queue-based fairshare
- SLOT_SHARE specifies the share of job slots for queue-based fairshare, representing the percentage of running jobs (job slots) in use from the queue
- THREADLIMIT limits the number of concurrent threads that can be part of a job. Exceeding the limit causes the job to terminate
- RUNLIMIT--if ABS_RUNLIMIT=Y is defined in lsb.params, the run time limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted to a queue with a run limit configured.
- LSB_SUB_EXTSCHED_PARAM
Value of external scheduling options specified by bsub -extsched, or queue-level MANDATORY_EXTSCHED or DEFAULT_EXTSCHED
- LSB_SUB_JOB_WARNING_ACTION
Value of job warning action specified by bsub -wa
- LSB_SUB_JOB_WARNING_TIME_PERIOD
Value of job warning time period specified by bsub -wt
New command options and output
The following command options and output have changed for LSF Version 6.0:
-sla service_class_name displays accounting statistics for jobs that ran under the specified service class
-x displays jobs that have triggered a job exception (overrun, underrun, idle)

schddebug sets message log level for mbschd to include additional information in log files
schdtime sets timing level for mbschd to include additional timing information in log files
-C comment logs the text of comment as an administrator comment record to lsb.events for the following subcommands:

-l displays:

-x displays hosts whose job exit rate has exceeded the threshold configured by EXIT_RATE in lsb.hosts for longer than JOB_EXIT_RATE_DURATION configured in lsb.params, and is still high
-l displays the comment text if the LSF administrator specified an administrator comment with the -C option of the badmin host control commands hclose or hopen

-g job_group_name displays information about jobs attached to the specified job group
-l displays the thread limit for the job
-sla service_class_name displays jobs belonging to the specified service class
-x displays unfinished jobs that have triggered a job exception (overrun, underrun, idle)

-g job_group_name operates only on jobs in the specified job group
-sla service_class_name operates on jobs belonging to the specified service class.

-l displays:
- Configured job exception thresholds and number of jobs in each exception state for the queue
- The job slot share (SLOT_SHARE) and the name of the share pool (SLOT_POOL) that the queue belongs to for queue-based fairshare
- DISPATCH_ORDER in a master queue for cross-queue fairshare
- The comment text if the LSF administrator specified an administrator comment with the -C option of the queue control commands qclose, qopen, qact, and qinact, qhist

-g job_group_name resumes only jobs in the specified job group

-R selects hosts for the reservation according to the specified resource requirements

-g job_group_name suspends only jobs in the specified job group
-sla service_class_name suspends jobs belonging to the specified service class

-g job_group_name submits jobs in the specified job group
-R accepts multiple ptile specifications in the span section for dynamic ptile enforcement
-sla service_class_name specifies the service class where the job is to run
-T thread_limit sets the limit of the number of concurrent threads to thread_limit for the whole job.
-W--if ABS_RUNLIMIT=Y is defined in lsb.params, the run time limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted with a run limit.
New files added to installation
The following new files have been added to the Platform LSF Version 6.0 installation:
- LSB_CONFDIR/cluster_name/configdir/lsb.serviceclasses
- LSF_BINDIR/bgadd
- LSF_BINDIR/bgdel
- LSF_BINDIR/bjgroup
- LSF_BINDIR/blimits
- LSF_BINDIR/bsla
- LSF_SERVERDIR/eadmin
- LSF_LIBDIR/schmod_jobweight.so
If your installation uses symbolic links to other files in these directories, you must manually create links to these new files.
New accounting and job event fields
The following fields have been added to lsb.acct and lsb.events:
- JOB_NEW:
- JOB_MODIFY2:
- JOB_EXECUTE:
SLAscaledRunLimit (%d) is the run time limit for the job scaled by the execution host
- QUEUE_CTRL:
ctrlComments (%s) is the administrator comment text from the -C option of badmin queue control commands qclose, qopen, qact, and qinact
- HOST_CTRL:
ctrlComments (%s) is the administrator comment text from the -C option of badmin host control commands hclose and hopen
- MBD_DIE:
ctrlComments (%s) is the administrator comment text from the -C option of badmin mbdrestart
Learning About Platform Products
World Wide Web and FTP
The latest information about all supported releases of Platform LSF is available on the Platform Web site at www.platform.com. Look in the Online Support area for current README files, Release Notes, Upgrade Notices, Frequently Asked Questions (FAQs), Troubleshooting, and other helpful information.
The Platform FTP site (ftp.platform.com) also provides current README files, Release Notes, and Upgrade information for all supported releases of Platform LSF.
Visit the Platform User Forum at www.platformusers.net to discuss workload management and strategies pertaining to distributed and Grid Computing.
If you have problems accessing the Platform web site or the Platform FTP site, contact support@platform.com.
Platform training
Platform's Professional Services training courses can help you gain the skills necessary to effectively install, configure and manage your Platform products. Courses are available for both new and experienced users and administrators at our corporate headquarters and Platform locations worldwide.
Customized on-site course delivery is also available.
Find out more about Platform Training at www.platform.com/training, or contact Training@platform.com for details.
README files and release notes and UPGRADE
Before installing LSF, be sure to read the files named readme.html and release_notes.html. To upgrade to Version 6.0, follow the steps in upgrade.html.
You can also view these files from the Download area of the Platform Online Support Web page.
Platform documentation
Documentation for Platform products is available in HTML and PDF format on the Platform Web site at www.platform.com/services/support/docs_home.asp.
Technical Support
Contact Platform Computing or your LSF vendor for technical support.
1-877-444-4LSF (+1 877 444 4573)
Platform Support
Platform Computing Corporation
3760 14th Avenue
Markham, Ontario
Canada L3R 3T7
When contacting Platform, please include the full name of your company.
We'd like to hear from you
If you find an error in any Platform documentation, or you have a suggestion for improving it, please let us know:
Information Development
Platform Computing Corporation
3760 14th Avenue
Markham, Ontario
Canada L3R 3T7
Be sure to tell us:
- The title of the manual you are commenting on
- The version of the product you are using
- The format of the manual (HTML or PDF)
Date Modified: January 12, 2004
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.