- System Directories and Log Files
- Managing Error Logs
- System Event Log
- Duplicate Logging of Event Logs
System Directories and Log Files
LSF uses directories for temporary work files, log files, transaction files, and spooling.
LSF keeps track of all jobs in the system by maintaining a transaction log in the work subtree. The LSF log files are found in the directory LSB_SHAREDIR/cluster_name/logdir.
The following files maintain the state of the LSF system:
lsb.events
LSF uses the lsb.events file to keep track of the state of all jobs. Each job is a transaction from job submission to job completion. LSF keeps track of everything associated with the job in the lsb.events file.
lsb.events.n
The events file is automatically trimmed, and old job events are stored in lsb.events.n files. When mbatchd starts, it refers only to the lsb.events file, not the lsb.events.n files. The bhist command can refer to these files.
Job script files in the info directory
When a user issues a bsub command from a shell prompt, LSF collects all of the commands issued on the bsub line and spools the data to mbatchd, which saves the bsub command script in the info directory for use at dispatch time or if the job is rerun. The info directory is managed by LSF and should not be modified by anyone.
Log directory permissions and ownership
Ensure that the permissions on the LSF_LOGDIR directory make it writable by root. The LSF administrator must own LSF_LOGDIR.
Support for UNICOS accounting
In Cray UNICOS environments, LSF writes to the Network Queuing System (NQS) accounting data file, nqacct, on the execution host. This lets you track LSF jobs and other jobs together, through NQS.
Support for IRIX Comprehensive System Accounting (CSA)
The IRIX 6.5.9 Comprehensive System Accounting facility (CSA) writes an accounting record for each process in the pacct file, which is usually located in the /var/adm/acct/day directory. IRIX system administrators then use the csabuild command to organize and present the records on a job-by-job basis.
The LSF_ENABLE_CSA parameter in lsf.conf enables LSF to write job events to the pacct file for processing through CSA. For LSF job accounting, records are written to pacct at the start and end of each LSF job.
See the Platform LSF Reference for more information about the LSF_ENABLE_CSA parameter.
See the IRIX 6.5.9 resource administration documentation for information about CSA.
Managing Error Logs
Error logs maintain important information about LSF operations. When you see any abnormal behavior in LSF, you should first check the appropriate error logs to find out the cause of the problem.
LSF log files grow over time. These files should occasionally be cleared, either by hand or using automatic scripts.
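One way to automate the clean-up is sketched below. It relies on the documented behavior that the daemons reopen their log files each time a message is logged, so a renamed file is simply replaced by a new one. The function name, the size threshold, the `archive` subdirectory, and the timestamp suffix are all assumptions of this sketch, not LSF conventions.

```python
import os
import time


def rotate_daemon_logs(logdir, max_bytes, archive_subdir="archive"):
    """Move oversized daemon logs out of the way; return the new paths.

    archive_subdir and the timestamp suffix are hypothetical choices
    made for this sketch.
    """
    archive_dir = os.path.join(logdir, archive_subdir)
    stamp = time.strftime("%Y%m%d%H%M%S")
    rotated = []
    for name in sorted(os.listdir(logdir)):
        path = os.path.join(logdir, name)
        # Daemon logs are named <daemon>.log.<host_name>, e.g. lim.log.hostA.
        if ".log." not in name or not os.path.isfile(path):
            continue
        if os.path.getsize(path) <= max_bytes:
            continue
        os.makedirs(archive_dir, exist_ok=True)
        target = os.path.join(archive_dir, f"{name}.{stamp}")
        # Renaming is safe: the daemon creates a fresh file on its next message.
        os.rename(path, target)
        rotated.append(target)
    return rotated
```

A script like this could be run periodically (for example from cron) against the directory named by LSF_LOGDIR, with a second step that compresses or deletes old files in the archive subdirectory.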
Daemon error log
LSF log files are reopened each time a message is logged, so if you rename or remove a daemon log file, the daemons will automatically create a new log file.
The LSF daemons log messages when they detect problems or unusual situations.
The daemons can be configured to put these messages into files.
The error log file names for the LSF system daemons are:
- lim.log.host_name
- res.log.host_name
- pim.log.host_name
- sbatchd.log.host_name
- mbatchd.log.host_name
- mbschd.log.host_name
LSF daemons log error messages in different levels so that you can choose to log all messages, or only log messages that are deemed critical. Message logging is controlled by the parameter LSF_LOG_MASK in lsf.conf. Possible values for this parameter can be any log priority symbol that is defined in /usr/include/sys/syslog.h. The default value for LSF_LOG_MASK is LOG_WARNING.
Error logging
If the optional LSF_LOGDIR parameter is defined in lsf.conf, error messages from LSF servers are logged to files in this directory.
If LSF_LOGDIR is defined, but the daemons cannot write to files there, the error log files are created in /tmp.
If LSF_LOGDIR is not defined, errors are logged to the system error logs (syslog) using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the man pages for syslog(3) and syslogd(1).
If the error log is managed by syslog, it is probably already being automatically cleared.
If LSF daemons cannot find lsf.conf when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are likely in the syslog.
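When hunting for a missing error message, it can help to walk through the rules above mechanically. The following sketch encodes them as a small decision function. It assumes lsf.conf consists of NAME=value lines with # comments; the function names and the injectable `writable` check are hypothetical, and the code only mirrors the rules described here rather than the daemons' actual implementation.

```python
import os


def parse_lsf_conf(text):
    """Parse NAME=value lines, ignoring blank lines and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip().strip('"')
    return params


def error_log_destination(conf_text, writable=os.access):
    """Return where daemon error messages are expected to end up.

    conf_text is the content of lsf.conf, or None if the daemons could
    not find the file at startup.
    """
    if conf_text is None:
        return "syslog"  # LSF_LOGDIR cannot be read at all
    logdir = parse_lsf_conf(conf_text).get("LSF_LOGDIR")
    if not logdir:
        return "syslog"  # not defined: LOG_DAEMON facility
    if writable(logdir, os.W_OK):
        return logdir
    return "/tmp"  # defined, but the daemons cannot write there
```

The `writable` argument defaults to `os.access`, which checks the permissions of the user running the script; that user may differ from the account the daemons run under, so treat the result as a hint about where to look first.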
System Event Log
The LSF daemons keep an event log in the lsb.events file. The mbatchd daemon uses this information to recover from server failures, host reboots, and mbatchd restarts. The lsb.events file is also used by the bhist command to display detailed information about the execution history of batch jobs, and by the badmin command to display the operational history of hosts, queues, and daemons.
By default, mbatchd automatically backs up and rewrites the lsb.events file after every 1000 batch job completions. This value is controlled by the MAX_JOB_NUM parameter in the lsb.params file. The old lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1. LSF never deletes these files. If disk storage is a concern, the LSF administrator should arrange to archive or remove old lsb.events.n files periodically.
Do not remove or modify the current lsb.events file. Removing or modifying the lsb.events file could cause batch jobs to be lost.
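The periodic archiving of old lsb.events.n files could be scripted along the following lines. This is a minimal sketch: the function name, the `keep` threshold, and the archive directory are assumptions. Because LSF renumbers the files at every rotation, the sketch appends each file's modification time to the archived name so that a later lsb.events.6 does not overwrite an earlier one.

```python
import os
import re
import shutil

# Matches lsb.events.1, lsb.events.2, ... but never the current lsb.events.
ARCHIVED = re.compile(r"^lsb\.events\.(\d+)$")


def archive_old_event_logs(logdir, archive_dir, keep=5):
    """Move lsb.events.n files with n > keep into archive_dir.

    Returns the sorted list of original file names that were moved.
    The current lsb.events file is never touched.
    """
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in os.listdir(logdir):
        match = ARCHIVED.match(name)
        if not match or int(match.group(1)) <= keep:
            continue
        source = os.path.join(logdir, name)
        # Suffix with the modification time to keep archived names unique.
        stamp = int(os.path.getmtime(source))
        shutil.move(source, os.path.join(archive_dir, f"{name}.{stamp}"))
        moved.append(name)
    return sorted(moved)
```

Keep in mind that bhist can only read archived event files that are still in the log directory, so `keep` should cover the history that users are expected to query.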
Duplicate Logging of Event Logs
To recover from server failures, host reboots, or mbatchd restarts, LSF uses information stored in lsb.events. To improve the reliability of LSF, you can configure LSF to maintain copies of these logs, to use as a backup.
If the host that contains the primary copy of the logs fails, LSF will continue to operate using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update the primary copies.
How duplicate logging works
By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR resides on a reliable file server that also contains other critical applications necessary for running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a secondary issue. LSB_SHAREDIR must be accessible from all potential LSF master hosts.
When you configure duplicate logging, the duplicates are kept on the file server, and the primary event logs are stored on the first master host. In other words, LSB_LOCALDIR is used to store the primary copy of the batch state information, and the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central file server. This has the following effects:
- Creates backup copies of lsb.events
- Reduces the load on the central file server
- Increases the load on the LSF master host
If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs. Client commands such as bhist, which directly read LSB_SHAREDIR, will not work.
When the file server recovers, the current log files are replicated to LSB_SHAREDIR.
If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become unavailable. Then, a new master host is selected. The new master host uses the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events. There is no duplication by the second or any subsequent LSF master hosts.
When the first master host becomes available after a failure, it will update the primary copies of the files (in LSB_LOCALDIR) from the duplicates (in LSB_SHAREDIR) and continue operations as before.
If the first master host does not recover, LSF will continue to use the files in LSB_SHAREDIR, but there is no more duplication of the log files.
If the master host containing LSB_LOCALDIR and the file server containing LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.
We assume that network partitioning does not cause a cluster to split into two independent clusters, each simultaneously running mbatchd.
This may happen given certain network topologies and failure modes. For example, connectivity is lost between the first master, M1, and both the file server and the secondary master, M2. Both M1 and M2 will run the mbatchd service, with M1 logging events to LSB_LOCALDIR and M2 logging to LSB_SHAREDIR. When connectivity is restored, the changes made by M2 to LSB_SHAREDIR will be lost when M1 updates LSB_SHAREDIR from its copy in LSB_LOCALDIR.
The archived event files are only available on LSB_LOCALDIR, so in the case of network partitioning, commands such as bhist cannot access these files. As a precaution, you should periodically copy the archived files from LSB_LOCALDIR to LSB_SHAREDIR.
If NFS traffic is too high and you want to reduce network traffic, use EVENT_UPDATE_INTERVAL in lsb.params to specify how often to back up the data and synchronize the LSB_SHAREDIR and LSB_LOCALDIR directories.
The directories are always synchronized when data is logged to the files, or when mbatchd is started on the first LSF master host.
Automatic archiving and duplicate logging
Archived event logs, lsb.events.n, are not replicated to LSB_SHAREDIR. If LSF starts a new event log while the file server containing LSB_SHAREDIR is down, you might notice a gap in the historical data in LSB_SHAREDIR.
Configuring duplicate logging
To enable duplicate logging, set LSB_LOCALDIR in lsf.conf to a directory on the first master host (the first host configured in lsf.cluster.cluster_name) that will be used to store the primary copies of lsb.events. This directory should only exist on the first master host.
- Edit lsf.conf and set LSB_LOCALDIR to a local directory that exists only on the first master host.
- Use the commands lsadmin reconfig and badmin mbdrestart to make the changes take effect.
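Before restarting the daemons, the edited lsf.conf can be sanity-checked with a short script. The sketch below assumes lsf.conf consists of NAME=value lines with # comments; the function name and the specific checks are illustrative assumptions, not an LSF tool, and it cannot verify that the directory exists only on the first master host.

```python
import os


def check_duplicate_logging(conf_text):
    """Return a list of problems found in the lsf.conf text (empty if none)."""
    params = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            name, value = line.split("=", 1)
            params[name.strip()] = value.strip().strip('"')

    problems = []
    local = params.get("LSB_LOCALDIR")
    shared = params.get("LSB_SHAREDIR")
    if not local:
        problems.append("LSB_LOCALDIR is not set; duplicate logging is not enabled")
    elif shared and os.path.normpath(local) == os.path.normpath(shared):
        # The primary copy and the replica must live in different places.
        problems.append("LSB_LOCALDIR and LSB_SHAREDIR point to the same directory")
    return problems
```

For example, passing the contents of lsf.conf read with `open(path).read()` returns an empty list when LSB_LOCALDIR is set and distinct from LSB_SHAREDIR.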
Date Modified: January 12, 2004
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.