Learn more about Platform products at http://www.platform.com

[ Platform Documentation ] [ Title ] [ Contents ] [ Previous ] [ Next ] [ Index ]



Error and Event Logging


Contents

[ Top ]


System Directories and Log Files

LSF uses directories for temporary work files, log files and transaction files and spooling.

LSF keeps track of all jobs in the system by maintaining a transaction log in the work subtree. The LSF log files are found in the directory LSB_SHAREDIR/cluster_name/logdir.

The following files maintain the state of the LSF system:

lsb.events

LSF uses the lsb.events file to keep track of the state of all jobs. Each job is a transaction from job submission to job completion. LSF system keeps track of everything associated with the job in the lsb.events file.

lsb.events.n

The events file is automatically trimmed and old job events are stored in lsb.event.n files. When mbatchd starts, it refers only to the lsb.events file, not the lsb.events.n files. The bhist command can refer to these files.

Job script files in the info directory

When a user issues a bsub command from a shell prompt, LSF collects all of the commands issued on the bsub line and spools the data to mbatchd, which saves the bsub command script in the info directory for use at dispatch time or if the job is rerun. The info directory is managed by LSF and should not be modified by anyone.

Log directory permissions and ownership

Ensure that the permissions on the LSF_LOGDIR directory to be writable by root. The LSF administrator must own LSF_LOGDIR.

Support for UNICOS accounting

In Cray UNICOS environments, LSF writes to the Network Queuing System (NQS) accounting data file, nqacct, on the execution host. This lets you track LSF jobs and other jobs together, through NQS.

Support for IRIX Comprehensive System Accounting (CSA)

The IRIX 6.5.9 Comprehensive System Accounting facility (CSA) writes an accounting record for each process in the pacct file, which is usually located in the /var/adm/acct/day directory. IRIX system administrators then use the csabuild command to organize and present the records on a job by job basis.

The LSF_ENABLE_CSA parameter in lsf.conf enables LSF to write job events to the pacct file for processing through CSA. For LSF job accounting, records are written to pacct at the start and end of each LSF job.

See the Platform LSF Reference for more information about the LSF_ENABLE_CSA parameter.

See the IRIX 6.5.9 resource administration documentation for information about CSA.

[ Top ]


Managing Error Logs

Error logs maintain important information about LSF operations. When you see any abnormal behavior in LSF, you should first check the appropriate error logs to find out the cause of the problem.

LSF log files grow over time. These files should occasionally be cleared, either by hand or using automatic scripts.

Daemon error log

LSF log files are reopened each time a message is logged, so if you rename or remove a daemon log file, the daemons will automatically create a new log file.

The LSF daemons log messages when they detect problems or unusual situations.

The daemons can be configured to put these messages into files.

The error log file names for the LSF system daemons are:

LSF daemons log error messages in different levels so that you can choose to log all messages, or only log messages that are deemed critical. Message logging is controlled by the parameter LSF_LOG_MASK in lsf.conf. Possible values for this parameter can be any log priority symbol that is defined in /usr/include/sys/syslog.h. The default value for LSF_LOG_MASK is LOG_WARNING.

Error logging

If the optional LSF_LOGDIR parameter is defined in lsf.conf, error messages from LSF servers are logged to files in this directory.

If LSF_LOGDIR is defined, but the daemons cannot write to files there, the error log files are created in /tmp.

If LSF_LOGDIR is not defined, errors are logged to the system error logs (syslog) using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the man pages for syslog(3) and syslogd(1).

If the error log is managed by syslog, it is probably already being automatically cleared.

If LSF daemons cannot find lsf.conf when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are likely in the syslog.

[ Top ]


System Event Log

The LSF daemons keep an event log in the lsb.events file. The mbatchd daemon uses this information to recover from server failures, host reboots, and mbatchd restarts. The lsb.events file is also used by the bhist command to display detailed information about the execution history of batch jobs, and by the badmin command to display the operational history of hosts, queues, and daemons.

By default, mbatchd automatically backs up and rewrites the lsb.events file after every 1000 batch job completions. This value is controlled by the MAX_JOB_NUM parameter in the lsb.params file. The old lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1. LSF never deletes these files. If disk storage is a concern, the LSF administrator should arrange to archive or remove old lsb.events.n files periodically.

CAUTION

Do not remove or modify the current lsb.events file. Removing or modifying the lsb.events file could cause batch jobs to be lost.

[ Top ]


Duplicate Logging of Event Logs

To recover from server failures, host reboots, or mbatchd restarts, LSF uses information stored in lsb.events. To improve the reliability of LSF, you can configure LSF to maintain copies of these logs, to use as a backup.

If the host that contains the primary copy of the logs fails, LSF will continue to operate using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update the primary copies.

How duplicate logging works

By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR resides on a reliable file server that also contains other critical applications necessary for running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a secondary issue. LSB_SHAREDIR must be accessible from all potential LSF master hosts.

When you configure duplicate logging, the duplicates are kept on the file server, and the primary event logs are stored on the first master host. In other words, LSB_LOCALDIR is used to store the primary copy of the batch state information, and the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central file server. This has the following effects:

Failure of file server

If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs. Client commands such as bhist, which directly read LSB_SHAREDIR will not work.

When the file server recovers, the current log files are replicated to LSB_SHAREDIR.

Failure of first master host

If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become unavailable. Then, a new master host is selected. The new master host uses the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events. There is no duplication by the second or any subsequent LSF master hosts.

When the first master host becomes available after a failure, it will update the primary copies of the files (in LSB_LOCALDIR) from the duplicates (in) and continue operations as before.

If the first master host does not recover, LSF will continue to use the files in LSB_SHAREDIR, but there is no more duplication of the log files.

Simultaneous failure of both hosts

If the master host containing LSB_LOCALDIR and the file server containing LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.

Network partioning

We assume that Network partitioning does not cause a cluster to split into two independent clusters, each simultaneously running mbatchd.

This may happen given certain network topologies and failure modes. For example, connectivity is lost between the first master, M1, and both the file server and the secondary master, M2. Both M1 and M2 will run mbatchd service with M1 logging events to LSB_LOCALDIR and M2 logging to LSB_SHAREDIR. When connectivity is restored, the changes made by M2 to LSB_SHAREDIR will be lost when M1 updates LSB_SHAREDIR from its copy in LSB_LOCALDIR.

The archived event files are only available on LSB_LOCALDIR, so in the case of network partitioning, commands such as bhist cannot access these files. As a precaution, you should periodically copy the archived files from LSB_LOCALDIR to LSB_SHAREDIR.

Setting an event update interval

If NFS traffic is too high and you want to reduce network traffic, use EVENT_UPDATE_INTERVAL in lsb.params to specify how often to back up the data and synchronize the LSB_SHAREDIR and LSB_LOCALDIR directories.

The directories are always synchronized when data is logged to the files, or when mbatchd is started on the first LSF master host.

Automatic archiving and duplicate logging

Event logs

Archived event logs, lsb.events.n, are not replicated to LSB_SHAREDIR. If LSF starts a new event log while the file server containing LSB_SHAREDIR is down, you might notice a gap in the historical data in LSB_SHAREDIR.

Configuring duplicate logging

To enable duplicate logging, set LSB_LOCALDIR in lsf.conf to a directory on the first master host (the first host configured in lsf.cluster.cluster_name) that will be used to store the primary copies of lsb.events. This directory should only exist on the first master host.

  1. Edit lsf.conf and set LSB_LOCALDIR to a local directory that exists only on the first master host.
  2. Use the commands lsadmin reconfig and badmin mbdrestart to make the changes take effect.

[ Top ]


[ Platform Documentation ] [ Title ] [ Contents ] [ Previous ] [ Next ] [ Index ]


      Date Modified: January 12, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.