- Checkpointing Jobs
- Approaches to Checkpointing
- Creating Custom echkpnt and erestart for Application-level Checkpointing
- Restarting Checkpointed Jobs
- Migrating Jobs
Checkpointing Jobs
Checkpointing a job involves capturing the state of an executing job and the data necessary to restart it, so that the work done to reach the current stage is not wasted. The job state information is saved in a checkpoint file. There are several reasons to checkpoint a job.
Fault tolerance
To provide job fault tolerance, checkpoints are taken at regular intervals (periodically) during the job's execution. If the job is killed or migrated, or if the job fails for a reason other than host failure, the job can be restarted from its last checkpoint without repeating the work done before that checkpoint.
Migration
Checkpointing enables a migrating job to make progress rather than restarting the job from the beginning. Jobs can be migrated when a host fails or when a host becomes unavailable due to load.
Load balancing
Checkpointing a job and restarting it (migrating) on another host provides load balancing by moving load (jobs) from a heavily loaded host to a lightly loaded host.
Approaches to Checkpointing
LSF provides support for most checkpoint and restart implementations through uniform interfaces, echkpnt and erestart. All interaction between LSF and the checkpoint implementations is handled by these commands. See the echkpnt(8) and erestart(8) man pages for more information.
Checkpoint and restart implementations are categorized based on the facility that performs the checkpoint and the amount of knowledge an executable has of the checkpoint. Commonly, checkpoint and restart implementations are grouped as kernel-level, user-level, and application-level.
Kernel-level checkpointing
Kernel-level checkpointing is provided by the operating system and can be applied to arbitrary jobs running on the system. This approach is transparent to the application: no source code changes are needed, and you do not need to re-link your application with checkpoint libraries.
To support kernel-level checkpoint and restart, LSF provides an echkpnt and erestart executable that invokes OS-specific system calls.
Kernel-level checkpointing is currently supported on:
See the chkpnt(1) man page on Cray systems and the cpr(1) man page on IRIX systems for the limitations of their checkpoint implementations.
User-level checkpointing
LSF provides a method, called user-level checkpointing, to checkpoint jobs on systems that do not support kernel-level checkpointing. To implement user-level checkpointing, you must have access to your application's object files (.o files), and they must be re-linked with a set of libraries provided by LSF in LSF_LIBDIR. This approach is transparent to your application: its code does not have to be changed, and the application does not know that a checkpoint and restart has occurred.
Application-level checkpointing
The application-level approach applies to applications that are specially written to accommodate checkpoint and restart. The application writer must also provide an echkpnt and erestart to interface with LSF. For more details, see the echkpnt(8) and erestart(8) man pages. The application checkpoints itself either periodically or in response to signals sent by other processes. When restarted, the application itself must look for the checkpoint files and restore its state.
Creating Custom echkpnt and erestart for Application-level Checkpointing
Different applications may have different checkpointing implementations and custom echkpnt and erestart programs.
You can write your own echkpnt and erestart programs to checkpoint your specific applications and tell LSF which program to use for which application.
- Writing custom echkpnt and erestart programs
- Configuring LSF to recognize the custom echkpnt and erestart
Writing custom echkpnt and erestart programs
You can write your own echkpnt and erestart interfaces in C or Fortran.
Assign the names echkpnt.method_name and erestart.method_name, where method_name is the name that identifies the program for a specific application.
For example, if your custom echkpnt is for my_app, you would have: echkpnt.my_app, erestart.my_app.
Place echkpnt.method_name and erestart.method_name in LSF_SERVERDIR. You can specify a different directory with LSB_ECHKPNT_METHOD_DIR as an environment variable or in lsf.conf.
The method name (LSB_ECHKPNT_METHOD in lsf.conf or as an environment variable) and location (LSB_ECHKPNT_METHOD_DIR) combination must be unique in the cluster. For example, you may have two echkpnt applications with the same name, such as echkpnt.mymethod, that are differentiated by the directories defined with LSB_ECHKPNT_METHOD_DIR.
The checkpoint method directory should be accessible by all users who need to run the custom echkpnt and erestart programs.
Your echkpnt.method_name must recognize commands in the following syntax, as these are the options used by echkpnt to communicate with your echkpnt.method_name:
echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID
For more details on echkpnt syntax, see the echkpnt(8) man page.
Your erestart.method_name must recognize commands in the following syntax, as these are the options used by erestart to communicate with your erestart.method_name:
erestart [-c] [-f] checkpoint_dir
For more details, see the erestart(8) man page.
Return values for echkpnt.method_name
If echkpnt.method_name is able to successfully checkpoint the job, it exits with a 0. Non-zero values indicate that the job checkpoint failed.
stderr and stdout are ignored by LSF. You can save these to a file by setting LSB_ECHKPNT_KEEP_OUTPUT=y in lsf.conf or as an environment variable.
Return values for erestart.method_name
erestart.method_name creates the file checkpoint_dir/$LSB_JOBID/.restart_cmd and writes in this file the command to restart the job or process group in the form:
LSB_RESTART_CMD=restart_command
For example, if the command to restart a job is my_restart my_job, erestart.method_name writes to the .restart_cmd file:
LSB_RESTART_CMD=my_restart my_job
erestart then reads the .restart_cmd file and uses the command specified with LSB_RESTART_CMD as the command to restart the job.
You have the choice of writing to the file or not. Return a 0 if erestart.method_name succeeds in writing the job restart command to the file checkpoint_dir/$LSB_JOBID/.restart_cmd, or if it purposefully writes nothing to the file. Non-zero values indicate that erestart.method_name was not able to restart the job.
For user-level checkpointing, erestart.method_name must collect the exit code from the job. Then, erestart.method_name must exit with the same exit code as the job. Otherwise, the job's exit status is not reported correctly to LSF. Kernel-level checkpointing works differently and does not need this information from erestart.method_name to restart the job.
erestart.method_name:
- Must have access to the original command line used to start the job.
- Must return; it should not run the application to restart the job.
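The .restart_cmd convention described above can be illustrated with a short sketch. It is written in Python for brevity, although the text says custom programs are written in C or Fortran. The helper name write_restart_cmd, the trailing newline, and the use of 1 as the failure code are assumptions made for illustration; they are not part of LSF.

```python
import os
import tempfile


def write_restart_cmd(checkpoint_dir, job_id, restart_command):
    """Write LSB_RESTART_CMD=<command> to checkpoint_dir/<job_id>/.restart_cmd.

    Returns 0 on success and a non-zero value (1, chosen arbitrarily) on
    failure, following the exit-code convention described in the text.
    """
    path = os.path.join(checkpoint_dir, str(job_id), ".restart_cmd")
    try:
        with open(path, "w") as f:
            # Assumption: one line terminated by a newline.
            f.write("LSB_RESTART_CMD=%s\n" % restart_command)
    except OSError:
        return 1
    return 0


# Demonstration in a throwaway directory standing in for checkpoint_dir.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "123"))
status = write_restart_cmd(demo_dir, 123, "my_restart my_job")
with open(os.path.join(demo_dir, "123", ".restart_cmd")) as f:
    written = f.read()
print(status, written.strip())  # 0 LSB_RESTART_CMD=my_restart my_job
```

In a real erestart.method_name, the job ID would come from the LSB_JOBID environment variable and the returned value would be used as the process exit code.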
Any information echkpnt writes to stderr is considered by sbatchd as an echkpnt failure. However, not all errors are fatal. If the chkpnt explicitly writes "Checkpoint done" to stdout or stderr, sbatchd assumes echkpnt has succeeded.
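The option syntax that echkpnt.method_name must recognize, given earlier in this section, can be modelled with a small argument parser. This is only a sketch of the grammar in Python, not an LSF-provided program: the program name echkpnt.my_app follows the naming example above, and treating process_group_ID as an integer is an assumption. The meaning of each option is documented in the echkpnt(8) man page and is not interpreted here.

```python
import argparse


def build_echkpnt_parser():
    """Parser for: [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID"""
    parser = argparse.ArgumentParser(prog="echkpnt.my_app")
    parser.add_argument("-c", action="store_true")
    parser.add_argument("-f", action="store_true")
    # -k and -s are alternatives in the syntax, so at most one may be given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-k", action="store_true")
    group.add_argument("-s", action="store_true")
    parser.add_argument("-d", dest="checkpoint_dir")
    parser.add_argument("-x", action="store_true")
    # Assumption: the process group ID is numeric.
    parser.add_argument("process_group_id", type=int)
    return parser


args = build_echkpnt_parser().parse_args(["-k", "-d", "my_dir", "4705"])
print(args.k, args.checkpoint_dir, args.process_group_id)  # True my_dir 4705
```

Passing both -k and -s makes the parser exit with an error, which matches the [-k | -s] alternative in the syntax line.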
Configuring LSF to recognize the custom echkpnt and erestart
You can set the following parameters in lsf.conf or as environment variables. If set in lsf.conf, these parameters apply globally to the cluster and will be the default values. Parameters specified as environment variables override the parameters specified in lsf.conf.
If you set parameters in lsf.conf, reconfigure your cluster with lsadmin reconfig and badmin mbdrestart so that changes take effect.
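The precedence rule above (an environment variable overrides the lsf.conf default) can be sketched as a small lookup. This is an illustrative model, not LSF code: the function names are hypothetical, and the parser assumes a simplified lsf.conf made of NAME=value lines with # comments.

```python
def parse_lsf_conf(text):
    """Parse NAME=value lines, ignoring # comments (simplified assumption)."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            params[key.strip()] = value.strip()
    return params


def resolve(name, environ, conf):
    """Return the environment value if set, else the lsf.conf default."""
    if name in environ:
        return environ[name]
    return conf.get(name)


conf = parse_lsf_conf("LSB_ECHKPNT_METHOD=myapp\nLSB_ECHKPNT_KEEP_OUTPUT=y\n")
env = {"LSB_ECHKPNT_METHOD": "other_app"}
print(resolve("LSB_ECHKPNT_METHOD", env, conf))       # other_app
print(resolve("LSB_ECHKPNT_KEEP_OUTPUT", env, conf))  # y
```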
- Set LSB_ECHKPNT_METHOD=method_name in lsf.conf or as an environment variable
OR
When you submit the job, specify the checkpoint and restart method. For example:
% bsub -k "mydir method=myapp" job1
- Copy your echkpnt.method_name and erestart.method_name to LSF_SERVERDIR.
OR
If you want to specify a different directory than LSF_SERVERDIR, set LSB_ECHKPNT_METHOD_DIR, in lsf.conf or as an environment variable, to the absolute path of the directory in which your echkpnt.method_name and erestart.method_name are located.
The checkpoint method directory should be accessible by all users who need to run the custom echkpnt and erestart programs.
- (Optional)
To save standard error and standard output messages for echkpnt.method_name and erestart.method_name, set LSB_ECHKPNT_KEEP_OUTPUT=y in lsf.conf or as an environment variable.
The stdout and stderr output generated by echkpnt.method_name will be redirected to:
The stdout and stderr output generated by erestart.method_name will be redirected to:
Otherwise, if LSB_ECHKPNT_KEEP_OUTPUT is not defined, standard error and output will be redirected to /dev/null and discarded.
Checkpointing a Job
Before LSF can checkpoint a job, it must be made checkpointable. LSF provides automatic and manual controls to make jobs checkpointable and to checkpoint jobs. When working with checkpointable jobs, a checkpoint directory must always be specified. Optionally, a checkpoint period can be specified to enable periodic checkpointing.
When a job is checkpointed, LSF performs the following actions:
- Stops the job if it is running
- Creates the checkpoint file in the checkpoint directory
- Restarts the job
LSF can create a checkpoint for any eligible job. Review the discussion about Approaches to Checkpointing to determine if your application and environment are suitable for checkpointing.
- The Checkpoint Directory
- Making Jobs Checkpointable
- Manually Checkpointing Jobs
- Enabling Periodic Checkpointing
- Automatically Checkpointing Jobs
The Checkpoint Directory
A checkpoint directory must be specified for every checkpointable job and is used to store the files to restart a job. The directory must be writable by the job owner. To restart the job on another host (job migration), the directory must be accessible by both hosts. LSF does not delete the checkpoint files; checkpoint file maintenance is the user's responsibility.
LSF writes the checkpoint file in a directory named with the job ID of the job being checkpointed under the checkpoint directory. This allows LSF to checkpoint multiple jobs to the same checkpoint directory. For example, when you specify a checkpoint directory called my_dir and job 123 is checkpointed, LSF will save the checkpoint file in:
my_dir/123/
When LSF restarts a checkpointed job, it renames the checkpoint directory using the job ID of the new job and creates a symbolic link from the old checkpoint directory to the new one. For example, if a job with job ID 123 is restarted with job ID 456, the checkpoint directory will be renamed to:
my_dir/456/
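The rename-and-link behavior described above can be modelled as follows. This Python sketch only illustrates the resulting directory layout and is not how LSF implements it; the function name relink_checkpoint_dir and the use of a relative link target are assumptions.

```python
import os
import tempfile


def relink_checkpoint_dir(checkpoint_dir, old_job_id, new_job_id):
    """Rename <dir>/<old_id> to <dir>/<new_id> and leave a symlink at the old name."""
    old_path = os.path.join(checkpoint_dir, str(old_job_id))
    new_path = os.path.join(checkpoint_dir, str(new_job_id))
    os.rename(old_path, new_path)
    # Assumption: a relative target, resolved against checkpoint_dir.
    os.symlink(str(new_job_id), old_path)
    return new_path


# Demonstration in a temporary directory standing in for my_dir.
my_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(my_dir, "123"))
open(os.path.join(my_dir, "123", "chkpnt_file"), "w").close()

relink_checkpoint_dir(my_dir, 123, 456)
print(os.path.islink(os.path.join(my_dir, "123")))                # True
print(os.path.exists(os.path.join(my_dir, "123", "chkpnt_file")))  # True
```

After the call, the files are reachable through both my_dir/456/ and the old my_dir/123/ path.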
Making Jobs Checkpointable
Making a job checkpointable involves specifying the location of a checkpoint directory to LSF. This can be done manually on the command line or automatically through configuration.
Manually
Manually making a job checkpointable involves specifying the checkpoint directory on the command line. LSF will create the directory if it does not exist. A job can be made checkpointable at job submission or after submission.
Use the -k "checkpoint_dir" option of bsub to specify the checkpoint directory for a job at submission. For example, to specify my_dir as the checkpoint directory for my_job:
% bsub -k "my_dir" my_job
Job <123> is submitted to default queue <default>.
Use the -k "checkpoint_dir" option of bmod to specify the checkpoint directory for a job after submission. For example, to specify my_dir as the checkpoint directory for a job with job ID 123:
% bmod -k "my_dir" 123
Parameters of job <123> are being changed
Automatically
Automatically making a job checkpointable involves submitting the job to a queue that is configured for checkpointable jobs. To configure a queue, edit lsb.queues and specify the checkpoint directory for the CHKPNT parameter on a queue. The checkpoint directory must already exist; LSF will not create the directory.
For example, to configure a queue for checkpointable jobs using a directory named my_dir:
Begin Queue
...
CHKPNT=my_dir
DESCRIPTION = Make jobs checkpointable using "my_dir"
...
End Queue
Manually Checkpointing Jobs
LSF provides the bchkpnt command to manually checkpoint jobs. LSF can only perform a checkpoint for checkpointable jobs as described in Making Jobs Checkpointable. For example, to checkpoint a job with job ID 123:
% bchkpnt 123
Job <123> is being checkpointed
Interactive jobs (bsub -I) cannot be checkpointed.
Checkpointing and killing a job
By default, after a job has been successfully checkpointed, it continues to run. Use the bchkpnt -k command to kill the job after the checkpoint file has been created. Killing the job ensures the job does not do any processing or I/O after the checkpoint. For example, to checkpoint and kill a job with job ID 123:
% bchkpnt -k 123
Job <123> is being checkpointed
Enabling Periodic Checkpointing
Periodic checkpointing involves creating a checkpoint file at regular time intervals during the execution of your job. LSF provides the ability to enable periodic checkpointing manually on the command line and automatically through configuration. Automatic periodic checkpointing is discussed in Automatically Checkpointing Jobs. LSF can only perform a checkpoint for checkpointable jobs as described in Making Jobs Checkpointable.
Manually enabling periodic checkpointing involves specifying a checkpoint period in minutes.
LSF uses the -k "checkpoint_dir checkpoint_period" option of bsub to enable periodic checkpointing at job submission. For example, to periodically checkpoint my_job every 2 hours (120 minutes):
% bsub -k "my_dir 120" my_job
Job <123> is submitted to default queue <default>.
LSF uses the -p period option of bchkpnt to enable periodic checkpointing after submission. When a checkpoint period is specified after submission, LSF checkpoints the job immediately, then checkpoints it again after the specified period of time. For example, to periodically checkpoint a job with job ID 123 every 2 hours (120 minutes):
% bchkpnt -p 120 123
Job <123> is being checkpointed
You can also use the -p option of bchkpnt to change a checkpoint period. For example, to change the checkpoint period of a job with job ID 123 to every 4 hours (240 minutes):
% bchkpnt -p 240 123
Job <123> is being checkpointed
Disabling periodic checkpointing
To disable periodic checkpointing, specify a period of 0 (zero). For example, to disable periodic checkpointing for a job with job ID 123:
% bchkpnt -p 0 123
Job <123> is being checkpointed
Automatically Checkpointing Jobs
Automatically checkpointing jobs involves submitting a job to a queue that is configured for periodic checkpointing. To configure a queue, edit lsb.queues and specify a checkpoint directory and a checkpoint period for the CHKPNT parameter for a queue. The checkpoint directory must already exist; LSF will not create the directory. The checkpoint period is specified in minutes. All jobs submitted to the queue will be automatically checkpointed. For example, to configure a queue to periodically checkpoint jobs every 4 hours (240 minutes) to a directory named my_dir:
Begin Queue
...
CHKPNT=my_dir 240
DESCRIPTION=Auto chkpnt every 4 hrs (240 min) to my_dir
...
End Queue
All jobs submitted to a queue configured for checkpointing can also be checkpointed using bchkpnt. Options specified with -k, -r, -p, and -kn when a job is submitted or modified override the queue-configured options.
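The checkpoint specifications shown in this document (my_dir, my_dir 240, and mydir method=myapp) share a simple shape: a directory, an optional period in minutes, and an optional method name. The Python sketch below models only those forms. The function parse_chkpnt_value is hypothetical, it does not handle directory names containing spaces, and the full grammar accepted by LSF may include more than is shown here.

```python
def parse_chkpnt_value(value):
    """Split a checkpoint specification into (directory, period, method).

    period is in minutes and is None when no periodic checkpointing is
    requested; method is None when no method=... token is present.
    """
    checkpoint_dir = None
    period = None
    method = None
    for token in value.split():
        if token.startswith("method="):
            method = token[len("method="):]
        elif checkpoint_dir is None:
            checkpoint_dir = token
        else:
            period = int(token)
    if checkpoint_dir is None:
        raise ValueError("a checkpoint directory is required")
    return checkpoint_dir, period, method


print(parse_chkpnt_value("my_dir"))              # ('my_dir', None, None)
print(parse_chkpnt_value("my_dir 240"))          # ('my_dir', 240, None)
print(parse_chkpnt_value("mydir method=myapp"))  # ('mydir', None, 'myapp')
```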
Restarting Checkpointed Jobs
LSF can restart a checkpointed job on a host other than the original execution host using the information saved in the checkpoint file to recreate the execution environment. Only jobs that have been checkpointed successfully can be restarted from a checkpoint file. When a job is restarted, LSF performs the following actions:
- LSF re-submits the job to its original queue as a new job and a new job ID is assigned
- When a suitable host is available, the job is dispatched
- The execution environment is recreated from the checkpoint file
- The job is restarted from its last checkpoint.
A job can be restarted manually from the command line, automatically through configuration, and when a job is migrated.
Requirements
LSF can restart a job from its last checkpoint on the execution host, or on another host if the job is migrated. To restart a job on another host, both hosts must:
- Be binary compatible
- Run the same dot version of the operating system. Unpredictable results may occur if both hosts are not running the exact same OS version.
- Have access to the executable
- Have access to all open files (LSF must locate them with an absolute path name)
- Have access to the checkpoint file
Manually restarting jobs
Use the brestart command to manually restart a checkpointed job. To restart a job from its last checkpoint, specify the checkpoint directory and the job ID of the checkpointed job. For example, to restart a checkpointed job with job ID 123 from checkpoint directory my_dir:
% brestart my_dir 123
Job <456> is submitted to default queue <default>
The brestart command allows you to change many of the original submission options. For example, to restart a checkpointed job with job ID 123 from checkpoint directory my_dir and have it start from a queue named priority:
% brestart -q priority my_dir 123
Job <456> is submitted to queue <priority>
Migrating Jobs
Migration is the process of moving a checkpointable or rerunnable job from one host to another host.
Checkpointing enables a migrating job to make progress by restarting it from its last checkpoint. Rerunnable non-checkpointable jobs are restarted from the beginning. LSF provides the ability to manually migrate jobs from the command line and automatically through configuration. When a job is migrated, LSF performs the following actions:
- Stops the job if it is running
- Checkpoints the job if it is checkpointable
- Kills the job on the current host
- Restarts or reruns the job on the next available host, bypassing all pending jobs
Requirements
To migrate a checkpointable job to another host, both hosts must:
- Be binary compatible
- Run the same dot version of the operating system. Unpredictable results may occur if both hosts are not running the exact same OS version.
- Have access to the executable
- Have access to all open files (LSF must locate them with an absolute path name)
- Have access to the checkpoint file
Manually migrating jobs
Use the bmig command to manually migrate jobs. Any checkpointable or rerunnable job can be migrated. Jobs can be manually migrated by the job owner, queue administrator, and LSF administrator. For example, to migrate a job with job ID 123:
% bmig 123
Job <123> is being migrated
% bhist -l 123
Job Id <123>, User <user1>, Command <my_job>
Tue Feb 29 16:50:27: Submitted from host <hostA> to Queue <default>, CWD <$HOME/tmp>, Checkpoint directory <chkpnt_dir/123>;
Tue Feb 29 16:50:28: Started on <hostB>, Pid <4705>;
Tue Feb 29 16:53:42: Migration requested;
Tue Feb 29 16:54:03: Migration checkpoint initiated (actpid 4746);
Tue Feb 29 16:54:15: Migration checkpoint succeeded (actpid 4746);
Tue Feb 29 16:54:15: Pending: Migrating job is waiting for reschedule;
Tue Feb 29 16:55:16: Started on <hostC>, Pid <10354>.
Summary of time in seconds spent in various states by Tue Feb 29 16:57:26
PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
62    0      357  0      0      0      419
Automatically Migrating Jobs
Automatic job migration works on the premise that if a job is suspended (SSUSP) for an extended period of time, due to load conditions or any other reason, the execution host is heavily loaded. To allow the job to make progress and to reduce the load on the host, a migration threshold is configured. LSF allows migration thresholds to be configured for queues and hosts. The threshold is specified in minutes.
When configured on a queue, the threshold will apply to all jobs submitted to the queue. When defined at the host level, the threshold will apply to all jobs running on the host. When a migration threshold is configured on both a queue and host, the lower threshold value is used. If the migration threshold is configured to 0 (zero), the job will be migrated immediately upon suspension (SSUSP).
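The threshold rules above can be summarized in a few lines of Python. This is a sketch of the stated rules only, with hypothetical function names. Treating "suspended for at least the threshold" as the trigger is an assumption, since the text does not say whether the comparison is strict.

```python
def effective_mig_threshold(queue_mig=None, host_mig=None):
    """Return the lower of the configured thresholds in minutes, or None if unset."""
    configured = [t for t in (queue_mig, host_mig) if t is not None]
    return min(configured) if configured else None


def should_migrate(suspended_minutes, threshold):
    """True when a job suspended (SSUSP) this long has reached the threshold."""
    if threshold is None:
        return False  # no migration threshold configured
    # A threshold of 0 means migrate immediately upon suspension.
    return suspended_minutes >= threshold


print(effective_mig_threshold(queue_mig=30, host_mig=10))  # 10
print(should_migrate(0, 0))                                # True
print(should_migrate(29, 30))                              # False
```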
You can use bmig at any time to override a configured threshold.
To configure a migration threshold for a queue, edit lsb.queues and specify a threshold for the MIG parameter. For example, to configure a queue to migrate suspended jobs after 30 minutes:
Begin Queue
...
MIG=30 # Migration threshold set to 30 mins
DESCRIPTION=Migrate suspended jobs after 30 mins
...
End Queue
To configure a migration threshold for a host, edit lsb.hosts and specify a threshold for the MIG parameter for a host. For example, to configure a host to migrate suspended jobs after 30 minutes:
Begin Host
HOST_NAME  r1m  pg  MIG  # Keywords
...
hostA      5.0  18  30
...
End Host
Requeuing migrating jobs
By default, LSF restarts or reruns a migrating job on the next available host, bypassing all pending jobs.
You can configure LSF to requeue migrating jobs rather than immediately restarting them. Jobs will be requeued in PEND state and ordered according to their original submission time and priority. To requeue migrating jobs, edit lsf.conf and set LSB_MIG2PEND=1.
Additionally, you can configure LSF to requeue migrating jobs to the bottom of the queue by editing lsf.conf and setting LSB_MIG2PEND=1 and LSB_REQUEUE_TO_BOTTOM=1.
Date Modified: January 12, 2004
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.