Learn more about Platform products at http://www.platform.com

[ Platform Documentation ] [ Title ] [ Contents ] [ Previous ] [ Next ] [ Index ]



Job Checkpoint, Restart, and Migration


Contents

[ Top ]


Checkpointing Jobs

Checkpointing a job involves capturing the state of an executing job, the data necessary to restart the job, and not wasting the work done to get to the current stage. The job state information is saved in a checkpoint file. There are many reasons why you would want to checkpoint a job.

Fault tolerance

To provide job fault tolerance, checkpoints are taken at regular intervals (periodically) during the job's execution. If the job is killed or migrated, or if the job fails for a reason other than host failure, the job can be restarted from its last checkpoint and not waste the efforts to get it to its current stage.

Migration

Checkpointing enables a migrating job to make progress rather than restarting the job from the beginning. Jobs can be migrated when a host fails or when a host becomes unavailable due to load.

Load balancing

Checkpointing a job and restarting it (migrating) on another host provides load balancing by moving load (jobs) from a heavily loaded host to a lightly loaded host.

In this section

[ Top ]


Approaches to Checkpointing

LSF provides support for most checkpoint and restart implementations through uniform interfaces, echkpnt and erestart. All interaction between LSF and the checkpoint implementations are handled by these commands. See the echkpnt(8) and erestart(8) man pages for more information.

Checkpoint and restart implementations are categorized based on the facility that performs the checkpoint and the amount of knowledge an executable has of the checkpoint. Commonly, checkpoint and restart implementations are grouped as kernel-level, user-level, and application-level.

Kernel-level checkpointing

Kernel-level checkpointing is provided by the operating system and can be applied to arbitrary jobs running on the system. This approach is transparent to the application, there are no source code changes and no need to re-link your application with checkpoint libraries.

To support kernel-level checkpoint and restart, LSF provides an echkpnt and erestart executable that invokes OS specific system calls.

Kernel-level checkpointing is currently supported on:

See the chkpnt(1) man page on Cray systems and the cpr(1) man page on IRIX systems for the limitations of their checkpoint implementations.

User-level checkpointing

LSF provides a method to checkpoint jobs on systems that do not support kernel-level checkpointing called user-level checkpointing. To implement user-level checkpointing, you must have access to your applications object files (.o files), and they must be re-linked with a set of libraries provided by LSF in LSF_LIBDIR. This approach is transparent to your application, its code does not have to be changed and the application does not know that a checkpoint and restart has occurred.

Application-level checkpointing

The application-level approach applies to those applications which are specially written to accommodate the checkpoint and restart. The application writer must also provide an echkpnt and erestart to interface with LSF. For more details see the echkpnt(8) and erestart(8) man pages. The application checkpoints itself either periodically or in response to signals sent by other processes. When restarted, the application itself must look for the checkpoint files and restore its state.

[ Top ]


Creating Custom echkpnt and erestart for Application-level Checkpointing

Different applications may have different checkpointing implementations and custom echkpnt and erestart programs.

You can write your own echkpnt and erestart programs to checkpoint your specific applications and tell LSF which program to use for which application.

Writing custom echkpnt and erestart programs

Programming language

You can write your own echkpnt and erestart interfaces in C or Fortran.

Name

Assign the name echkpnt.method_name and erestart.method_name, where method_name is the name that identifies this is the program for a specific application.

For example, if your custom echkpnt is for my_app, you would have: echkpnt.my_app, erestart.my_app.

Location

Place echkpnt.method_name and erestart.method_name in LSF_SERVERDIR. You can specify a different directory with LSB_ECHKPNT_METHOD_DIR as an environment variable or in lsf.conf.

The method name (LSB_ECHKPNT_METHOD in lsf.conf or as an environment variable) and location (LSB_ECHKPNT_METHOD_DIR) combination must be unique in the cluster. For example, you may have two echkpnt applications with the same name such as echkpnt.mymethod but what differentiates them is the different directories defined with LSB_ECHKPNT_METHOD_DIR.

The checkpoint method directory should be accessible by all users who need to run the custom echkpnt and erestart programs.

Supported syntax for echkpnt

Your echkpnt.method_name must recognize commands in the following syntax as these are the options used by echkpnt to communicate with your echkpnt.method_name:

echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID

For more details on echkpnt syntax, see the echkpnt(8) man page.

Supported syntax for erestart

Your erestart.method_name must recognize commands in the following syntax as these are the options used by erestart to communicate with your erestart.method_name .

erestart [-c] [-f] checkpoint_dir 

For more details, see the erestart(8) man page.

Return values for echkpnt.method_name

If echkpnt.method_name is able to successfully checkpoint the job, it exits with a 0. Non-zero values indicate job checkpoint failed.

stderr and stdout are ignored by LSF. You can save these to a file by setting LSB_ECHKPNT_KEEP_OUTPUT=y in lsf.conf or as an environment variable.

Return values for erestart.method_name

erestart.method_name creates the file checkpoint_dir/$LSB_JOBID/.restart_cmd and writes in this file the command to restart the job or process group in the form:

LSB_RESTART_CMD=restart_command

For example, if the command to restart a job is my_restart my_job, the erestart.method_name writes to the .restart_cmd file:

LSB_RESTART_CMD=my_restart my_job

erestart then reads the .restart_cmd file and uses the command specified with LSB_RESTART_CMD as the command to restart the job.

You have the choice of writing to the file or not. Return a 0 if erestart.method_name succeeds in writing the job restart command to the file checkpoint_dir/$LSB_JOBID/.restart_cmd, or if it purposefully writes nothing to the file. Non-zero values indicate that erestart.method_name was not able to restart the job.

For user-level checkpointing, erestart.method_name must collect the exit code from the job. Then, erestart.method_name must exit with the same exit code as the job. Otherwise, the job's exit status is not reported correctly to LSF. Kernel-level checkpointing works differently and does not need this information from erestart.method_name to restart the job.

erestart.method_name

Note


Any information echkpnt writes to stderr is considered by sbatchd as an echkpnt failure. However, not all errors are fatal. If the chkpnt explicitly writes to stdout or stderr "Checkpoint done", sbatchd assumes echkpnt has succeeded.

Configuring LSF to recognize the custom echkpnt and erestart

You can set the following parameters in lsf.conf or as environment variables. If set in lsf.conf, these parameters apply globally to the cluster and will be the default values. Parameters specified as environment variables override the parameters specified in lsf.conf.

If you set parameters in lsf.conf, reconfigure your cluster with lsadmin reconfig and badmin mbdrestart so that changes take effect.

  1. Set LSB_ECHKPNT_METHOD=method_name in lsf.conf or as an environment variable

    OR

    When you submit the job, specify the checkpoint and restart method. For example:

    % bsub -k "mydir method=myapp" job1
    
  2. Copy your echkpnt.method_name and erestart.method_name to LSF_SERVERDIR.

    OR

    If you want to specify a different directory than LSF_SERVERDIR, in lsf.conf or as an environment variable set LSB_ECHKPNT_METHOD_DIR= absolute path to the directory in which your echkpnt.method_name and erestart.method_name are located.

    The checkpoint method directory should be accessible by all users who need to run the custom echkpnt and erestart programs.

  3. (Optional)

    To save standard error and standard output messages for echkpnt. method_name and erestart.method_name set LSB_ECHKPNT_KEEP_OUTPUT=y in lsf.conf or as an environment variable.

    The stdout and stderr output generated by echkpnt. method_name will be redirected to:

    • checkpoint_dir/$LSB_JOBID/echkpnt.out
    • checkpoint_dir/$LSB_JOBID/echkpnt.err

    The stdout and stderr output generated by erestart.method_name will be redirected to:

    • checkpoint_dir/$LSB_JOBID/erestart.out
    • checkpoint_dir/$LSB_JOBID/erestart.err

    Otherwise, if LSB_ECHKPNT_KEEP_OUTPUT is not defined, standard error and output will be redirected to /dev/null and discarded.

[ Top ]


Checkpointing a Job

Before LSF can checkpoint a job, it must be made checkpointable. LSF provides automatic and manual controls to make jobs checkpointable and to checkpoint jobs. When working with checkpointable jobs, a checkpoint directory must always be specified. Optionally, a checkpoint period can be specified to enable periodic checkpointing.

When a job is checkpointed, LSF performs the following actions:

  1. Stops the job if its running
  2. Creates the checkpoint file in the checkpoint directory
  3. Restarts the job

Prerequisites

LSF can create a checkpoint for any eligible job. Review the discussion about Approaches to Checkpointing to determine if your application and environment are suitable for checkpointing.

In this section

[ Top ]


The Checkpoint Directory

A checkpoint directory must be specified for every checkpointable job and is used to store the files to restart a job. The directory must be writable by the job owner. To restart the job on another host (job migration), the directory must be accessible by both hosts. LSF does not delete the checkpoint files; checkpoint file maintenance is the user's responsibility.

LSF writes the checkpoint file in a directory named with the job ID of the job being checkpointed under the checkpoint directory. This allows LSF to checkpoint multiple jobs to the same checkpoint directory. For example, when you specify a checkpoint directory called my_dir and when job 123 is checkpointed, LSF will save the checkpoint file in:

my_dir/123/

When LSF restarts a checkpointed job, it renames the checkpoint directory using the job ID of the new job and creates a symbolic link from the old checkpoint directory to the new one. For example, if a job with job ID 123 is restarted with job ID 456 the checkpoint directory will be renamed to:

my_dir/456/

[ Top ]


Making Jobs Checkpointable

Making a job checkpointable involves specifying the location of a checkpoint directory to LSF. This can be done manually on the command line or automatically through configuration.

Manually

Manually making a job checkpointable involves specifying the checkpoint directory on the command line. LSF will create the directory if it does not exist. A job can be made checkpointable at job submission or after submission.

At job submission

Use the -k "checkpoint_dir" option of bsub to specify the checkpoint directory for a job at submission. For example, to specify my_dir as the checkpoint directory for my_job:

% bsub -k "my_dir" my_job
Job <123> is submitted to default queue <default>.

After job submission

Use the -k "checkpoint_dir" option of bmod to specify the checkpoint directory for a job after submission. For example, to specify my_dir as the checkpoint directory for a job with job ID 123:

% bmod -k "my_dir" 123
Parameters of job <123> are being changed

Automatically

Automatically making a job checkpointable involves submitting the job to a queue that is configured for checkpointable jobs. To configure a queue, edit lsb.queues and specify the checkpoint directory for the CHKPNT parameter on a queue. The checkpoint directory must already exist, LSF will not create the directory.

For example, to configure a queue for checkpointable jobs using a directory named my_dir:

Begin Queue
  ... 
  CHKPNT=my_dir
  DESCRIPTION = Make jobs checkpointable using "my_dir"
  ... 
End Queue

[ Top ]


Manually Checkpointing Jobs

LSF provides the bchkpnt command to manually checkpoint jobs. LSF can only perform a checkpoint for checkpointable jobs as described in Making Jobs Checkpointable. For example, to checkpoint a job with job ID 123:

% bchkpnt 123
Job <123> is being checkpointed

Interactive jobs (bsub -I) cannot be checkpointed.

Checkpointing and killing a job

By default, after a job has been successfully checkpointed, it continues to run. Use the bchkpnt -k command to kill the job after the checkpoint file has been created. Killing the job ensures the job does not do any processing or I/O after the checkpoint. For example, to checkpoint and kill a job with job ID 123:

% bchkpnt -k 123
Job <123> is being checkpointed

[ Top ]


Enabling Periodic Checkpointing

Periodic checkpointing involves creating a checkpoint file at regular time intervals during the execution of your job. LSF provides the ability to enable periodic checkpointing manually on the command line and automatically through configuration. Automatic periodic checkpointing is discussed in Automatically Checkpointing Jobs. LSF can only perform a checkpoint for checkpointable jobs as described in Making Jobs Checkpointable.

Manually enabling periodic checkpointing involves specifying a checkpoint period in minutes.

At job submission

LSF uses the -k "checkpoint_dir checkpoint_period" option of bsub to enable periodic checkpointing at job submission. For example, to periodically checkpoint my_job every 2 hours (120 minutes):

% bsub -k "my_dir 120" my_job
Job <123> is submitted to default queue <default>.

After job submission

LSF uses the -p period option of bchkpnt to enable periodic checkpointing after submission. When a checkpoint period is specified after submission, LSF checkpoints the job immediately then checkpoints it again after the specified period of time. For example, to periodically checkpoint a job with job ID 123 every 2 hours (120 minutes):

% bchkpnt -p 120 123
Job <123> is being checkpointed

You can also use the -p option of bchkpnt to change a checkpoint period. For example, to change the checkpoint period of a job with job ID 123 to every 4 hours (240 minutes):

% bchkpnt -p 240 123
Job <123> is being checkpointed

Disabling periodic checkpointing

To disable periodic checkpointing, specify a period of 0 (zero). For example, to disable periodic checkpointing for a job with job ID 123:

% bchkpnt -p 0 123
Job <123> is being checkpointed

[ Top ]


Automatically Checkpointing Jobs

Automatically checkpointing jobs involves submitting a job to a queue that is configured for periodic checkpointing. To configure a queue, edit lsb.queues and specify a checkpoint directory and a checkpoint period for the CHKPNT parameter for a queue. The checkpoint directory must already exist, LSF will not create the directory. The checkpoint period is specified in minutes. All jobs submitted to the queue will be automatically checkpointed. For example, to configure a queue to periodically checkpoint jobs every 4 hours (240 minutes) to a directory named my_dir:

Begin Queue
  ... 
  CHKPNT=my_dir 240
  DESCRIPTION=Auto chkpnt every 4 hrs (240 min) to my_dir
  ... 
End Queue

All jobs submitted to a queue configured for checkpointing can also be checkpointed using bchkpnt. Jobs submitted and modified using -k, -r, -p, and -kn options override queue configured options.

[ Top ]


Restarting Checkpointed Jobs

LSF can restart a checkpointed job on a host other than the original execution host using the information saved in the checkpoint file to recreate the execution environment. Only jobs that have been checkpointed successfully can be restarted from a checkpoint file. When a job is restarted, LSF performs the following actions:

  1. LSF re-submits the job to its original queue as a new job and a new job ID is assigned
  2. When a suitable host is available, the job is dispatched
  3. The execution environment is recreated from the checkpoint file
  4. The job is restarted from its last checkpoint.

This can be done manually from the command line, automatically through configuration, and when a job is migrated.

Requirements

LSF can restart a job from its last checkpoint on the execution host, or on another host if the job is migrated. To restart a job on another host, both hosts must:

Manually restarting jobs

Use the brestart command to manually restart a checkpointed job. To restart a job from its last checkpoint, specify the checkpoint directory and the job ID of the checkpointed job. For example, to restart a checkpointed job with job ID 123 from checkpoint directory my_dir:

% brestart my_dir 123
Job <456> is submitted to default queue <default>

The brestart command allows you to change many of the original submission options. For example, to restart a checkpointed job with job ID 123 from checkpoint directory my_dir and have it start from a queue named priority:

% brestart -q priority my_dir 123
Job <456> is submitted to queue <priority>

[ Top ]


Migrating Jobs

Migration is the process of moving a checkpointable or rerunnable job from one host to another host.

Checkpointing enables a migrating job to make progress by restarting it from its last checkpoint. Rerunnable non-checkpointable jobs are restarted from the beginning. LSF provides the ability to manually migrate jobs from the command line and automatically through configuration. When a job is migrated, LSF performs the following actions:

  1. Stops the job if it is running
  2. Checkpoints the job if it is checkpointable
  3. Kills the job on the current host
  4. Restarts or reruns the job on the next available host, bypassing all pending jobs

Requirements

To migrate a checkpointable job to another host, both hosts must:

Manually migrating jobs

Use the bmig command to manually migrate jobs. Any checkpointable or rerunnable job can be migrated. Jobs can be manually migrated by the job owner, queue administrator, and LSF administrator. For example, to migrate a job with job ID 123:

bmig 123
Job <123> is being migrated

% bhist -l 123
Job Id <123>, User <user1>, Command <my_job>
Tue Feb 29 16:50:27: Submitted from host <hostA> to Queue <default>, C
  WD <$HOME/tmp>, Checkpoint directory <chkpnt_dir/123>;
Tue Feb 29 16:50:28: Started on <hostB>, Pid <4705>;
Tue Feb 29 16:53:42: Migration requested;
Tue Feb 29 16:54:03: Migration checkpoint initiated (actpid 4746);
Tue Feb 29 16:54:15: Migration checkpoint succeeded (actpid 4746);
Tue Feb 29 16:54:15: Pending: Migrating job is waiting for reschedule;
Tue Feb 29 16:55:16: Started on <hostC>, Pid <10354>.

Summary of time in seconds spent in various states by Tue Feb 29 16:57:26
PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
62       0        357      0        0        0        419

Automatically Migrating Jobs

Automatic job migration works on the premise that if a job is suspended (SSUSP) for an extended period of time, due to load conditions or any other reason, the execution host is heavily loaded. To allow the job to make progress and to reduce the load on the host, a migration threshold is configured. LSF allows migration thresholds to be configured for queues and hosts. The threshold is specified in minutes.

When configured on a queue, the threshold will apply to all jobs submitted to the queue. When defined at the host level, the threshold will apply to all jobs running on the host. When a migration threshold is configured on both a queue and host, the lower threshold value is used. If the migration threshold is configured to 0 (zero), the job will be migrated immediately upon suspension (SSUSP).

You can use bmig at anytime to override a configured threshold.

Configuring queue migration threshold

To configure a migration threshold for a queue, edit lsb.queues and specify a threshold for the MIG parameter. For example, to configure a queue to migrate suspended jobs after 30 minutes:

Begin Queue 
  ... 
  MIG=30        # Migration threshold set to 30 mins 
  DESCRIPTION=Migrate suspended jobs after 30 mins
  ... 
End Queue

Configuring host migration threshold

To configure a migration threshold for a host, edit lsb.hosts and specify a threshold for the MIG parameter for a host. For example, to configure a host to migrate suspended jobs after 30 minutes:

Begin Host
  HOST_NAME   r1m   pg   MIG # Keywords
  ...
  hostA       5.0   18   30
  ...
End Host

Requeuing migrating jobs

By default, LSF restarts or reruns a migrating job on the next available host, bypassing all pending jobs.

You can configure LSF to requeue migrating jobs rather than immediately restarting them. Jobs will be requeued in PEND state and ordered according to their original submission time and priority. To requeue migrating jobs, edit lsf.conf and set LSB_MIG2PEND=1.

Additionally, you can configure LSF to requeue migrating jobs to the bottom of the queue by editing lsf.conf and setting LSB_MIG2PEND=1 and LSB_REQUEUE_TO_BOTTOM=1.

[ Top ]


[ Platform Documentation ] [ Title ] [ Contents ] [ Previous ] [ Next ] [ Index ]


      Date Modified: January 12, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.