Code Library: Runbook checkpoints

The scripts used in Azure Automation are built on the Windows PowerShell Workflow model, which provides a powerful checkpoint feature for runbooks. Adding checkpoints to a runbook makes it more resilient to transient errors, unexpected exceptions, service delays and outages, network downtime, and other problems that are common in a distributed system such as Microsoft Azure, particularly when the runbook is long-running or manages widely distributed resources. Checkpoints let you confidently automate processes that span multiple networks and systems.

A checkpoint provides a persistence mechanism that you can implement at strategic points in the execution of a Windows PowerShell workflow. If a problem occurs and the workflow is interrupted, it can be resumed from the most recent checkpoint rather than from the beginning. A well-placed checkpoint also keeps an action that has already completed from running a second time and causing a harmful duplicate effect. The goal is a workflow that behaves idempotently: you can run it more than once, but the result is the same as if you had run it only once.

Checkpoints persist the state of a running runbook to the Azure Automation database. Think of a checkpoint as a point-in-time snapshot that includes any output generated so far, any other implicit serialized state information, and the values of all variables at the moment the checkpoint was taken. A checkpoint remains in the database until another checkpoint is taken, which overwrites it, or until the runbook completes.
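As a minimal sketch of what a checkpoint captures, consider the following workflow. The workflow name and the values are illustrative only, and the code assumes a PowerShell Workflow runbook (or Windows PowerShell 5.1, because the workflow keyword is not supported in PowerShell 6 and later).

```powershell
# Illustrative sketch; Test-CheckpointState is a hypothetical name.
workflow Test-CheckpointState
{
    # Values assigned before the checkpoint become part of the persisted state.
    $startedAt = Get-Date
    $processed = 10

    # Output written before the checkpoint is persisted along with it.
    Write-Output -InputObject "Phase 1 started at $startedAt"

    # Persist a snapshot: variables, output so far, and the workflow position.
    Checkpoint-Workflow

    # If the job is interrupted after this point and later resumed, execution
    # continues from here with $startedAt and $processed restored.
    Write-Output -InputObject "Phase 2 continues with $processed items done"
}
```

Because each new checkpoint overwrites the previous one, a resumed job always restarts from the most recent Checkpoint-Workflow call that completed.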

Placing a checkpoint in a runbook carries overhead. Each time a checkpoint is invoked, the workflow’s state is serialized and persisted to storage. If you add many checkpoints to a large Windows PowerShell workflow, its performance can suffer noticeably. So, although you could place a checkpoint before and after each line in a script, use checkpoints selectively so that performance isn’t negatively affected.

Although there are no firm rules on where to put checkpoints, you should plan their placement within a workflow. If the time it takes to rerun a section of an interrupted workflow is less than the time it takes to persist the checkpoint, a checkpoint there is probably not worthwhile. Rather, it makes more sense to place a checkpoint after the workflow has completed a substantial chunk of work. Good candidates include a call to a resource that might not be available or ready, a routine that takes a very long time to complete, or an operation that coordinates multiple resources that are geographically distributed or heavily contended by other processes.

Where you place checkpoints depends on the workflow, its duties, and its performance constraints. You don’t want to persist a checkpoint when it’s not really necessary. Look for activities that are prone to failure, and for expensive work that you would not want to spend time and resources repeating. Set checkpoints in the runbook at those critical points so that a restart does not redo work that has already completed. In particular, place a checkpoint after any activity that is not idempotent, so that it doesn’t run more than once when the workflow resumes.

For example, your runbook might create a Microsoft Azure HDInsight Hadoop cluster with perhaps a hundred or so VMs to process a big data workload. You could set a checkpoint both before and after the commands that create the cluster. If the runbook fails during cluster creation, it resumes from the first checkpoint and repeats the cluster creation work. If the creation succeeds but the runbook fails later, the runbook resumes from the second checkpoint and the HDInsight cluster is not created again.
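The pattern might look roughly like the following sketch. The workflow name and the body of the InlineScript block are hypothetical stand-ins; the real cluster-creation commands are not shown in this text, so a short sleep takes their place.

```powershell
# Illustrative sketch; New-ExampleClusterRunbook is a hypothetical name.
workflow New-ExampleClusterRunbook
{
    param([string]$ClusterName)

    # Checkpoint before the expensive step: a failure during creation
    # resumes from here, so the creation step is attempted again.
    Checkpoint-Workflow

    # Hypothetical stand-in for the real cluster-creation commands.
    $cluster = InlineScript {
        $name = $Using:ClusterName
        Start-Sleep -Seconds 5
        "created: $name"
    }

    # Checkpoint after the step: a later failure resumes from here,
    # so the cluster is not created a second time.
    Checkpoint-Workflow

    Write-Output -InputObject "Continuing with $cluster"
}
```

The second checkpoint is the one that protects the expensive, non-repeatable work; the first only limits how much earlier work is redone if creation itself fails.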

Azure Automation limits the amount of time a runbook can execute to 30 minutes. Azure unloads a runbook that runs longer than that, on the assumption that something has gone wrong or that the runbook is monopolizing the system. The runbook is eventually reloaded, and you will want it to resume from where it left off rather than start over. To ensure that the runbook eventually completes, add checkpoints so that no stretch of work between two consecutive checkpoints takes longer than the 30-minute limit.
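One way to keep the work between checkpoints short is to checkpoint inside a loop. The following is a sketch under assumptions: the workflow name, the Items parameter, and the per-item work are hypothetical, and each item is assumed to finish well within the limit.

```powershell
# Illustrative sketch; Invoke-BatchedWork and its parameter are hypothetical.
workflow Invoke-BatchedWork
{
    param([string[]]$Items)

    foreach ($item in $Items)
    {
        # Hypothetical unit of work for a single item.
        InlineScript {
            $current = $Using:item
            Write-Output "Processing $current"
        }

        # Checkpoint after each item so that, if the job is unloaded and
        # later resumed, items already processed are not processed again.
        Checkpoint-Workflow
    }
}
```

If each item is quick, checkpointing after every item may cost more than it saves; checkpointing after every batch of items is a reasonable variation.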

By using the Checkpoint-Workflow activity within a Windows PowerShell workflow, you tell the system to persist a checkpoint immediately. If an error occurs and the workflow is suspended, it resumes from the most recent checkpoint when the job is resumed. Checkpoint-Workflow is a simple call that takes no parameters and can be placed before or after any workflow command. However, you can’t use the Checkpoint-Workflow activity inside an InlineScript block; place it before or after the block instead.
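The placement rule can be sketched as follows. The workflow name is hypothetical, and the InlineScript body is a trivial placeholder for real work.

```powershell
# Illustrative sketch; Test-CheckpointPlacement is a hypothetical name.
workflow Test-CheckpointPlacement
{
    $timestamp = InlineScript {
        # Ordinary PowerShell runs inside this block. Checkpoint-Workflow
        # is a workflow activity and cannot be called from here.
        Get-Date -Format o
    }

    # Valid placement: at the workflow level, after the block completes.
    Checkpoint-Workflow

    Write-Output -InputObject "Work completed at $timestamp"
}
```

Because the whole InlineScript block runs as a single activity, an interruption in the middle of the block causes the entire block to run again on resume.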

Source of Information : Azure Automation
