Cluster

Checkpoints (Snapshots) are great feature for saving the state of the Virtual Machine before making changes like applying patches or updates, installing or configuring applications etc. If new changes or updates break the system, we can easily go back to the previous state. To allow this, Hyper-V creates another virtual disk file to store all the changes (in simple words). This causes the file to grow in size and depending on how much change data is being created on the VM it can grow either fast or slow. If the location where the checkpoints are being stored runs out of available free space, Hyper-V will pause the VM. This not only affects the VM that has the growing checkpoint, but also has the potential to affect other VMs that are stored in the same location (Storage Volume/LUN). This process occurs so that the VMs don’t start failing disk writes within the OS of the VM and possibly cause file corruption. To recover from this we have to delete the checkpoints that are not needed anymore. Because of this we have to be very careful when creating checkpoints and remember to delete the checkpoints that are not needed any more.

To avoid this problem and keep a control on the checkpoints I developed a PowerShell script that checks all the checkpoints for all the VMs and deletes the checkpoints that are older X number of day. The attached script currently checks for 3 days or older checkpoints, but this can be easily changed. Also, there can be occasions where we want some checkpoints to stay e.g. if there is a maintenance that is taking longer and we want to keep those checkpoints for longer than 3 days. For adding this function I changed the script to check for a file called “CheckPointsException.txt”, this file will have the names of the all the VMs that have exceptions to keep their checkpoints longer than 3 days. So, for all the VMs that are listed in this file, my script will not delete the checkpoints. Once the maintenance is done we can remove the VM name from this file and next time the scripts runs it will delete the checkpoints for that VM. This script can be add in the Windows Task Scheduler to run every day, or whatever schedule is need. This script also creates an HTML report will a list of VMs with their checkpoints that are older than 3 days and their status (exceptions etc.). It also emails this report to the list of recipients defined in the script. So, this report acts as a reminder that there are checkpoints that are in exceptions list so they are not forgotten.

You can add the Hyper-V Clusters and Standalone hosts at the top of the script. You would also need to set the paths to the reports file and CheckPointsException.txt file.

You can download the script below and rename it to .ps1. Please send me comments.

We are using Hyper-V for virtualization. We have several Hyper-V clusters running hundreds of VMs. From time to time there is a need to the reboot Hyper-V host servers usually when we push updates to them or sometimes just to refresh them. We all know that Window OS likes regular reboots to refresh itself.

So, whenever we needed to reboot the host servers, we had to do them manually. The steps we had to do are:

  • Pause” first host/node and “Drain” roles
  • Once it is in “Pause” state, reboot it and wait for it come online
  • When the host is online, “Resume” it and move to the next host

This was a time consuming task therefore I decided to develop a script to automate this task. The script is attached and you can download and use it if you want.

Now let me explain a bit what this script does. As you can see I have added some output commands so we can we can monitor the execution and see what is being done at each step. The script takes the cluster name as an argument. You run the script using the command below:

.\RebootClusterHosts.ps1 -Cluster "ClusterName"

Replace the above “ClusterName” with the name of your cluster. At first the script gets a list of all the nodes in the cluster and then goes into a “for” loop to process each node. It firsts check if the node is up (there maybe one or more nodes down because of any issues or maintenance), if that node is not up then it skips it and moves to the next node. If the node is up, it starts the process to Drain the Roles/VMs from the node and pause it. It then waits until all the Roles/VMs are drained and the node is paused. When this process is complete, the script waits for 5 seconds before reading the status of the node (I added a “sleep” for 5 seconds just to give it a few more time to refresh). After that it checks the status of the process, it checks the node DrainStatus and node State. If DrainStatus is not “Complete” and State is not “Paused” that means something went wrong. We have seen issues where sometimes all the Roles/VMs are not successfully migrated to other nodes and this process fails. This is where the script prompts the user that draining was not successful and they need to drain the node manually and press [Enter] key when done, so the script execution can continue. So, here the user needs to drain the node manually, once done they can press [Enter] key.

If the draining was completed successfully, no user interaction is needed and the script continues to restart the node and then it waits for the node the come online. I added the switch to the Restart command to wait for PowerShell status to come online. I have seen that this check is not enough and sometimes the nodes take longer to fully become available after the reboot. That is why here I have also added a “sleep” for 15 seconds to give the node another 15 seconds. Then after that there is a while loop that checks the status of the node every 1 second until it becomes available. When the node becomes available, the script Resumes/UnPauses the node. I have used another while loop here where the script tries to resume the node and check it’s status after every 1 second until it’s status becomes “Up“. I have ran into the issue where sometimes trying to resume the node once did not go well therefore I added a while loop here.

When the node’s status becomes “Up” the script moves to the next node and does all that processing mentioned above on that node.

I have ran this script multiple times and have tuned it as I came across issues. You all are welcome to modify it according to your needs. If you find something that can be done better, please do let me know.

This script can be used for rebooting failover cluster hosts/nodes and not only Hyper-V cluster hosts/nodes.

Thanks all!

Download the script below.