Failover Cluster


We use Hyper-V for virtualization and have several Hyper-V clusters running hundreds of VMs. From time to time we need to reboot the Hyper-V host servers, usually when we push updates to them or sometimes just to refresh them. We all know that Windows likes regular reboots to refresh itself.

So, whenever we needed to reboot the host servers, we had to do it manually. The steps were:

  • “Pause” the first host/node and “Drain” its roles
  • Once it is in the “Paused” state, reboot it and wait for it to come online
  • When the host is online, “Resume” it and move on to the next host
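
For reference, the three manual steps above map roughly onto three cmdlets. This is only a minimal sketch for a single node, not the attached script; it assumes the FailoverClusters PowerShell module is available, and the function name Invoke-ManualNodeReboot and its parameters are hypothetical names I made up for illustration.

```powershell
# Sketch only: the three manual steps for one node, expressed as cmdlets.
# Assumes the FailoverClusters module; the function name is hypothetical.
function Invoke-ManualNodeReboot {
    param(
        [Parameter(Mandatory = $true)][string]$Cluster,
        [Parameter(Mandatory = $true)][string]$Node
    )

    # Step 1: pause the node and drain its roles.
    # -Wait asks the cmdlet to return only after the drain has finished.
    Suspend-ClusterNode -Name $Node -Cluster $Cluster -Drain -Wait | Out-Null

    # Step 2: reboot the node and wait until PowerShell answers on it again.
    Restart-Computer -ComputerName $Node -Wait -For PowerShell -Force

    # Step 3: resume the node so it can host roles again.
    Resume-ClusterNode -Name $Node -Cluster $Cluster | Out-Null
}
```

Calling Invoke-ManualNodeReboot -Cluster "ClusterName" -Node "NodeName" would run the three steps once with no error handling, which is exactly the gap the full script tries to fill.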

This was a time-consuming task, so I decided to develop a script to automate it. The script is attached and you can download and use it if you want.

Now let me explain a bit about what this script does. As you can see, I have added some output commands so we can monitor the execution and see what is being done at each step. The script takes the cluster name as an argument. You run the script using the command below:

.\RebootClusterHosts.ps1 -Cluster "ClusterName"

Replace “ClusterName” above with the name of your cluster. The script first gets a list of all the nodes in the cluster and then goes into a “for” loop to process each node. It first checks whether the node is up (one or more nodes may be down because of an issue or maintenance); if the node is not up, the script skips it and moves to the next node. If the node is up, the script starts draining the Roles/VMs from the node and pauses it. It then waits until all the Roles/VMs are drained and the node is paused.

When this process is complete, the script waits for 5 seconds before reading the status of the node (I added a “sleep” for 5 seconds just to give the status a little more time to refresh). It then checks the node's DrainStatus and State. Unless DrainStatus is “Completed” and State is “Paused”, something went wrong. We have seen cases where not all the Roles/VMs are successfully migrated to other nodes and the drain fails. In that case the script tells the user that draining was not successful and asks them to drain the node manually and press the [Enter] key when done, so that the script execution can continue.
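
The drain step for one node could be sketched like this. It is a simplified illustration under a few assumptions, not the attached script: the function names Test-NodeDrained and Suspend-NodeForReboot and the SettleSeconds parameter are hypothetical, and I am assuming that Get-ClusterNode reports the states as “Up”, “Paused” and “Completed”.

```powershell
# Sketch only, not the attached script. Assumes the FailoverClusters module;
# the function and parameter names below are hypothetical.

# Returns $true when a node object reports a finished drain and a paused state.
function Test-NodeDrained {
    param([Parameter(Mandatory = $true)]$Node)
    # 'Completed' and 'Paused' are the values Get-ClusterNode is assumed to report.
    $drained = ("$($Node.DrainStatus)" -eq 'Completed') -and ("$($Node.State)" -eq 'Paused')
    return $drained
}

# Drains and pauses one node. Returns $false if the node was skipped, else $true.
function Suspend-NodeForReboot {
    param(
        [Parameter(Mandatory = $true)][string]$Cluster,
        [Parameter(Mandatory = $true)][string]$NodeName,
        [int]$SettleSeconds = 5
    )

    # Skip nodes that are not up (for example, down for maintenance).
    $node = Get-ClusterNode -Name $NodeName -Cluster $Cluster
    if ("$($node.State)" -ne 'Up') {
        Write-Host "Skipping $NodeName (state: $($node.State))"
        return $false
    }

    Write-Host "Draining roles from $NodeName ..."
    Suspend-ClusterNode -Name $NodeName -Cluster $Cluster -Drain -Wait | Out-Null

    # Give the cluster a moment to refresh the node status before reading it.
    Start-Sleep -Seconds $SettleSeconds

    $node = Get-ClusterNode -Name $NodeName -Cluster $Cluster
    if (-not (Test-NodeDrained -Node $node)) {
        # Hand over to the operator instead of rebooting a node that still holds roles.
        Write-Warning "Draining $NodeName was not successful."
        Read-Host "Drain the node manually, then press [Enter] to continue" | Out-Null
    }
    return $true
}
```

A caller would loop over the node names returned by Get-ClusterNode -Cluster "ClusterName" and call Suspend-NodeForReboot for each one, moving on to the reboot only when it returns $true.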

If the draining completed successfully, no user interaction is needed and the script continues to restart the node and then waits for the node to come online. I added the switch to the restart command that makes it wait for PowerShell on the node to come online. I have seen that this check is not enough and sometimes the nodes take longer to become fully available after the reboot. That is why I have also added a “sleep” here to give the node another 15 seconds. After that there is a while loop that checks the status of the node every 1 second until it becomes available. When the node becomes available, the script Resumes/UnPauses the node. I have used another while loop here, in which the script tries to resume the node and checks its status every 1 second until the status becomes “Up”. I have run into cases where a single attempt to resume the node did not go well, which is why I added this while loop.
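
The restart and resume part could be sketched as follows. Again this is a simplified illustration, not the attached script. The function name Restart-AndResumeNode and its parameters are hypothetical, and I am assuming that “available” means the cluster no longer reports the node as “Down”; the real script may check availability differently.

```powershell
# Sketch only, not the attached script. Assumes the FailoverClusters module;
# the function and parameter names are hypothetical.
function Restart-AndResumeNode {
    param(
        [Parameter(Mandatory = $true)][string]$Cluster,
        [Parameter(Mandatory = $true)][string]$NodeName,
        [int]$ExtraWaitSeconds = 15,
        [int]$PollSeconds = 1
    )

    Write-Host "Restarting $NodeName ..."
    # -Wait -For PowerShell returns once PowerShell answers on the node again.
    Restart-Computer -ComputerName $NodeName -Wait -For PowerShell -Force

    # That check alone can return before the node is fully ready, so wait a bit more.
    Start-Sleep -Seconds $ExtraWaitSeconds

    # Assumption: "available" means the cluster no longer reports the node as Down.
    while ("$((Get-ClusterNode -Name $NodeName -Cluster $Cluster).State)" -eq 'Down') {
        Start-Sleep -Seconds $PollSeconds
    }

    # A single resume attempt does not always succeed, so retry until the node is Up.
    while ("$((Get-ClusterNode -Name $NodeName -Cluster $Cluster).State)" -ne 'Up') {
        Resume-ClusterNode -Name $NodeName -Cluster $Cluster -ErrorAction SilentlyContinue | Out-Null
        Start-Sleep -Seconds $PollSeconds
    }
    Write-Host "$NodeName is Up."
}
```

Both loops here run until the condition is met, as described above. If a node could stay down for good, a maximum number of attempts would be a sensible addition.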

When the node’s status becomes “Up” the script moves to the next node and does all that processing mentioned above on that node.

I have run this script multiple times and have tuned it as I came across issues. You are all welcome to modify it according to your needs. If you find something that can be done better, please do let me know.

This script can be used for rebooting the hosts/nodes of any failover cluster, not only Hyper-V clusters.

Thanks all!

Download the script below.