HA disable - maintenance

How to disable HA for maintenance

Last updated
Avoid unexpected non-suspect node reboot during maintenance in any High Availability cluster. No need to wait for any grace periods until it becomes inactive by itself, no uncertainties.

If you are going to perform any kind of maintenance works which could disrupt your quorum cluster-wide (e.g. network equipment, small clusters), you would have learnt this risks seemingly random reboots on cluster nodes with (not only) active HA services.

Tip

The rationale for this snippet is covered in a separate post on High Availability related watchdog that Proxmox employ on every single node at all times.

To safely disable HA without additional waiting times and avoiding HA stack bugs, you will want to perform the following:

Before the works

Once (on any node):

mv /etc/pve/ha/{resources.cfg,resources.cfg.bak}

Then on every node:

systemctl stop pve-ha-crm pve-ha-lrm
# check all went well
systemctl is-active pve-ha-crm pve-ha-lrm
# confirm you are ok to proceed without risking a reboot
test -d /run/watchdog-mux.active/ && echo nook || echo ok

After you are done

Reverse the above, so on every node:

systemctl start pve-ha-crm pve-ha-lrm

And then once all nodes are ready, reactivate the HA:

mv /etc/pve/ha/{resources.cfg.bak,resources.cfg}

Manual pages:  test
Your feedback on the content is welcome in the GitHub Discussions.