Fragile Proxmox cluster management
Proxmox VE has been well known for its support of clustered environments since very early on - version 2.0, released in 2011. More specifically, these are quorum-based clusters without reliance on any single master or control node, which might appear appealing. A crucial component of this model is provided by the Corosync project, even though its native tooling is typically hidden from the user - and so is an understanding of some additional quirks Proxmox bring to the table with some of their design choices.
We had previously covered how Corosync merely facilitates communication amongst cluster nodes and that most of the bespoke cluster-health logic is actually provided by the virtual Proxmox Cluster filesystem - also dubbed pmxcfs. Making a clear distinction between the two is important, considering some odd behaviour one would not expect even when generally familiar with Corosync as such.
Management of cluster changes
Clusters can be conveniently formed and extended with new nodes via the graphical user interface (GUI) alone; some more options are provided by the application programming interface (API) or - more pragmatically for the average user - the command-line interface (CLI).
Note
Both GUI and CLI typically make use of API calls, but the CLI historically used to rely on direct command execution over SSH, which is still possible.
A perfect example that necessitates the use of the CLI - provided by the pvecm command set - as it has no corresponding GUI equivalent, is the removal of cluster node(s), which in fact requires more elaborate steps. Whilst these (common) scenarios are well-documented, the documentation leaves a lot of the reasons behind the requirements to guesswork.
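For illustration only - the node name is a placeholder and the exact steps depend on the scenario - the CLI-only removal of a permanently retired node boils down to something like:

# run on any REMAINING cluster node; "pve3" is a hypothetical node that is
# already powered off and will never come back under the same identity
pvecm nodes                 # confirm current membership
pvecm status                # confirm the cluster is quorate without it
pvecm delnode pve3          # remove it from the Corosync members list

# leftovers of the removed node remain under /etc/pve/nodes/pve3
# and can be cleaned up manually once everything is verified healthy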
Connections
The Corosync communication is meant to use its own network - at least in production setups - and with redundancy (so-called separate links, or anachronistically - rings). Meanwhile, there are at least two other types of network traffic a node may instigate:
- API calls handled by the pveproxy service; and
- SSH sessions, which are meant to be phased out in favour of the API calls.
Note
The GUI is essentially a single-page web application running in a browser, itself served to the client by the same proxy that processes the API calls.
Single node cluster
There are single node installs, there are clusters, and there are single nodes which are in a cluster - of their own. This is typically not an intended permanent state, but it does exist. The distinction between a single node and a single clustered node - ready to accept new members - is important.
Under normal circumstances, the virtual filesystem would be receiving filesystem writes from the mountpoint of /etc/pve and sharing them with the rest of the cluster - this is what the Corosync messages are delivering. When there's no one to deliver to (and from), pmxcfs still provides its consistency guarantees to the node, but there is no need for any messaging with others. Thus, on a freshly installed node, there's no Corosync service running and pmxcfs runs in a "local mode" - not expecting any messages.
Tip
You might be familiar with the "local mode" through its -l switch, which is advisable in some troubleshooting scenarios and in fact can be useful when pmxcfs refuses to provide a mount, such as when one needs to extract the cluster configuration. That said, using this mode has its pitfalls and can turn counterproductive - more on that below.
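As a sketch of one such troubleshooting scenario - standard paths, but treat the exact sequence as indicative only:

# stop the regular services so that pmxcfs can be run by hand
systemctl stop pve-cluster corosync

# mount the virtual filesystem in local mode - no quorum, no Corosync
pmxcfs -l

# extract whatever is needed, e.g. the cluster configuration
cp /etc/pve/corosync.conf /root/corosync.conf.extracted

# unmount again and return to normal operation
killall pmxcfs
systemctl start pve-cluster    # and corosync, if/when appropriate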
When a new cluster is about to be formed, the initiating node would be "creating" a cluster - this is a separate operation, i.e. forming a cluster entails ramping up the first node and then pointing the second node at the first. Between these two operations, there's a limbo state when the first node is in a cluster, but it only contains itself as a member in the configuration. Yet the Corosync service has been started and is ready for others to join.
Important
This is a completely arbitrary concept of Proxmox - the "initiating and joining party" - as are the ordering of operations and the manual intervention necessary at the respective nodes. This has nothing to do with the Corosync messaging setup itself.
Limitations for joining nodes
The fact that PVE performs checks, such as that joining nodes must be "empty", is a matter of avoiding the extra overhead of merging existing instances of pmxcfs rather than anything to do with Corosync. The virtual filesystem held in /etc/pve is essentially one and the same, kept in sync across all nodes.
Note
The database backend files are ALSO identical on all nodes; they carry no node-specific information in relation to the host they reside on. This is why cluster-wide configuration can be easily backed up from any of the nodes equally well.
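As a sketch - the backend is an SQLite database at /var/lib/pve-cluster/config.db, so a backup taken from any one healthy node is simple; the sqlite3 CLI is assumed to be installed:

# dump the pmxcfs backend database to plain SQL - identical content on any node
sqlite3 /var/lib/pve-cluster/config.db .dump > /root/pmxcfs-backup.sql

# alternatively, archive the mounted view while the node is quorate
tar -czf /root/etc-pve-backup.tar.gz -C /etc/pve .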
When a new node is joining, its original database is ditched and replaced by the one from the cluster. Since pmxcfs does not allow e.g. duplicate VM IDs (part of the consistency guarantees it is meant to provide), it would otherwise need to see to it that such IDs get renumbered when joining - were identical ones present on multiple different non-empty nodes - but it is easier to offload this responsibility onto the user.
Role of Corosync
Up to this point, it might have appeared that the pmxcfs specifics were being examined with no relation to the clustering mechanism per se. Yet the Corosync messages are used, amongst other things, to deliver the virtual filesystem contents - i.e. to synchronise the cluster-wide state to any node that is a member.
Such synchronisation happens after a node has "fallen behind" the rest of the cluster, e.g. while temporarily down and then going online again. We had previously been able to observe this staggered behaviour when setting up a bespoke Corosync probe node - at first without any other services, later only launching an instance of pmxcfs.
And the same synchronisation also happens in respect of any new member that is just being added - but do keep in mind that it can only happen once the Corosync service is up and running, which is not the case on a freshly installed (unclustered) node. Once ready - we will look closer at what this means below - it has to be started up.
Corosync configuration
Corosync is a service like any other, available from the Debian repositories as a package under the same name - corosync. It needs configuration - held in /etc/corosync/corosync.conf - which is present as a regular file on the local filesystem of each individual node. But on a Proxmox VE install, you are typically advised NOT to edit this file, but rather the identical one stored in /etc/pve/corosync.conf - that is, within the very virtual filesystem which utilises Corosync to deliver messages across the cluster of nodes to keep up the appearance of a synchronised filesystem.
Important
The service that runs the instance of pmxcfs itself is pve-cluster. For the purpose of this post, you may think of the two as interchangeable. It is important to keep in mind that this service, unlike Corosync, has to run on every node, even if non-clustered, in which case pmxcfs runs in the previously mentioned "local mode" without the Corosync dependency.
There are the individual - potentially disparate - configuration files local to each node, off which the Corosync service actually reads its configuration - and there is the single shared version in the virtual path, mounted only after a fully successful boot of a cluster node.
The configuration of a service vital for synchronising all configuration files - including that of the very service - is meant to be synchronised by the service itself. If you are sensing issues, you are onto something here.
Setup
If you were to install regular Corosync on Debian, it would come with a default corosync.conf that looks very rudimentary, but functional - and the service would be running (only the nodelist section shown, stripped of comments):
nodelist {
    node {
        name: node1
        nodeid: 1
        ring0_addr: 127.0.0.1
    }
}
This will NOT feel familiar, as Proxmox do not populate the configuration file until a new cluster is created (from a sole node), then keep adding records of new nodes and only have the service start when ready. Essentially, a stock Proxmox install does NOT contain any corosync.conf file, precisely so as not to start the service that is otherwise always present.
Upon cluster creation, whether through the GUI or via the CLI with the pvecm create <cluster name> command, a new corosync.conf is written to the shared /etc/pve path by the API handler.

The Corosync service then needs to be started, but before that - as it would otherwise fail - the local version of the configuration needs to be copied over from the shared location into the /etc/corosync/ directory.
Note
We are abstracting away other steps at this point, most importantly the cryptographic keys setup, including - for a cluster-founding node - an independent key for Corosync, authkey, that is generated and then used to guarantee authenticity and confidentiality of communication across the cluster for Corosync alone. These are out of scope here as they have no bearing on the aspects we are going to take a closer look at below.
Counterintuitively, the whole pmxcfs is restarted - it was running in the "local mode" (without Corosync) and needs to be able to start receiving messages (potentially from cluster nodes presumably about to be added soon after). It is pmxcfs - not the API handler - that copies the configuration over to the local directory of /etc/corosync/ upon its start. It is designed to notice the newly created corosync.conf in /etc/pve and act on it, with no regard to how it got there.
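Expressed as an (approximate) observable sequence on the founding node - not something to run by hand on a healthy system, merely what the stack performs for you after pvecm create:

# pmxcfs is restarted and, upon start, copies the shared file locally itself
systemctl restart pve-cluster
ls -l /etc/corosync/corosync.conf     # now present as a local copy
systemctl start corosync              # can finally start - it has a configuration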
Caution
This is exactly the reason why it is generally futile to go about editing the local corosync.conf files on individual nodes - they would be overwritten, under certain circumstances - which we will look at further below - once pmxcfs restarts.
Adding nodes
From the special case of a "single node cluster", the next natural step leads to adding the second (and further) nodes. When initiated from the joining node, either via the GUI or with pvecm add <existing node>, an API call is made to an already "clustered" node - one that has already had its Corosync set up, albeit with a trivial configuration of itself only. The call asks for the addition of the new member to the cluster configuration. This is just to make the already clustered node (or nodes - for already "grown up" clusters) expect another one to be showing up shortly as a member.

In the request, the joining node informs (any single) existing cluster node of its own details, and the response to the API call includes the new corosync.conf file (as well as the authkey, which is out of scope here, but without which it would be impossible to join) for the joining node to utilise in the same setup procedure as was described for when the first cluster node was created - the exception being that the configuration was received rather than newly generated. The service of pve-cluster needs to be restarted, and corosync needs to be started on the previously barren node.
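For reference, a hedged sketch of the same join as driven from the CLI - the address is a placeholder for any existing cluster member:

# run on the NEW, empty node
pvecm add 10.10.10.1

# roughly what follows on the joining node: corosync.conf and authkey arrive
# in the API response, then pve-cluster is restarted (pmxcfs leaves "local
# mode"), corosync is started for the first time, and pveproxy/pvedaemon are
# restarted - hence the temporary loss of the GUI session

# membership can afterwards be verified from any node
pvecm status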
Caution
Also restarted in this case are the services of pveproxy and pvedaemon - the service that is being proxied. Whilst this is not important insofar as cluster configuration is concerned, it is the very reason why the GUI becomes inaccessible on the freshly joined node - which is a pity, as the final GUI message would have definitely been possible to deliver.
Whilst the joining node would have now received the newly conceived corosync.conf (from a cluster member that incorporated it into the members list), there's a new issue to tackle that comes up when there's already more than a single node in the clustered setup. The new configuration needs to be ALSO shared amongst the original cluster nodes, otherwise only the joining node and the specific node that was targeted by the API call would know about this change - the remaining cluster nodes would not.
The API call handler that took care of constructing (and returning) the newly updated Corosync configuration (with the addition of the fresh member) also needs to ensure that this information gets distributed to any other nodes that still run Corosync with the old configuration. But it doesn't do that, or rather, it simply writes it into the shared /etc/pve/corosync.conf location - which at this point is still only accessible to the old nodes.
Self-distribution by pmxcfs
Whilst pmxcfs does not hesitate to pick up corosync.conf from its virtual filesystem database upon its (re)start and overwrite the local version from which Corosync actually reads its configuration, it provides one more not so intuitive feature.
As a virtual filesystem, it can sense when a file - such as corosync.conf - has changed and act on it. This means that not only does the configuration file get distributed across nodes (the original cluster members, without the new joining node) - as would, after all, any other file - but once this happens, each instance of pmxcfs on each node triggers an event: it copies the new shared file over the local version and then reloads the Corosync service. Again, this is how every instance of pmxcfs behaves, so it occurs on each node.
Perks with gotchas
Corosync itself would not react to a simple edit of the /etc/corosync/corosync.conf file; it would need to be reloaded or restarted. But the Proxmox stack triggers not only the overwrite - based on the new /etc/pve/corosync.conf - but also the reload, on any single file save event.
This warrants the unnatural requirement on how exactly one should edit the shared file:
The configuration will get updated automatically, as soon as the file changes. This means that changes which can be integrated in a running corosync will take effect immediately. Thus, you should always make a copy and edit that instead, to avoid triggering unintended changes when saving the file while editing.
Warning
The user is then advised on how exactly to first edit a copied placeholder file and only then move it into place - to ensure no partial file ends up distributed, as e.g. an editor's auto-save feature firing at the wrong moment could cause Corosync services across the cluster to fail. But there is more.
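The procedure boils down to editing a copy and only then moving it into place - roughly, as a sketch of the documented approach:

# work on a copy so that an editor auto-save never touches the live file
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
nano /etc/pve/corosync.conf.new     # make the change AND bump config_version

# review carefully, then move the finished file into place in a single step
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf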
If distributing the configuration of a distributed system by the system itself sounds like a recipe for a special broth, that's because it is. An obvious problem is that any nodes which were not online at the time of such a change would simply miss out on it, despite not having been meant to be "dropped".
Note
An administrator, whilst in charge of the configuration, might not exactly be aware which node might e.g. be undergoing a reboot at that moment, or worse - fail to go fully through such a change.
Another issue lies in the fact that should anything fail, whether on all or only some of the nodes, those affected would immediately become orphaned, i.e. without quorum. There is no simple rollback option, at all. This is particularly bad if those nodes were making use of High Availability (HA), as this would cause them to perpetually reboot - they would never be able to achieve quorum again without manual intervention, which is not exactly intuitive.
High Availability catch
At this point, one might wonder why the corosync.conf distribution, or at least a notification of the change, is not performed via regular API calls. Besides the fact that Proxmox have a high propensity to use their virtual filesystem for distributing everything, a series of API calls would not guarantee what Virtual Synchrony does, i.e. that the change appears atomic across all the nodes - well, at least the quorate ones.

The issue for Proxmox is the need to consider the intricacies of their HA stack again - if the propagation of such a change took longer (or the joining node suffered a technical problem mid-way through it), some nodes would be running on the new cluster members list while others on the old. Remaining in such a state for even just several seconds is intolerable, as active HA would go on rebooting such nodes, which in turn would never even receive or manage to complete such an API call. In fact, the calling node itself might just as likely have had the plug pulled on it during this transition.
One more quirk
Proxmox double down on their distribution approach and introduce a further requirement - or rather a warning, without much rationale - captured by the same documentation piece:
Always increment the config_version number after configuration changes; omitting this can lead to problems.
Indeed, this is what every automated (i.e. API call instigated) change to the cluster configuration does as well. But why is it so? The configuration version number held in the config_version field (not to be confused with the plain version field, which relates to the format of the file) is not a Proxmox construct; it is a regular Corosync staple:
By default it’s 0. Option is used to prevent joining old nodes with not up-to-date configuration. If value is not 0, and node is going for first time (only for first time, join after split doesn’t follow this rules) from single-node membership to multiple nodes membership, other nodes config_versions are collected. If current node config_version is not equal to highest of collected versions, corosync is terminated.
So this is meant to prevent old nodes from re-joining a cluster. But why does it need "always" increasing with the Proxmox stack then? Proxmox increment the config version number on any change simply because they HAVE TO, due to the internal logic of pmxcfs, which "hijacked" (or overloaded, if you will) the original intention behind this field.
When a newly saved corosync.conf is detected in /etc/pve, a completely artificial condition check determines whether it is going to overwrite the local node configuration file: it evaluates whether the version (literally - the value of config_version) in the shared file is higher than the one in the local file. If it is not, it simply refuses to do so, with a nondescript message:

local corosync.conf is newer

It would similarly fail if such a value were not present at all - which would otherwise constitute a completely valid configuration file for Corosync as such.
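A quick, purely illustrative way to see which side that check would favour on a given node:

# compare the version fields of the shared and the local configuration
grep config_version /etc/pve/corosync.conf /etc/corosync/corosync.conf

# or spot any divergence between the two outright
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf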
Point of no return
We know that a faulty configuration file would break the cluster apart, but as long as the config_version field got incremented "correctly", such a file would get distributed and be allowed to cause carnage - there is no additional prevention mechanism provided by Proxmox, nor any benefit from checking this field in relation to cluster health preservation.
The only thing left to consider: this is a last-resort effort to allow at least some way to override a faulty corosync.conf that got stuck inside the pmxcfs database - as that file would not be editable during a lost-quorum situation - and to achieve quorum with a manually provided local file.
Warning
Whilst it is possible to mount pmxcfs in local mode with the -l switch, there is a huge caveat to getting isolated access to a filesystem that is meant to be shared amongst multiple nodes. If any changes are made to such a single instance of the filesystem database, they would affect the versioning information within, which would then make it impossible to reconcile with all the other nodes in the cluster. If you ever decide to make such modifications, consider them only for orphaned nodes, or be prepared to manually distribute that new database across all the remaining nodes.
In any event, should you find yourself with a broken cluster configuration that already got distributed, even partially, it is a point of no return - there's no other way out than manually fixing up the local corosync.conf files on every single node (with a higher version number so as not to have them overridden) and restarting the services, with special considerations given to clusters with active HA.
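What such a manual fix-up might look like, node by node - a sketch only, assuming the HA stack has already been quiesced and a known-good configuration has been prepared:

# on EVERY node
systemctl stop pve-cluster corosync

# put the known-good configuration in place locally, with config_version bumped
# HIGHER than the broken copy held inside the pmxcfs database, so that the
# restarted pmxcfs does not overwrite it again
nano /etc/corosync/corosync.conf

systemctl start corosync
systemctl start pve-cluster

# once enough nodes have been fixed, quorum should return
pvecm status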
Real life bugs
But besides the fact that the approach of distributing a distributed system's new configuration using its old configuration - and hoping for the best, all within a time-frame short enough so as not to trigger HA-induced auto-reboots - is iffy, there are also palpable consequences to this, including ones previously reported by users.
Up to this point, we have mostly considered situations which are at least partially caused by the user, i.e. it would be possible to argue that the system is simply imperfect - not resilient to unexpected inputs. But if you ever find yourself in a situation where you make a change to the Corosync configuration which is otherwise completely valid (or just add a new node to the cluster, via the provided means) and get all your other nodes self-fenced, it might be just because of one of the consequences of the complexity described previously.
When there is sufficient delay between the moment the new corosync.conf got distributed (i.e. became visible in /etc/pve/ across all nodes) and the moment the local version of it got overwritten, a node will simply reload the (still not replaced) old local file and get dropped from the cluster. With HA active, this triggers a reboot, upon which the file finally gets overwritten just before the service starts with the new version. This was experienced by a real user, in fact probably by many, but not easily recognised as related.
Workarounds for good
This issue has existed since the very inception of Proxmox and, in fact, the workaround applied in the 2022 fix was to introduce an artificial delay of 1 second between the two events, which merely lowers the risk of encountering it.
The piece of code literally bears this warning to this day:
if (notify_corosync && old_version) {
    /*
     * sleep for 1s to hopefully allow new config to propagate
     * FIXME: actually query the status somehow?
     */
    sleep(1);
    /* tell corosync that there is a new config file */
    cfs_debug ("run corosync-cfgtool -R");
    int status = system("corosync-cfgtool -R >/dev/null 2>&1");
And the excerpt from the commit message:
if processing a corosync.conf update is delayed on a single node, reloading the config too early can have disastrous results (loss of token and HA fence). artificially delay the reload command by one second to allow update propagation in most scenarios until a proper solution (e.g., using broadcasting/querying of locally deployed config versions) has been developed and fully tested.
Summarised at the bump to next version:
* corosync.conf sync: add heuristic to wait for propagation of change to all
nodes before triggering a cluster wide config reload
And it made it into the final release notes of v7.2 merely as:
Fix race-condition between writing corosync.conf and reloading corosync on update
The bug is still present - it is a design flaw which is not trivially fixable. And whilst it would be possible to fix it within the current approach of using Corosync for all the messaging, there are other issues which make it not worthwhile - as opposed to changing the approach altogether.
Lacking cluster integrity
Whilst the versioning approach may indeed be used to somewhat reliably drop deleted cluster nodes (which is needed if the affected node itself is e.g. inaccessible), there is a much more reliable way to achieve the same.
Corosync provides for an authentication key that is shared across nodes, and in fact Proxmox do make use of it - it is the authkey file, also stored in /etc/pve/priv/ - again, for distribution. Rotating this key, since it is already in use, would be the most reliable way to prevent dropped, stale or even haywire nodes from disrupting the Corosync links. But this method is NOT used.
The key is simply never rotated - as a precaution. The most serious concern is clearly that should anything go wrong during the delicate operation, the cluster would remain broken. There’s no way to deliver a rollback message, certainly not when depending on the non-functional Corosync links.
This does potentially also have security implications, but they are out of scope here.
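Purely to illustrate what such a rotation would entail - this is NOT a supported procedure, the node names are placeholders, and the Proxmox-held copy under /etc/pve/priv/ would have to be kept consistent as well (an assumption based on the distribution behaviour described above):

# generate a fresh key on one node; corosync-keygen writes /etc/corosync/authkey
corosync-keygen

# distribute it out-of-band, e.g. over SSH, to every other node
for n in pve2 pve3; do
    scp /etc/corosync/authkey root@$n:/etc/corosync/authkey
done

# restart corosync everywhere within a short window - nodes holding the old
# and the new key cannot talk to each other, which is exactly the risk that
# keeps this from being automated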
The way out
First of all, a much better way of performing any updates to the Corosync configuration would need to rely on something other than the cluster being healthy. Only then could the cluster be resilient and even potentially self-healing. Not even rigorous checks on corosync.conf updates could guarantee that every single node will correctly reload its new configuration, so this rules out Corosync as a method to distribute its own configuration changes.
High Availability suspension
As HA is heavily dependent on quorum and can cause auto-reboots, which in extreme circumstances may cause the entire cluster to fail, the only sensible way to safely perform ANY Corosync link configuration changes would be to deactivate the HA stack prior to any such operation and only reactivate it afterwards. This is something completely missing from the GUI, and in fact missing from the stock implementation as such.
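In practice, quiescing the HA stack amounts to stopping its two services on every node - a sketch only; HA-managed resources should be dealt with (e.g. migrated or set to ignored) beforehand, and stopping the LRM before the CRM is the commonly advised order:

# on every node, before touching Corosync configuration
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# ... perform and verify the Corosync changes ...

# afterwards, again on every node
systemctl start pve-ha-crm
systemctl start pve-ha-lrm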
Control node philosophy
The concept of a control host - at least a temporary one - needs to be taken into account. Whilst this is something entirely alien to Proxmox VE today, it is not far-fetched even with just rudimentary SSH properly set up - which would be necessary for reasonable automation anyway, including rollbacks on a timeout, for instance.
A control host does not even need to be one of the cluster nodes; it could be a management host that is not an active part of the cluster, holding the backup configuration as well as the prepared new one, and distributing the latter via means other than the Corosync link, of course. After successful completion of its task, such a host might as well go offline.
Once that is in place - and only then - the Corosync service could simply be restarted on all nodes, the timing of which would not matter as much, and only once all of them (or at least a quorum-equivalent portion) have been successfully updated would HA be reactivated again. Nodes which were e.g. offline during such an update, but still configured to have it applied, could pick up the new configuration (possibly even on demand) later on from the control node or from other nodes that are already healthy, even without sharing a viable Corosync link.
Note
A sophisticated solution that makes use of the API could still allow for a master-less approach, i.e. at any point any of the nodes could become the controlling one for the time necessary for the update. But a much simpler approach for the average user - as you are on your own - would be e.g. one based on SSH and, in the case of a larger cluster, Ansible.
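A minimal sketch of such an SSH-based distribution from a control host - node names, paths and the exact ordering are assumptions to be adapted, and the prepared file must carry a config_version higher than whatever pmxcfs currently holds (per the behaviour described earlier):

# run from the control host; the node list and the prepared file are placeholders
NODES="pve1 pve2 pve3"
NEWCONF=./corosync.conf.new

# copy the new configuration to every node's LOCAL path first
for n in $NODES; do
    scp "$NEWCONF" root@$n:/etc/corosync/corosync.conf
done

# only once every copy succeeded, ask corosync to reload - the -R request is
# broadcast, and each node re-reads its own local file
ssh root@pve1 'corosync-cfgtool -R'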
Final notes
As much as some of the above might have created an illusion of complexity, setting up Corosync is as simple as populating a good configuration file on each host (and a shared authentication key) - by any reliable means.
Lots of users will have also noticed that they already have an "off-cluster node" - a QDevice would double as such easily. Coincidentally, it is also the one component not covered by the automated (i.e. provided by Proxmox) updating of its configuration, as it does not feature an instance of pmxcfs at all.
If you have followed some of the past posts here, you might now have a very good idea of how to manage clusters without any API calls - a concept best tackled by starting with deployment. We might have a look at one such example soon, beginning with deploying a fresh cluster.