A glimpse at Proxmox Quality Assurance
This post follows up on the previous finding that there is no difference in the eventual content of the no-subscription and test software repositories as made publicly available by Proxmox.
Routine
Every software house has some sort of testing routine (QA) to ensure that obviously bad versions of their packages never reach their users.
It starts with rudimentary unit tests that a developer is supposed to write and have accompany their newly written code. These also help uncover regressions - unintended bugs that cause previously dependable features to stop working as they did before. Otherwise, an individual developer would typically be testing only the part that they were newly implementing.
Further integration testing would typically cover any unintended interactions across interfaces. These tests could still be automated scripts run routinely on every new build, but they could also be manual.
Then there are system tests performed with the full suite by actual testers, i.e. dedicated personnel who do not have the bias of the original developers. This may also involve automation, but closely resembling the behaviour of real users.
This is all before the final User Acceptance Test (UAT) - something only a customer (in a typical scenario) can sign off on.
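To make the distinction concrete, here is a minimal sketch in Python of a unit test that ships with new code next to a regression test that pins down older behaviour. The helper `parse_memory_mib` and its rules are entirely hypothetical and have nothing to do with any actual Proxmox code.

```python
import unittest


def parse_memory_mib(value: str) -> int:
    """Hypothetical helper: convert a size such as '512M' or '2G' into MiB."""
    units = {"M": 1, "G": 1024}
    number, unit = value[:-1], value[-1].upper()
    if unit not in units:
        raise ValueError(f"unknown unit in {value!r}")
    return int(number) * units[unit]


class ParseMemoryTest(unittest.TestCase):
    def test_gigabytes(self):
        # Unit test written together with the new code path.
        self.assertEqual(parse_memory_mib("2G"), 2048)

    def test_regression_lowercase_unit(self):
        # Regression test: pins down behaviour users already depend on,
        # so a later change cannot silently break it.
        self.assertEqual(parse_memory_mib("512m"), 512)

    def test_rejects_unknown_unit(self):
        with self.assertRaises(ValueError):
            parse_memory_mib("5X")
```

Saved to a file, this can be run with `python -m unittest`; the point is only that the regression case stays in the suite long after the feature it protects was written.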
How well the first 3 are part of Proxmox culture is hard to determine, but following individual bug reports, it becomes clear there are some deficiencies.
Proxmox do have a public Bugzilla instance, but it is apparent there’s no fixed process to follow once bugs get fixed that would ensure full end-to-end testing uniformly in every individual case. The quality of work of individual developers also varies vastly, e.g. there are rigorous unit tests written for some new work, while other work has none at all, at least none published.
Unit tests
A prime example is pve-ha-manager, looking at its recent git log (excerpts only):
commit 34fe8e59eacb9107c76962ed12f6bea69195eb74 (HEAD -> master, origin/master, origin/HEAD)
Date: Sun Nov 17 20:36:27 2024 +0100
bump version to 4.0.6
commit 977ae288497fde04fb67bf25417ce54e77a29a63
Date: Sun Nov 17 17:23:01 2024 +0100
crm: get active if there are nodes that need to leave maintenance
commit 73f93a4f6b6662d106c32b433efabcc1f10dbc3a
Date: Sun Nov 17 17:01:37 2024 +0100
crm: get active if there are pending CRM commands
commit d0979e6dd064e6dc5a1292aa2c9b25c244500043
Date: Sun Nov 17 16:35:22 2024 +0100
env: add any_pending_crm_command method
commit afbfa9bafca0237785badb96f589524749fc937a
Date: Sun Nov 17 16:34:48 2024 +0100
tests: add more crm idle situations
To test the behavior for when a CRM should get active or stay active
(for a bit longer).
These cases show the status quo, which will be improved on in the next
commits.
commit ddd56db3463c3c7716072f6011070109df4a577a
Date: Fri Oct 25 16:34:02 2024 +0200
fix #5243: make CRM go idle after ~15 min of no service being configured
This was a bugfix in a non-trivial component relating to High Availability, committed on October 25, 2024. Unit tests were supplied only on November 17, 2024, almost a month later, and in the same fell swoop came more changes and finally the “bump version” commit, i.e. the package was prepared for release just about 3 hours after the last changes of that day. The package was made public soon after.
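The gaps can be checked with a few lines of Python. This is only a sketch: the timestamps are transcribed by hand from the log excerpt above into ISO 8601, and the variable names are made up for illustration.

```python
from datetime import datetime

# Commit dates from the git log excerpt, rewritten as ISO 8601 with UTC offset.
bugfix = datetime.fromisoformat("2024-10-25T16:34:02+02:00")        # fix #5243
tests_added = datetime.fromisoformat("2024-11-17T16:34:48+01:00")   # tests: add more crm idle situations
last_change = datetime.fromisoformat("2024-11-17T17:23:01+01:00")   # last crm change
version_bump = datetime.fromisoformat("2024-11-17T20:36:27+01:00")  # bump version to 4.0.6

# Subtracting offset-aware datetimes accounts for the +0200 / +0100 change.
fix_to_tests = tests_added - bugfix
change_to_bump = version_bump - last_change

print(fix_to_tests.days)   # 23
print(change_to_bump)      # 3:13:26
```

So the tests trailed the fix by 23 days, and the version bump came a little over 3 hours after the last functional commit.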
Ad hoc tests
In another instance, an SSH bugfix that aimed to go all-in with a new intra-cluster communication setup (with impact on migrations, replications and GUI proxying of console/shell connections, so quite a bit) was made in January 2024. A regular member of the development team (i.e. not a dedicated tester) was tasked with manually ad hoc testing a colleague’s work (excerpt only):
> Tested cluster creation with three new nodes on 8.1 and the patches
> Cluster creation and further ssh communication (eq. migration) worked
> flawless
What about the reinstallation of an existing node, or replacing
one, while keeping the same nodename scenario?
As that was one of the main original reasons for this change here
in the first place.
For the removal you could play through the documented procedure
and send a patch for update it accordingly, as e.g., the part
about the node’s SSH keys remaining in the pmxcfs authorized_key
file would need some change to reflect that this is not true
for newer setups (once this series is applied and the respective
packages got bumped and released).
This was then applied to public repositories in April 2024.
Then in May 2024, a user filed a bug report on a regression in QDevice setup regarding a “typo in command”, which was fixed in the next minor version in May 2024.
Another bug, in closely related code that had almost been forgotten to be changed, was found only in October and fixed the same day. The fix was not included until weeks after this post originally appeared at the end of December 2024, again with only ad hoc testing.
Completely forgotten patches
There are still worse cases, however. A good example is packages conflicting with Debian. Despite the conflict being discovered back in September 2024 - again only after a user reported it - and a patch being made readily available a month later, it was simply forgotten for good.
This is why, even as late as the end of January 2025, you will still encounter installation woes with Proxmox VE on top of Debian. This is despite Debian repositories being used by Proxmox officially and installation on top of Debian being an official installation method, so conflict resolution within packages would be expected as the norm from the vendor.
No code reviews
Besides cases of easy-to-expose bugs that simply went unnoticed despite reports, there are major lapses at Proxmox when it comes to code reviews, most certainly of its historical code that was put together by only a small handful of original developers. One such lapse, almost as old as Proxmox itself, could be causing silent configuration data corruption on your system to this day.
The takeaway
These are some of the testing procedures Proxmox use before releasing anything into their public repositories. However, the distinction between what the test packages are and what makes its way into the no-subscription repository is blurred - eventually, they contain identical packages, after all. The final acceptance test (UAT) inevitably happens with the public - the widest user base possible - to offset any deficiencies that may have been overlooked, but this is part of the actual business model of Proxmox and it helps it stay free of any monetary cost to the user.