Proxmox quality assurance

A glimpse at Proxmox Quality Assurance

Last updated May 21, 2025

What kind of testing procedures does Proxmox use, and how does your bug reporting fit into them? How consistent and thorough is regression testing before users get hold of a public package?

This post follows up on the previous finding that there is no difference in the eventual content of the no-subscription and test software repositories that Proxmox makes publicly available.

Routine

Every software house has some sort of testing routine (QA) to ensure that obviously bad versions of its packages never reach its users.

It starts with rudimentary unit tests that a developer is supposed to write and ship alongside newly written code. These also help catch regressions - unintended bugs that cause previously dependable features to stop working as they did before. Without them, an individual developer would typically test only the part they were newly implementing.
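To make the idea concrete, here is a minimal sketch of a unit test next to a regression test. It is written in Python purely for illustration; the function parse_duration and the bug it pins down are hypothetical and are not taken from any Proxmox code.

```python
# Hypothetical example: none of these names come from Proxmox code.

def parse_duration(text: str) -> int:
    """Convert a string such as '15m' or '90s' into whole seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    # Imagined earlier bug: an uppercase unit such as 'M' was rejected.
    text = text.strip().lower()
    if len(text) < 2 or text[-1] not in units or not text[:-1].isdigit():
        raise ValueError(f"invalid duration: {text!r}")
    return int(text[:-1]) * units[text[-1]]


def test_new_feature_hours():
    # Covers the newly written code path.
    assert parse_duration("2h") == 7200


def test_regression_uppercase_unit():
    # Pins the previously fixed bug so it cannot silently return.
    assert parse_duration("15M") == 900


def test_rejects_garbage():
    # Malformed input must raise instead of returning a bogus number.
    for bad in ("", "m", "15", "x5m"):
        try:
            parse_duration(bad)
        except ValueError:
            continue
        raise AssertionError(f"{bad!r} should have been rejected")


if __name__ == "__main__":
    test_new_feature_hours()
    test_regression_uppercase_unit()
    test_rejects_garbage()
    print("all tests passed")
```

The point of the second test is that it stays in the suite forever: any later change that reintroduces the old behaviour fails the build before a package is ever cut.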

Further integration testing would typically cover any unintended interactions across interfaces. These tests could still be automated scripts run routinely on every new build, but they could also be manual.

Then there are system tests performed on the full suite by actual testers, i.e. dedicated personnel who do not share the bias of the original developers. This may also involve automation, but it closely resembles the behaviour of real users.

This is all before the final User Acceptance Test (UAT) - something only a customer (in a typical scenario) can sign off on.

How well the first three are part of Proxmox culture is hard to determine, but following individual bug reports, it becomes clear that there are some deficiencies.

Proxmox do have a public Bugzilla instance, but it is apparent there is no fixed process to follow once bugs get fixed that would uniformly ensure full end-to-end testing in every individual case. The quality of work of individual developers can also vary vastly, e.g. rigorous unit tests are written for some new work, while other work has none at all, at least none published.

Unit tests

A prime example is pve-ha-manager, looking at its recent git log (excerpts only):

commit 34fe8e59eacb9107c76962ed12f6bea69195eb74 (HEAD -> master, origin/master, origin/HEAD)
Date:   Sun Nov 17 20:36:27 2024 +0100

    bump version to 4.0.6

commit 977ae288497fde04fb67bf25417ce54e77a29a63
Date:   Sun Nov 17 17:23:01 2024 +0100

    crm: get active if there are nodes that need to leave maintenance

commit 73f93a4f6b6662d106c32b433efabcc1f10dbc3a
Date:   Sun Nov 17 17:01:37 2024 +0100

    crm: get active if there are pending CRM commands

commit d0979e6dd064e6dc5a1292aa2c9b25c244500043
Date:   Sun Nov 17 16:35:22 2024 +0100

    env: add any_pending_crm_command method

commit afbfa9bafca0237785badb96f589524749fc937a
Date:   Sun Nov 17 16:34:48 2024 +0100

    tests: add more crm idle situations
    
    To test the behavior for when a CRM should get active or stay active
    (for a bit longer).
    
    These cases show the status quo, which will be improved on in the next
    commits.

commit ddd56db3463c3c7716072f6011070109df4a577a
Date:   Fri Oct 25 16:34:02 2024 +0200

    fix #5243: make CRM go idle after ~15 min of no service being configured

This was a bugfix in a non-trivial component relating to High Availability, committed on October 25, 2024. Unit tests were supplied only almost a month later, on November 17, 2024, and in the same sweep came more changes and finally the “bump version” commit, i.e. releasing the package, just over 3 hours after the last of that day's changes. The package was made public soon after.
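The intervals can be checked directly from the commit dates in the git log excerpt above. The following is a small Python sketch; the variable names are my own, only the timestamps are copied from the log.

```python
from datetime import datetime

# Format of the "Date:" lines in the git log excerpt, including UTC offset.
FMT = "%a %b %d %H:%M:%S %Y %z"

# Timestamps copied verbatim from the excerpt.
bugfix = datetime.strptime("Fri Oct 25 16:34:02 2024 +0200", FMT)
tests_added = datetime.strptime("Sun Nov 17 16:34:48 2024 +0100", FMT)
last_change = datetime.strptime("Sun Nov 17 17:23:01 2024 +0100", FMT)
version_bump = datetime.strptime("Sun Nov 17 20:36:27 2024 +0100", FMT)

# Offset-aware datetimes subtract correctly across the DST change.
fix_to_tests = tests_added - bugfix
change_to_bump = version_bump - last_change

print("bugfix -> tests added:      ", fix_to_tests)    # 23 days, 1:00:46
print("last change -> version bump:", change_to_bump)  # 3:13:26
```

So the fix sat for 23 days without accompanying tests, and the last functional change had a little over three hours before the version was bumped.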

Ad hoc tests

In another instance, an SSH bugfix that aimed to go all-in on a new intra-cluster communication setup was made in January 2024. Its impact covers migrations, replications and GUI proxying of console/shell connections, so quite a bit. A regular member of the development team (i.e. not a dedicated tester) was tasked with manually ad hoc testing a colleague's work (excerpt only):

 > Tested cluster creation with three new nodes on 8.1 and the patches                                         
 > Cluster creation and further ssh communication (eq. migration) worked                                       
 > flawless                                                                                                    
 
 What about the reinstallation of an existing node, or replacing                                               
 one, while keeping the same nodename scenario?                                                                
                                                                                                               
 As that was one of the main original reasons for this change here                                             
 in the first place.                                                                                           
                                                                                                               
 For the removal you could play through the documented procedure                                               
 and send a patch for update it accordingly, as e.g., the part                                                 
 about the node’s SSH keys remaining in the pmxcfs authorized_key                                              
 file would need some change to reflect that this is not true                                                  
 for newer setups (once this series is applied and the respective                                              
 packages got bumped and released).

This was then applied to public repositories in April 2024.

Then in May 2024, a user filed a bug report on a regression in QDevice setup regarding a “typo in command” - fixed in the next minor version in May 2024.

Another bug, in closely related code that had almost been forgotten during the change, was found only in October and fixed the same day, but the fix was not included until weeks after this post originally appeared at the end of December 2024, again with only ad hoc testing.

Completely forgotten patches

There are worse cases still, however. A good example is packages that conflict with Debian. Despite the conflict being discovered back in September 2024 - again only after a user reported it - and a patch being made readily available a month later, this was simply forgotten for good.

This is why, even as late as the end of January 2025, you will still encounter installation woes with Proxmox VE on top of Debian. This is despite Proxmox officially using Debian's repositories and installation on top of Debian being an official installation method, so conflict resolution within packages would be expected from the vendor as a norm.

No code reviews

Besides cases of easy-to-expose bugs that simply went unnoticed despite reports, there are major lapses at Proxmox when it comes to code reviews, most certainly of its historical code that was put together by only a small handful of original developers. One such piece of code, almost as old as Proxmox itself, could be causing silent configuration data corruption on your system to this day.

The takeaway

These are some of the testing procedures Proxmox use before releasing anything into their public repositories. However, the distinction between test packages and what makes its way into the no-subscription repository is blurred - eventually, they contain identical packages, after all. The final acceptance test (UAT) inevitably happens with the public - the widest user base possible - to offset any deficiencies that may have been overlooked, but this is part of the actual business model of Proxmox and it helps it stay free of any monetary cost to the user.

Post is also available as reStructuredText in a GitHub Gist.
Excuse limited formatting, absent referencing and missing media content.
Your feedback is welcome in comments therein.
