Proxmox VE 7.4 → 9.x Cluster Node Upgrade

Table of Contents

💡
This post assumes you're running Proxmox VE 7.4.x on the no-subscription repository, without Ceph.
  • Pre-flight Checks
  • Phase 1 - Backup the VM to /backup
  • Phase 2 - Session Safety
  • Phase 3 - Upgrade 7.4 to 8.x
  • Phase 4 - Reboot and Verify on 8.x
  • Phase 5 - Upgrade 8.x to 9.x
  • Phase 6 - Reboot and Verify on 9.x
  • Phase 7 - Post-Upgrade Cleanup
  • Rollback Plan
  • Gotchas

Pre-flight Checks

# Confirm version baseline
pveversion
# Expected: pve-manager/7.4-x

# Confirm cluster membership and quorum
pvecm status

# Confirm node is healthy
systemctl --failed
journalctl -p err -b --no-pager | tail -50

# Confirm /backup is writable and has space
df -h /backup
mount | grep /backup
touch /backup/.write-test && rm /backup/.write-test

# List VMs on this node
qm list
⚠️ Do not proceed if pvecm status shows the cluster as not quorate, or if systemctl --failed shows critical services down.
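
The quorum gate above can also be scripted. A minimal sketch, assuming `pvecm status` prints a `Quorate:` line followed by `Yes` or `No`; the helper name `is_quorate` is hypothetical and not part of Proxmox:

```shell
# Hypothetical helper: succeeds only if the text on stdin reports quorum.
# Assumes the status output contains a line like "Quorate:          Yes".
is_quorate() {
    awk '$1 == "Quorate:" { found = 1; ok = ($2 == "Yes") }
         END { exit (found && ok) ? 0 : 1 }'
}

# Usage on a node (commented out so the sketch has no side effects):
# pvecm status | is_quorate || { echo "Cluster not quorate, stopping"; exit 1; }
```

Reading from stdin keeps the check separate from the command that produces the output, so it can be tried against saved status text first.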

Phase 1 - Backup the VM to /backup

1.0 Run this from any other cluster node, replacing NODE with this node's name

pvecm delnode NODE

1.1 Get the node out of the cluster

systemctl stop pve-cluster && systemctl stop corosync
pmxcfs -l
rm -rf /etc/corosync/authkey /etc/corosync/corosync.conf /etc/pve/corosync.conf /etc/corosync/uidgid.d/
killall pmxcfs
sleep 2
systemctl start pve-cluster

Mask corosync

systemctl stop corosync
systemctl disable corosync
systemctl mask corosync

Confirm VMs still running

qm list
pvecm status   # should error with "no cluster info"

Clean leftover state in /etc/pve

for n in $(ls /etc/pve/nodes/ | grep -v "^$(hostname)$"); do
    rm -rf /etc/pve/nodes/$n
done

Wipe stale auth state

> /etc/pve/priv/known_hosts
> /etc/pve/priv/authorized_keys
rm -f /etc/pve/priv/authorized_keys.tmp.*
rm -f /etc/pve/priv/known_hosts.[a-zA-Z]*

Trim storage.cfg: keep only this node's local storages

nano /etc/pve/storage.cfg
# Keep: local, <node>, any local-backup
# Drop: every entry pinned to a different node
# Drop the `nodes <name>` constraint from kept entries
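
To see which entries need dropping before opening the editor, the file can be scanned for storages pinned to other nodes. A minimal sketch, assuming each stanza starts with an unindented `type: name` line and that a `nodes` property holds a comma-separated list; `foreign_storages` is a hypothetical helper name:

```shell
# Hypothetical helper: list storage IDs in a storage.cfg-style file that are
# pinned (via a "nodes" line) only to nodes other than the one given.
# Usage: foreign_storages <cfg-file> <this-node>
foreign_storages() {
    awk -v me="$2" '
        /^[a-z]+: / { id = $2 }              # stanza header, e.g. "dir: local"
        $1 == "nodes" && id != "" {
            n = split($2, list, ",")
            mine = 0
            for (i = 1; i <= n; i++) if (list[i] == me) mine = 1
            if (!mine) print id
        }
    ' "$1"
}

# Example (read-only): foreign_storages /etc/pve/storage.cfg "$(hostname)"
```

The helper only reads the file; removing the listed stanzas is still a manual edit.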

1.3 Mount /backup or add the folder

mkdir -p /backup/$(hostname)-snapshots

1.4 Add /backup as a Proxmox storage target

# Add /backup as a directory-type storage, restricted to this node, accepts vzdump backups
pvesm add dir local-backup \
    --path /backup \
    --content backup,iso,snippets \
    --nodes $(hostname) \
    --shared 0

# Verify
pvesm status | grep local-backup

1.5 Full vzdump backup to /backup

# Snapshot mode = no downtime, zstd = fast compression
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
    echo "=== Backing up VMID $vmid ==="
    vzdump $vmid \
        --mode snapshot \
        --compress zstd \
        --storage local-backup \
        --notes-template "Pre PVE 7.4→9.x upgrade {{node}} {{vmid}}"
done

# Verify backup file exists
ls -lh /backup/dump/vzdump-qemu-*.vma.zst
qm list | awk 'NR>1 {print $1}' | wc -l
ls /backup/dump/vzdump-qemu-*.vma.zst | wc -l
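
Comparing two counts shows that something is missing but not which guest. A minimal sketch that names the VMIDs without an archive, assuming the `vzdump-qemu-<vmid>-*.vma.zst` naming used above; `missing_backups` is a hypothetical helper:

```shell
# Hypothetical helper: print every VMID from the argument list that has no
# matching vzdump archive in the dump directory.
# Usage: missing_backups <dump-dir> <vmid>...
missing_backups() {
    local dir vmid
    dir=$1
    shift
    for vmid in "$@"; do
        # ls fails when the glob matches nothing
        ls "$dir"/vzdump-qemu-"$vmid"-*.vma.zst >/dev/null 2>&1 || echo "$vmid"
    done
}

# Example: missing_backups /backup/dump $(qm list | awk 'NR>1 {print $1}')
```

Empty output means every listed guest has at least one archive.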

1.6 Backup cluster and node configuration

mkdir -p /backup/host-config/$(hostname)-$(date +%F)
BACKUP_DIR=/backup/host-config/$(hostname)-$(date +%F)

# Cluster-aware config (in /etc/pve, FUSE-mounted from cluster DB)
tar czf $BACKUP_DIR/etc-pve.tgz /etc/pve 2>/dev/null

# Node-local critical configs
tar czf $BACKUP_DIR/etc-node.tgz \
    /etc/network/interfaces \
    /etc/hosts \
    /etc/hostname \
    /etc/resolv.conf \
    /etc/corosync \
    /etc/ssh \
    /etc/apt \
    /etc/fstab \
    /etc/default/grub \
    /etc/lvm 2>/dev/null

# Package list (for diffing later if something breaks)
dpkg -l > $BACKUP_DIR/dpkg-list-pre74.txt
apt-mark showhold > $BACKUP_DIR/apt-holds.txt

ls -lh $BACKUP_DIR
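
Because the tar commands above discard their error output, a file listing alone does not prove the archives are usable. A minimal sketch of a read-back check; `archive_ok` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if the gzip-compressed tarball can be
# listed end to end and contains at least one entry.
# Usage: archive_ok <file.tgz>
archive_ok() {
    local listing
    listing=$(tar tzf "$1" 2>/dev/null) || return 1
    [ -n "$listing" ]
}

# Example:
# for f in "$BACKUP_DIR"/*.tgz; do archive_ok "$f" || echo "BROKEN: $f"; done
```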

1.7 Sanity-check the backup is restorable

# Test the integrity of the compressed archives without extracting them
ls -lh /backup/dump/
zstd -t /backup/dump/*.zst && echo "Archive integrity OK"

# Confirm Proxmox can see the backup in its UI listing
pvesm list local-backup

Phase 2 - Session Safety

2.1 Open screen

screen
⚠️ Mandatory. A dropped SSH session during dist-upgrade leaves dpkg in a broken state.

2.2 Quiesce scheduled jobs

# Stop scheduled backups, replication, etc. for the duration
systemctl stop pvescheduler 2>/dev/null || systemctl stop pve-daily-update.timer
systemctl stop cron
Restart these after Phase 6 verification passes.

Phase 3 - Upgrade 7.4 to 8.x

3.0 Add Proxmox repos if not there

grep -rq "pve-no-subscription" /etc/apt/sources.list.d/ /etc/apt/sources.list 2>/dev/null || {
  sed -i 's/^deb/#deb/' /etc/apt/sources.list.d/pve-enterprise.list
  cat > /etc/apt/sources.list <<'EOF'
deb http://deb.debian.org/debian bullseye main contrib
deb http://deb.debian.org/debian bullseye-updates main contrib
deb http://security.debian.org/debian-security bullseye-security main contrib
EOF
  echo "deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription" \
    > /etc/apt/sources.list.d/pve-no-subscription.list
}

3.1 Patch 7.4 to absolute latest

apt update
apt dist-upgrade
# Review prompts, do NOT use -y

3.2 Run the official 7-to-8 checker

pve7to8 --full
⚠️ Do not proceed if any check returns FAIL. Common fixes:
  • Stale corosync config: pull from another working node
  • Old GPG keys: re-add Proxmox repo key with signed-by
  • Deprecated storage configs: edit /etc/pve/storage.cfg
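
The "no FAIL" rule can be turned into a number. A minimal sketch, assuming the checker prints one line starting with `FAIL` per failed check; if the tool colours its output when piped, the pattern would need adjusting. `count_fails` is a hypothetical helper:

```shell
# Hypothetical helper: count lines that start with "FAIL" in checker output
# read from stdin. Prints 0 when there are none.
count_fails() {
    # grep -c exits non-zero on zero matches; keep the function's status at 0
    grep -c '^FAIL' || true
}

# Example:
# fails=$(pve7to8 --full | count_fails)
# [ "$fails" -eq 0 ] || echo "$fails check(s) failed, do not continue"
```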

3.3 Swap repositories: Bullseye → Bookworm

# Debian security suite: normalise any legacy "bullseye/updates" entry first,
# otherwise the global rename below would turn it into an invalid "bookworm/updates"
sed -i 's|bullseye/updates|bullseye-security|g' /etc/apt/sources.list

# Debian base repos (this also turns bullseye-security into bookworm-security)
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list

# Proxmox no-subscription repo (use enterprise line if you have a sub)
cat > /etc/apt/sources.list.d/pve-install-repo.list <<'EOF'
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
EOF

# Remove enterprise stub if no subscription
[ ! -f /etc/apt/sources.list.d/pve-enterprise.list ] || \
    rm /etc/apt/sources.list.d/pve-enterprise.list

# Refresh Proxmox release GPG key under bookworm scheme
wget -qO /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg \
    https://enterprise.proxmox.com/debian/proxmox-release-bookworm.gpg

# Verify the key's SHA-512 checksum
sha512sum /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg
# Expected: 7da6fe34168adc6e479327ba517796d4702fa2f8b4f0a9833f5ea6e6b48f6507a6da403a274fe201595edc86a84463d50383d07f64bdde2e3658108db7d6dc87
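
Before editing sources files in place, the suite rename can be previewed. A minimal sketch that prints the rewritten file to stdout and leaves the original untouched; `preview_suite_swap` is a hypothetical helper, not an APT tool:

```shell
# Hypothetical helper: print a sources.list with the suite renamed, without
# touching the file. A legacy "<old>/updates" security entry is mapped to
# "<new>-security" before the plain rename runs.
# Usage: preview_suite_swap <file> <old-suite> <new-suite>
preview_suite_swap() {
    sed -e "s|$2/updates|$3-security|g" -e "s|$2|$3|g" "$1"
}

# Example:
# preview_suite_swap /etc/apt/sources.list bullseye bookworm | diff /etc/apt/sources.list -
```

Piping the preview into `diff` shows exactly which lines an in-place `sed -i` would change.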

3.4 Update package index

apt update
⚠️ Fix any "NO_PUBKEY" or signature errors here, before dist-upgrade.

3.5 Run the 7→8 dist-upgrade

# apt will ask whether to restart services during the upgrade; either answer is fine
apt dist-upgrade

Conf file prompts to expect:

File                                          Recommended action
/etc/ssh/sshd_config                          Keep local (N) to preserve your hardening
/etc/lvm/lvm.conf                             Keep local (N)
/etc/default/grub                             Keep local (N)
/etc/issue, /etc/issue.net                    Take maintainer's
/etc/corosync/*                               Keep local (N); never overwrite cluster config
Anything in /etc/pve/*                        Won't prompt, cluster-managed
/etc/apt/sources.list.d/pve-enterprise.list   Take maintainer's

3.6 Verify userspace upgrade succeeded

pveversion
# Expected: pve-manager/8.x

apt list --upgradable
# Should be empty or near-empty

systemctl --failed

Phase 4 - Reboot and Verify on 8.x

4.1 Pre-reboot snapshot of running services

qm list > /backup/host-config/vms-pre-reboot-8x.txt
pvecm status > /backup/host-config/cluster-pre-reboot-8x.txt

4.2 Shut down the VM cleanly

for vmid in $(qm list | awk '$3=="running" {print $1}'); do
    echo "Shutting down $vmid..."
    qm shutdown $vmid --timeout 180 &
done
# Wait for the background shutdowns to return, then verify
wait
qm list

# Force-stop any stragglers (rare, but covers stuck guests)
for vmid in $(qm list | awk '$3=="running" {print $1}'); do
    echo "Force-stopping stuck VM $vmid..."
    qm stop $vmid
done
⚠️ Use shutdown, not stop: shutdown sends an ACPI signal and lets the guest flush its filesystems.
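
It helps to record which guests were actually running before shutting them down, so that only those are started again later. A minimal sketch that filters `qm list` output by its STATUS column, the same column the loops above test; `running_vmids` and the output path are hypothetical:

```shell
# Hypothetical helper: print the VMIDs whose STATUS column is "running",
# reading `qm list` output on stdin (the header line is skipped).
running_vmids() {
    awk 'NR > 1 && $3 == "running" { print $1 }'
}

# Example (hypothetical path):
# qm list | running_vmids > /root/running-before-reboot.txt
```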

4.3 Reboot

systemctl reboot

4.4 Post-reboot verification on 8.x

# === Pre-flight: check for missing ISO references ===
# PVE 8+ refuses to start VMs with missing media, so fix BEFORE bulk start

echo "=== Checking for VMs with missing ISO references ==="

MISSING_FOUND=0
for conf in /etc/pve/qemu-server/*.conf; do
    vmid=$(basename $conf .conf)
    
    # Check each ISO reference in this config
    grep -E "iso/" $conf | while read line; do
        ref=$(echo "$line" | grep -oE "[A-Za-z0-9_-]+:iso/[^,]+")
        [ -z "$ref" ] && continue
        
        path=$(pvesm path "$ref" 2>/dev/null)
        if [ -z "$path" ] || [ ! -f "$path" ]; then
            slot=$(echo "$line" | cut -d: -f1)
            echo "❌ VM $vmid slot $slot references missing: $ref"
            echo "   Fix: qm set $vmid --$slot none,media=cdrom"
        fi
    done
done

echo ""
echo "Review and run the suggested 'qm set' commands before starting VMs"

pveversion -v                  # confirm pve-manager 8.x; kernel 6.8 on current 8.x (6.2 or 6.5 on early 8.0/8.1)
uname -r                       # confirm new kernel loaded
systemctl --failed
journalctl -p err -b --no-pager | tail -50

# === Start the VMs that were running pre-reboot (needed only for guests without "Start at boot") ===
for vmid in $(awk 'NR>1 && $3=="running" {print $1}' /backup/host-config/vms-pre-reboot-8x.txt); do
    echo "Starting $vmid..."
    qm start $vmid
    sleep 5   # stagger to avoid I/O storm on shared storage
done

qm list

4.5 Pre-start audit: empty CD-ROMs

⚠️ PVE 9 rejects none,media=cdrom references with a "host_cdrom requires a file name" error.
Find affected VMs and remove the CD-ROM device:

for conf in /etc/pve/qemu-server/*.conf; do
    vmid=$(basename $conf .conf)
    grep -E "^(ide|sata|scsi)[0-9]+: none,media=cdrom" $conf | while read line; do
        slot=$(echo "$line" | cut -d: -f1)
        echo "VM $vmid: removing empty CD-ROM at $slot"
        qm set $vmid --delete $slot
    done
done

Or fix individually via GUI: VM → Hardware → CD/DVD Drive → Remove.
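
Since the loop above deletes devices as it finds them, a read-only pass first shows what would be touched. A minimal sketch using the same line pattern; `empty_cdroms` is a hypothetical helper:

```shell
# Hypothetical helper: print "<vmid> <slot>" for every empty CD-ROM drive found
# in the given VM config files, without changing anything.
# Usage: empty_cdroms <conf-file>...
empty_cdroms() {
    local conf vmid
    for conf in "$@"; do
        [ -f "$conf" ] || continue
        vmid=$(basename "$conf" .conf)
        awk -F: -v id="$vmid" \
            '/^(ide|sata|scsi)[0-9]+: none,media=cdrom/ { print id, $1 }' "$conf"
    done
}

# Example: empty_cdroms /etc/pve/qemu-server/*.conf
```

Note that matching lines inside snapshot sections of a config file would be listed too, so review the output before acting on it.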


Phase 5 - Upgrade 8.x to 9.x

5.1 Patch 8.x to the absolute latest first, if updates are still pending

rm -f /etc/apt/sources.list.d/pve-enterprise.list
apt update
apt dist-upgrade
pveversion
# Expected: pve-manager/8.4-x or higher

5.2 Run the 8-to-9 checker

pve8to9 --full
⚠️ Same rule: zero FAIL items before continuing.

5.3 If the checker reports mixed repositories or a deprecated sysctl config, do the following

sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list.d/pve-no-subscription.list
grep -vE '^\s*#|^\s*$' /etc/sysctl.conf > /etc/sysctl.d/99-local.conf
cat > /etc/sysctl.conf <<'EOF'
# Settings moved to /etc/sysctl.d/99-local.conf
# This file is kept empty per Debian convention.
EOF
sysctl --system
⚠️ Re-run pve8to9 --full after these fixes. Same rule: zero FAIL items before continuing.

5.4 Take a fresh vzdump before the second jump

VMID=$(qm list | awk 'NR>1 {print $1}')
vzdump $VMID \
    --mode snapshot \
    --compress zstd \
    --storage local-backup \
    --notes-template "Pre PVE 8.x→9.x upgrade {{node}} {{vmid}}"

5.5 Swap repositories: Bookworm → Trixie

# Debian base and security suites (bookworm-security becomes trixie-security in the same pass)
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list

# Proxmox repo for PVE 9
cat > /etc/apt/sources.list.d/pve-install-repo.list <<'EOF'
deb http://download.proxmox.com/debian/pve trixie pve-no-subscription
EOF

# If using Ceph, switch its repo too; check current Ceph version first
# ceph -v
# Then update /etc/apt/sources.list.d/ceph.list to the matching trixie line

# Add trixie GPG key
wget -qO /etc/apt/trusted.gpg.d/proxmox-release-trixie.gpg \
    https://enterprise.proxmox.com/debian/proxmox-release-trixie.gpg
    
rm -f /etc/apt/sources.list.d/pve-enterprise.list
rm -f /etc/apt/sources.list.d/pve-enterprise.sources
rm -rf /etc/apt/sources.list.d/pve-no-subscription.list
rm -f /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg
⚠️ Verify the trixie key fingerprint against the official Proxmox docs at the time of upgrade, because keys rotate.
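
The key check can be made to fail loudly instead of relying on comparing long hex strings by eye. A minimal sketch, assuming the expected SHA-512 digest is copied from the official Proxmox documentation; the digest is deliberately left as a placeholder here and `verify_sha512` is a hypothetical helper:

```shell
# Hypothetical helper: succeed only if the file's SHA-512 digest equals the
# expected hex string.
# Usage: verify_sha512 <file> <expected-hex-digest>
verify_sha512() {
    local actual
    actual=$(sha512sum "$1" 2>/dev/null | awk '{ print $1 }')
    [ -n "$actual" ] && [ "$actual" = "$2" ]
}

# Example (EXPECTED must come from the official docs, it is not filled in here):
# verify_sha512 /etc/apt/trusted.gpg.d/proxmox-release-trixie.gpg "$EXPECTED" \
#     || echo "Key checksum mismatch, do not run apt update"
```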

5.6 Update package index

apt update

5.7 Run the 8→9 dist-upgrade

apt dist-upgrade

The same conf-file rules as in Phase 3.5 apply. Keep the local version of anything you've customized.

5.8 Verify userspace upgrade

pveversion
# Expected: pve-manager/9.x

systemctl --failed

Phase 6 - Reboot and Verify on 9.x

6.0 Capture full /etc/pve state

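BACKUP_DIR=/backup/host-config/$(hostname)-$(date +%F)
mkdir -p $BACKUP_DIR
# Separate file name so the pre-upgrade archive from Phase 1.6 is not overwritten
tar czf $BACKUP_DIR/etc-pve-9x.tgz /etc/pve 2>/dev/null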

6.1 Shut down VM cleanly

for vmid in $(qm list | awk '$3=="running" {print $1}'); do
    echo "Shutting down $vmid..."
    qm shutdown $vmid --timeout 180 &
done
# Wait for the background shutdowns to return, then verify
wait
qm list

# Force-stop any stragglers (rare, but covers stuck guests)
for vmid in $(qm list | awk '$3=="running" {print $1}'); do
    echo "Force-stopping stuck VM $vmid..."
    qm stop $vmid
done

6.2 Reboot

systemctl reboot

# After the node is back up, log in again and take a fresh inventory
qm list > /root/vms-to-restart.txt

6.3 Join the cluster

💡
NOT TESTED, proceed with caution.
# === Step 1: Confirm VMs are stopped (Phase 6.1 + reboot did this) ===
qm list
# All should be "stopped" or "paused", not "running"

# === Step 2: Backup EVERYTHING that pvecm add will overwrite ===
mkdir -p /root/pve-pre-join-backup
BACKUP=/root/pve-pre-join-backup

# VM and CT configs (the "host already contains virtual guests" blocker)
mkdir -p $BACKUP/qemu-server $BACKUP/lxc
mv /etc/pve/qemu-server/*.conf $BACKUP/qemu-server/ 2>/dev/null
mv /etc/pve/lxc/*.conf $BACKUP/lxc/ 2>/dev/null

# ⚠️ storage.cfg WILL be overwritten, so back it up
cp /etc/pve/storage.cfg $BACKUP/storage.cfg

# Other things pvecm add may replace (per the dev's warning)
cp -r /etc/pve/firewall $BACKUP/firewall 2>/dev/null
cp /etc/pve/jobs.cfg $BACKUP/jobs.cfg 2>/dev/null
cp /etc/pve/user.cfg $BACKUP/user.cfg 2>/dev/null
cp /etc/pve/datacenter.cfg $BACKUP/datacenter.cfg 2>/dev/null
cp -r /etc/pve/sdn $BACKUP/sdn 2>/dev/null

# Verify pvecm add precondition met
ls /etc/pve/qemu-server/    # MUST be empty
ls /etc/pve/lxc/            # MUST be empty

# === Step 3: Unmask corosync (was masked in Phase 1.1) ===
systemctl unmask corosync
systemctl status corosync   # inactive, condition unmet: this is correct

# === Step 4: Join Cluster ===
THIS_IP=$(ip -4 -br addr show vmbr0 | awk '{print $3}' | cut -d/ -f1)
echo "Joining as IP: $THIS_IP"

# CLUSTERIP must hold the IP of an existing cluster member; set it before running this
pvecm add "$CLUSTERIP" \
    --link0 "$THIS_IP" \
    --use_ssh

# Wait for cluster sync
sleep 10
pvecm status
pvecm nodes

# === Step 5: Restore VM configs (now into cluster-replicated pmxcfs) ===
mv $BACKUP/qemu-server/*.conf /etc/pve/qemu-server/ 2>/dev/null
mv $BACKUP/lxc/*.conf /etc/pve/lxc/ 2>/dev/null

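# Verify replication on the cluster member. /etc/pve/qemu-server is a per-node
# symlink, so look under this node's directory in the replicated tree
sleep 5
ssh "root@$CLUSTERIP" "ls /etc/pve/nodes/$(hostname)/qemu-server/"
# Should show all your restored VMIDs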

# === Step 6: Restore storage.cfg entries (MERGE, do not replace) ===
# You need to ADD: this node's local-* and storage entries

# View what Cluster has after join
cat /etc/pve/storage.cfg

# View what you backed up
cat $BACKUP/storage.cfg

# Manually merge: append your node-specific storage entries
nano /etc/pve/storage.cfg
# Add stanzas like:
#   dir: STORAGENAME
#       path /backup
#       content images,rootdir,vztmpl,iso,backup,snippets
#       nodes NODENAME
#       prune-backups keep-all=1
#       shared 0
#
#   dir: local-backup
#       path /backup
#       content backup,iso,snippets
#       nodes NODENAME
#       shared 0

pvesm status   # verify all storages active

# === Step 7: Restore firewall/jobs/etc if you had them ===
# Only restore what's NOT cluster-conflicting. For example:
# - Firewall IPSets/rules: usually merge cleanly
# - jobs.cfg (vzdump schedules): merge node-specific entries only
# - user.cfg: ⚠️ DO NOT replace cluster's; it has cluster's users now

diff $BACKUP/jobs.cfg /etc/pve/jobs.cfg 2>/dev/null
# If you had vzdump schedules, manually re-add them to /etc/pve/jobs.cfg

# === Done: proceed to Phase 6.4 to start VMs ===

6.4 Post-reboot verification on 9.x

pveversion -v
uname -r                       # kernel 6.14 or newer (PVE 9.0 ships 6.14)
systemctl --failed
journalctl -p err -b --no-pager | tail -100

# === Start every VM listed in the inventory taken in Phase 6.2 ===
for vmid in $(awk 'NR>1 {print $1}' /root/vms-to-restart.txt); do
    qm start $vmid
    sleep 5
done

qm list

6.5 Re-enable scheduled jobs

systemctl start pvescheduler 2>/dev/null || systemctl start pve-daily-update.timer
systemctl start cron

Phase 7 - Post-Upgrade Cleanup

7.1 Remove old kernels

# List installed kernels
dpkg -l 'pve-kernel-*' 'proxmox-kernel-*' | awk '/^ii/ {print $2}'

# Use Proxmox's kernel tooling
proxmox-boot-tool kernel list
apt autoremove --purge

7.2 Drop the pre-upgrade snapshot (only after VM is confirmed healthy for several days)

VMID=$(qm list | awk 'NR>1 {print $1}')
qm listsnapshot $VMID
qm delsnapshot $VMID pre_upgrade_YYYYMMDD

7.3 Final state verification

pveversion -v
pve8to9 --full   # should report nothing actionable
qm list
df -h /backup

Rollback Plan

Scenario A: Upgrade fails mid-dist-upgrade, VM still running

# Try to recover dpkg first
dpkg --configure -a
apt -f install
apt dist-upgrade

If unrecoverable, shut down the VM, restore from vzdump on a working PVE node.

Scenario B: Reboot fails, node won't come up

# From Proxmox boot menu, select previous kernel
# Or boot from Debian rescue ISO and:
mount /dev/<root> /mnt
chroot /mnt
# Re-run: dpkg --configure -a, apt -f install

Scenario C: Node boots but cluster join is broken

# Restore cluster config from /backup
tar xzf /backup/host-config/<dated>/etc-node.tgz -C /
systemctl restart corosync pve-cluster
pvecm status

Scenario D: VM disk corrupted

# Restore from vzdump
VMID=<id>
qmrestore /backup/dump/vzdump-qemu-${VMID}-*.vma.zst $VMID --force --storage <target>

Scenario E: pvecm add fails or leaves cluster in inconsistent state

# === On the cluster ===
# Forget the failed-to-join node
pvecm delnode <this-node>

# === On the joining node ===
# Restore pre-join state
systemctl stop pve-cluster corosync
pmxcfs -l
rm -rf /etc/corosync/* /etc/pve/corosync.conf
killall pmxcfs
sleep 2
systemctl start pve-cluster

# Restore configs from your backup
mv /root/pve-pre-join-backup/qemu-server/*.conf /etc/pve/qemu-server/
mv /root/pve-pre-join-backup/lxc/*.conf /etc/pve/lxc/
cp /root/pve-pre-join-backup/storage.cfg /etc/pve/storage.cfg
# ... etc

# Re-mask corosync
systemctl mask corosync

# Verify VMs visible again
qm list

# Diagnose what went wrong, then retry Phase 6.3

Gotchas

  • ⚠️ Never skip pve7to8 or pve8to9. They catch incompatible storage configs, stale repos, dead packages, and corosync version mismatches before they bite during dist-upgrade.
  • ⚠️ Do not run apt upgrade; only apt dist-upgrade. Plain upgrade never removes packages, so it holds back anything whose dependencies changed and leaves the node half-upgraded.
  • ⚠️ apt-key is deprecated in Bookworm and removed in Trixie. Any third-party repo using legacy /etc/apt/trusted.gpg will warn; migrate to /etc/apt/keyrings/ with signed-by= syntax.
  • ⚠️ Conf file prompts are not auto-answered by -y. Watch them.
  • ⚠️ Mixed-version cluster is transient. Don't run a cluster on PVE 7 + PVE 9 nodes for weeks: the pveproxy GUI will show stale data and live migration between major versions is not supported.
  • ⚠️ Old LXC templates (CentOS 7, Debian 9, Ubuntu 18.04) often fail to start under PVE 9's kernel 6.14+ and cgroup v2. If the node runs only VMs and no LXC containers this is a non-issue, but worth noting.
  • ⚠️ Corosync version skew between cluster nodes during a rolling upgrade is tolerated only short-term. Don't leave this node on a different corosync major than the rest of the cluster for more than a maintenance window.
  • ⚠️ ZFS root + secure boot: if proxmox-boot-tool refresh complains about stale ESPs, fix with proxmox-boot-tool init /dev/<esp-partition> before rebooting.
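
For the apt-key gotcha, this is roughly what a migrated third-party entry looks like. A minimal sketch that only prints the one-line source entry; the keyring path and `repo.example.com` are placeholders, and `signed_by_line` is a hypothetical helper:

```shell
# Hypothetical helper: print a one-line APT source entry bound to a dedicated
# keyring through the signed-by option.
# Usage: signed_by_line <keyring-path> <repo-url> <suite> <component>
signed_by_line() {
    printf 'deb [signed-by=%s] %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Example with placeholder values (repo.example.com is not a real repository):
# signed_by_line /etc/apt/keyrings/example.gpg https://repo.example.com/debian bookworm main \
#     > /etc/apt/sources.list.d/example.list
```

Binding each repository to its own keyring means a key is trusted only for that repository rather than for every source on the system.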