DSM got broken after Proxmox 6.x to 7.x migration. Then I fixed it but it gets broken again and again



Hi all,

 

DSM gets "continually broken" (I mean, I fix it but then it gets broken again) after a Proxmox 6 to 7 upgrade. Jun's loader shows (serial terminal):

"lz failed 9
alloc failed
Cannot load /boot/zImage"

 

But please, let me first explain all details and read until the end because the situation is very strange...

 

I'm using JUN'S LOADER v1.03b - DS3617xs in a Proxmox VM (on HP Gen8), with DSM 6.2.3-25426. I've been using this config for months without any problem on Proxmox **6.x** (this seems relevant).

 

The other day I decided to upgrade Proxmox to **7.x**. I upgraded Proxmox and rebooted. All my VMs booted up perfectly... except the DSM one :(

 

I observed that I couldn't reach my network shares at DSM, and after a quick investigation, I discovered that DSM booted up in "installation mode", showing this message in web interface:

"We've detected that the hard drives of your current DS3617xs has been moved from a previous DS3617xs, and installing a newer DSM is required before continuing.".

 

I thought the DSM partition might have been corrupted in some way or (most likely) Proxmox 7 introduced some kind of "virtual hw change", so now DSM thinks it has been booted on different hw. This last option is very plausible because Proxmox 7 uses qemu 6.0, while Proxmox 6 (latest) uses qemu 5.2. Other changes may also have been introduced in the new version of Proxmox (for instance, I've read something about the assigned MAC of a bridge interface being different).

 

What I did was:

1/ Power off DSM VM.

2/ Back up partition 1 for all my 4 disks (i.e. the md0 array which contains the DSM OS).

3/ Power on DSM VM.

4/ I followed instructions in the web interface, chose "Migrate" (which basically keeps my data and config untouched), selected a manual installation of DSM and uploaded the .pat corresponding to the very same version I was already running before the problem, i.e. DSM_DS3617xs_25426.pat (DSM 6.2.3-25426). I didn't want to downgrade, and of course, I shouldn't upgrade because next version is 6.2.4, which is incompatible with Jun's loader.

5/ Migration got finished, DSM rebooted and... FIXED!!! :-) DSM was working again with no loss of data nor config.
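
For anyone wanting to script step 2, here is a minimal Python sketch that only builds the `dd` commands for copying partition 1 of each disk to an image file. The backup directory, the `backup_commands` helper and the placeholder disk ids are assumptions for illustration; the post does not say which tool was actually used for the backup. The VM should be powered off before running anything like this.

```python
import shlex


def backup_commands(disks, backup_dir, partition=1):
    """Build one dd command per disk that copies one partition to an image file.

    `disks` are /dev/disk/by-id paths; udev names partitions "<disk>-partN".
    Nothing is executed here, the commands are only returned for review.
    """
    commands = []
    for disk in disks:
        source = f"{disk}-part{partition}"
        name = disk.rsplit("/", 1)[-1]
        target = f"{backup_dir}/{name}-part{partition}.img"
        commands.append(
            ["dd", f"if={source}", f"of={target}", "bs=4M", "conv=fsync", "status=progress"]
        )
    return commands


if __name__ == "__main__":
    # Placeholder ids in the same redacted form as the VM config; replace with real ones.
    disks = [
        "/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4NXXXXXXX",
        "/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4NYYYYYYY",
    ]
    for command in backup_commands(disks, "/root/dsm-backup"):  # hypothetical directory
        print(shlex.join(command))
```

Restoring would be the same commands with `if=` and `of=` swapped, so it is worth double-checking each printed line before running it.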

 

*But* another problem arose later... When my server (Proxmox) got rebooted again, the DSM VM ended up broken again, but this time in a very different way: I couldn't ping my DSM VM and, after investigation, I concluded the DSM kernel was not being loaded at all. Indeed, I attached a serial terminal to the DSM VM and could see Jun's loader stuck at the very beginning with these messages:

"lz failed 9
alloc failed
Cannot load /boot/zImage"

 

No idea why this is happening nor what these messages really mean (well, it seems obvious the kernel is not being loaded, but I don't know why)!!

 

I managed to fix it again (yeah xD) by:

1/ Power off DSM VM.

2/ Restore partition 1 for all my disks from the backup I had just taken when solving the former problem :)

3/ Power on DSM VM

4/ I confirmed loader worked again and that I got to the same point where DSM needed a migration

5/ I "migrated" exactly in the same way I had done minutes before :). FIXED!!

 

What's the problem then? Easy... every time I reboot my server (so Proxmox reboots), my DSM VM gets broken again with the second error ("lz failed..." etc.), i.e. the loader's kernel is not being loaded. I can temporarily fix it, but sooner or later I'll need to reboot Proxmox again and... boom again :-(
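
One thing that could be checked here (a diagnostic idea only, not a confirmed cause) is whether the synoboot image file itself changes between a working state and a broken one. A minimal Python sketch that hashes the image so the value can be recorded before and after a migration or a host reboot; the `sha256_of` helper is just an illustration, and the path is the one from the VM config in this post.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Path taken from the args line of the VM config; run with the VM powered off.
    image = "/var/lib/vz/images/100/synoboot_103b_ds3617_roman.img"
    print(image, sha256_of(image))
```

If the digest differs between the working and the broken state, the loader image was written to in between; if it is identical, the image can probably be ruled out.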

 

Are any of these problems familiar to you? Any clue about how to solve this or, at least, some ideas I should focus my investigation on?

 

PLEASE, help!! :_(

 

 

PS: My Proxmox VM config (a.k.a. qemu config) (with some info redacted):

 

 

args: -device 'nec-usb-xhci,id=usb-ctl-synoboot,addr=0x18' -drive 'id=usb-drv-synoboot,file=/var/lib/vz/images/100/synoboot_103b_ds3617_roman.img,if=none,format=raw' -device 'usb-storage,id=usb-stor-synoboot,bootindex=1,removable=off,drive=usb-drv-synoboot' -netdev type=tap,id=net0,ifname=tap100i0 -device e1000e,mac=00:XX:XX:XX:XX:XX,netdev=net0,bus=pci.0,addr=0x12,id=net0

bios: seabios

boot: d

cores: 4

cpu: IvyBridge

hotplug: disk,network,usb

memory: 2048

name: NAS-Synology

numa: 0

onboot: 1

ostype: l26

sata0: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4NXXXXXXX,size=2930266584K

sata1: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4NYYYYYYY,size=2930266584K

sata2: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4NZZZZZZZ,size=2930266584K

sata3: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7KAAAAAAA,size=3907018584K

scsihw: virtio-scsi-pci

serial0: socket

smbios1: uuid=9ba1da8f-1321-4a5e-8b00-c7020e51f8ee

sockets: 1

startup: order=1

usb0: host=5-2,usb3=1
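
To make a config like the one above easier to inspect, here is a rough Python sketch that reads the `key: value` lines into a dict and splits the `args` line into its qemu options. It is a simplification under assumptions: it ignores snapshot sections, and `parse_vm_config` and `split_args` are illustrative helpers, not Proxmox tools.

```python
import shlex


def parse_vm_config(text):
    """Parse 'key: value' lines into a dict, skipping blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        config[key.strip()] = value.strip()
    return config


def split_args(args_value):
    """Split the raw 'args' value into (option, value) pairs, e.g. ('-device', '...')."""
    tokens = shlex.split(args_value)  # honours the single quotes used in the config
    pairs = []
    index = 0
    while index < len(tokens):
        option = tokens[index]
        has_value = index + 1 < len(tokens) and not tokens[index + 1].startswith("-")
        pairs.append((option, tokens[index + 1] if has_value else None))
        index += 2 if has_value else 1
    return pairs


if __name__ == "__main__":
    sample = (
        "args: -device 'nec-usb-xhci,id=usb-ctl-synoboot,addr=0x18' "
        "-netdev type=tap,id=net0,ifname=tap100i0\n"
        "bios: seabios\n"
        "cpu: IvyBridge\n"
    )
    config = parse_vm_config(sample)
    for option, value in split_args(config["args"]):
        print(option, value)
```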

 

  • 1 month later...

I am having a similar issue with my DSM VM after the upgrade. 

 

Mine will boot but there are no NICs with IP addresses to access the VM and the console just points me to the find.synology.com process. 

 

This VM is my active backup/clone so I can just nuke it and re-seed. I would prefer not to do that but this may be the faster method to getting things back up to the previous state. 


This is how I run DS3615 DSM6.2.3u3 with Jun's 1.03b and PVE 7.0.1 with latest updates:

args: -drive 'if=none,id=synoboot,format=raw,file=/var/lib/vz/images/xxx/synoboot.img' -device 'usb-storage,bus=ehci.0,drive=synoboot,bootindex=5'
boot: order=sata0
cores: 8
cpu: host,flags=+aes
hostpci0: 0000:04:00,pcie=1
machine: q35
memory: 8192
name: DSM
net0: vmxnet3=xx:xx:xx:xx:xx:xx,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
sata0: local-lvm:vm-xxx-disk-0,discard=on,size=100G,ssd=1
scsihw: megasas
serial0: socket
serial1: socket
serial2: socket
smbios1: uuid=${random uuid}
sockets: 1
tablet: 0
vmgenid: ${random uuid}

 

Notes:

- I specifically didn't add a usb device; I re-use the pre-existing ehci.0 bus (=usb2.0).

- hostpci0 is a pci-passthrough of a lsi9211 controller with 8 drives - without passthrough, the line is irrelevant to you

- I do use vmxnet3 and can achieve close to the full 10gbps on the nic.

- serial ports are to access log output

 

There is an extra.lzma on the 2nd partition of the bootloader image, but I don't recall if I added it myself or if it pre-existed.

 

I modified grub/grub.cfg in the 1st partition of the bootloader image:

set vid=0x46f4
set pid=0x0001
set extra_args_3615='DiskIdxMap=00080D SataPortMap=866 SasIdxMap=0'

The first two are required to hide the usb boot device from DSM, the third to correct the drive order in DSM, though the last line will be individual for each setup.

 

SataPortMap settings explained:

8 = 1st controller has 8 ports ->  my pci passthrough lsi controller with 8 ports

6 = 2nd controller has 6 ports -> the sata controller with a 100gb additional disk (6=max drive count on PVE sata)

6 = 3rd controller has 6 ports -> for the additional sata controller I get listed with `lspci -k | grep -A2 -i -E '(SCSI|SATA) Controller'`

 

DiskIdxMap settings explained:

00 = first drive of the 1st controller starts at drive 1 ( up to drive 8 )

08 = first drive of the 2nd controller starts at drive 9 ( up to drive 14 )

0D = first drive of the 3rd controller starts at drive 15 ( up to drive 20 )

 

As I wrote, DiskIdxMap and SataPortMap are individual to each setup.
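
The two settings can be decoded mechanically. Below is a small Python sketch, assuming the reading used in the explanation above: SataPortMap is one decimal digit (port count) per controller, and DiskIdxMap is one two-digit hex value per controller giving the zero-based index of its first drive. The `decode_maps` helper is illustrative only, not part of the loader.

```python
def decode_maps(sata_port_map, disk_idx_map):
    """Decode SataPortMap/DiskIdxMap into per-controller drive ranges (1-based).

    Assumption: one digit per controller in SataPortMap, two hex digits per
    controller in DiskIdxMap, the latter being a zero-based start index.
    """
    ports = [int(digit) for digit in sata_port_map]
    starts = [int(disk_idx_map[i:i + 2], 16) for i in range(0, len(disk_idx_map), 2)]
    layout = []
    for number, (count, start) in enumerate(zip(ports, starts), start=1):
        layout.append(
            {
                "controller": number,
                "ports": count,
                "first_drive": start + 1,   # convert zero-based index to drive number
                "last_drive": start + count,
            }
        )
    return layout


if __name__ == "__main__":
    for entry in decode_maps("866", "00080D"):
        print(entry)
```

Under this reading `0D` is 13 in decimal, so the third controller would start at drive 14 rather than 15, and `0E` would be the value that gives drive 15. Take that as an observation from the arithmetic, not as a tested correction of a working setup.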

 

 

 

 

Edited by haydibe

I suspect that, in my case, the problem's root cause is some dynamic parameter which may be changing between boots (Xpenology and/or Proxmox reboots), probably due to some recent change in Proxmox, and which finally leads to DSM "detecting" that the hw has changed (but then I cannot explain why a restored backup does work; it shouldn't either).

 

In your experience, which kind of hw changes do you think DSM could be detecting? See the (Proxmox) config I previously posted (1st post in this thread).

 

Re @haydibe's config, it seems simpler than mine, but he took a different approach: I bet he included updated drivers in extra.lzma (so that, for instance, vmxnet3 works). My approach was to always keep the same synoboot img (no modifications, apart from the initial config), valid for different DSM versions.

 

Maybe my approach is worse, since I have to maintain some hacks like the use of the e1000e driver, which is not supported by Proxmox.

 

I'll think about it and maybe I'll go for @haydibe's config.

 

Thx!


You are overriding more settings manually in the "args" setting than I do.

I prefer to keep my configuration as clean and simple as possible, and as complicated as necessary :)

 

Observations:

I am not limiting the cpu to a specific architecture; I just use the host cpu as is, so as not to artificially deactivate cpu features. My config lacks an explicit bios setting, but seabios still applies, as it is the default.

 

The next thing that catches the eye is that I use the q35 chipset, and you probably use i440fx?

 

For instance, I do not add the `nec-usb-xhci` usb 3.1 controller; I simply use the already existing "ehci.0" usb 2.0 controller. Jun injects the nec usb3.1 drivers, so this shouldn't be the issue. Your usb drive has the additional attribute "removable=off" and you set an id for that drive (which I don't, as the handle is not required for anything later on).

 

I have never added a network driver using args and as such lack the experience to judge the validity of your configuration for the network part.

Since you define a MAC, bus and addr, I would expect that it "nails down" possible dynamic aspects, but again: I have no idea if it does.

 

Also, I don't have hotplug configured at all and I did not specifically set any usb related stuff in the UI. I really just hook into the default ehci device: I declare the boot drive and attach it to the ehci bus.

 

Probably you are right and I added the extra.lzma of ig88. I am not entirely sure, though it would make sense, as I am sure I at least tried to use virtio-nic instead but, for whatever reason, ended up using the vmxnet3 driver. I must have had problems with the virtio driver; I don't remember. All three of my nodes have a 10gbps nic, so of course I was going to aim for a 10gbps capable interface in XPE as well.
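
The differences between the two configs discussed in these observations can also be listed mechanically. A minimal Python sketch, using only a handful of keys copied from the two configs in this thread; `compare_configs` is a hypothetical helper and the dicts are typed in by hand rather than read from the real files.

```python
def compare_configs(first, second):
    """Return (keys only in first, keys only in second, keys whose values differ)."""
    only_first = sorted(set(first) - set(second))
    only_second = sorted(set(second) - set(first))
    different = {
        key: (first[key], second[key])
        for key in sorted(set(first) & set(second))
        if first[key] != second[key]
    }
    return only_first, only_second, different


if __name__ == "__main__":
    # Subset of the first post's config.
    original = {
        "bios": "seabios",
        "cpu": "IvyBridge",
        "hotplug": "disk,network,usb",
        "memory": "2048",
        "scsihw": "virtio-scsi-pci",
    }
    # Subset of the q35-based config posted later in the thread.
    other = {
        "cpu": "host,flags=+aes",
        "machine": "q35",
        "memory": "8192",
        "scsihw": "megasas",
        "tablet": "0",
    }
    only_original, only_other, different = compare_configs(original, other)
    print("only in first config:", only_original)
    print("only in second config:", only_other)
    for key, (left, right) in different.items():
        print(f"{key}: {left} -> {right}")
```

Changing one differing key at a time and rebooting the host in between would be one way to narrow down which setting matters, if any of them does.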

 

 

 
