XPEnology Community

docker causing kernel panics after move to 6.2.4 on redpill


WiteWulf


Morning all, I'm posting this as a new topic because, although it follows on from an installation of a redpill bootloader, I don't think it's entirely relevant to the main thread. I hope that was the right decision.

 

I recently migrated my baremetal HP Gen8 Microserver (more details in my sig) from Jun's bootloader and 6.2.3 to redpill and 6.2.4. While the system is generally stable and fully functional once up and running, it can sometimes be unstable immediately after booting into DSM, and I have seen it spontaneously reboot on a few occasions. I figured out how to get at the serial console through iLO yesterday and have tracked the problem down to docker causing kernel panics.
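For anyone wanting to do the same, reaching the console looks roughly like this (a minimal sketch: the iLO address is just an example, and the prompts vary a little between iLO firmware versions):

ssh Administrator@ilo.example.lan
</>hpiLO-> vsp
# 'vsp' starts the Virtual Serial Port session; ESC followed by '(' should drop you back to the iLO CLI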

 

Below are three captures from the console:
 

[  103.818908] IPv6: ADDRCONF(NETDEV_CHANGE): dockerc0d00b9: link becomes ready
[  103.854064] docker0: port 1(dockerc0d00b9) entered forwarding state
[  103.885457] docker0: port 1(dockerc0d00b9) entered forwarding state
[  103.923967] IPv6: ADDRCONF(NETDEV_CHANGE): docker3a8cc71: link becomes ready
[  103.958558] docker0: port 2(docker3a8cc71) entered forwarding state
[  103.990058] docker0: port 2(docker3a8cc71) entered forwarding state
[  105.796257] aufs au_opts_verify:1571:dockerd[17060]: dirperm1 breaks the protection by the permission bits on the lower branch
[  118.904833] docker0: port 1(dockerc0d00b9) entered forwarding state
[  119.032982] docker0: port 2(docker3a8cc71) entered forwarding state
[  138.152539] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7
[  138.188407] CPU: 7 PID: 19491 Comm: influxd Tainted: PF          O 3.10.105 #25556
[  138.225421] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  138.259973]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109bc8d58
[  138.296392]  ffff880109bc8cf0 0000000000000000 0000000000000007 000000000000001b
[  138.333751]  0000000000000007 ffffffff80000001 0000000000000010 ffff8801038d8c00
[  138.371395] Call Trace:
[  138.383636]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[  138.412498]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[  138.436526]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[  138.469406]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[  138.499832]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[  138.532041]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[  138.563004]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[  138.593865]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[  138.618725]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[  138.647390]  <<EOE>> 
[  138.658238] Rebooting in 3 seconds..

...and again:

[  120.861792] IPv6: ADDRCONF(NETDEV_CHANGE): docker81429d6: link becomes ready
[  120.895824] docker0: port 2(docker81429d6) entered forwarding state
[  120.926580] docker0: port 2(docker81429d6) entered forwarding state
[  120.997947] IPv6: ADDRCONF(NETDEV_CHANGE): docker28a2e17: link becomes ready
[  121.032061] docker0: port 1(docker28a2e17) entered forwarding state
[  121.063120] docker0: port 1(docker28a2e17) entered forwarding state
[  136.106729] docker0: port 1(docker28a2e17) entered forwarding state
[  136.518407] docker0: port 2(docker81429d6) entered forwarding state
[  191.452302] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[  191.487637] CPU: 3 PID: 19775 Comm: containerd-shim Tainted: PF          O 3.10.105 #25556
[  191.528112] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  191.562597]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109ac8d58
[  191.599118]  ffff880109ac8cf0 0000000000000000 0000000000000003 000000000000002c
[  191.634943]  0000000000000003 ffffffff80000001 0000000000000010 ffff880103817c00
[  191.670604] Call Trace:
[  191.682506]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[  191.710494]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[  191.735108]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[  191.768203]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[  191.799789]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[  191.834349]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[  191.865505]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[  191.897683]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[  191.922372]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[  191.950899]  <<EOE>> 
[  191.961095] Rebooting in 3 seconds..

After completing the reboot it did it again:

[  140.355745] aufs au_opts_verify:1571:dockerd[18688]: dirperm1 breaks the protection by the permission bits on the lower branch
[  145.666134] docker0: port 3(docker9ee2ff2) entered forwarding state
[  150.217495] aufs au_opts_verify:1571:dockerd[18688]: dirperm1 breaks the protection by the permission bits on the lower branch
[  150.278812] device dockerbc899ad entered promiscuous mode
[  150.305436] IPv6: ADDRCONF(NETDEV_UP): dockerbc899ad: link is not ready
[  152.805689] IPv6: ADDRCONF(NETDEV_CHANGE): dockerbc899ad: link becomes ready
[  152.840264] docker0: port 5(dockerbc899ad) entered forwarding state
[  152.870670] docker0: port 5(dockerbc899ad) entered forwarding state
[  154.476203] docker0: port 4(docker07f3e3e) entered forwarding state
[  167.931582] docker0: port 5(dockerbc899ad) entered forwarding state
[  194.017549] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
[  194.052575] CPU: 2 PID: 19580 Comm: containerd-shim Tainted: PF          O 3.10.105 #25556
[  194.094270] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  194.128400]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109a88d58
[  194.164811]  ffff880109a88cf0 0000000000000000 0000000000000002 000000000000002b
[  194.201332]  0000000000000002 ffffffff80000001 0000000000000010 ffff880103ee5c00
[  194.238138] Call Trace:
[  194.250471]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[  194.279225]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[  194.304100]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[  194.337400]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[  194.368795]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[  194.401338]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[  194.432957]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[  194.464708]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[  194.488902]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[  194.517219]  <<EOE>> 
[  195.556746] Shutting down cpus with NMI
[  195.576047] Rebooting in 3 seconds..

 

Following the third reboot I logged into the DSM web UI as quickly as I could after booting and stopped all the running containers. It's been stable since then.
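(If the web UI is too slow to reach before it falls over, the same thing can be done from an SSH session; a sketch, assuming the Docker package's CLI is on the path:)

sudo docker stop $(sudo docker ps -q)     # stop every running container in one go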

 

Each time it kernel panic'd in these three examples, the offending process was docker-related: either containerd-shim or influxd (the process inside an InfluxDB container).

 

The influxdb container is running off a pretty ancient image version (1.7.7), so I'm going to try updating the image and see if it's any more stable. I'll update with results.
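For reference, the CLI equivalent of that image refresh is something like the following sketch (container name, tag and volume path are placeholders rather than my exact setup):

docker pull influxdb:1.8
docker stop influxdb && docker rm influxdb
docker run -d --name influxdb -p 8086:8086 \
  -v /volume1/docker/influxdb:/var/lib/influxdb \
  influxdb:1.8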


I updated Grafana (which queries the influxdb) to 6.7.6 and influxdb to 1.9.3, and that seemed to go well without crashing, but then when I stopped influxdb it kernel panic'd again.

 

1.9.3 is the latest version of the 1.x train, so I'm going to try upgrading to a later major release and see if that helps matters...

 

...

 

Further experimentation with updating the version of influxdb suggests that the kernel panics are always related either to the influxdb process itself or to containerd-shim, and seem to occur when either starting or stopping the influxdb container. If the container starts up without precipitating a kernel panic it appears to continue to run without error indefinitely.

Edited by WiteWulf

1 hour ago, WiteWulf said:

1.9.3 is the latest version of the 1.x train, so I'm going to try upgrading to a later major release and see if that helps matters...

 

Last time I checked, InfluxDB 2.x used a different query language, which will require you to update all the queries in your dashboards... Though this was two or three months ago; maybe they've finally ported the old query language to InfluxDB 2.x as well.
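To give a feel for the difference, a typical 1.x InfluxQL panel query and a rough Flux equivalent look like this (measurement, field and bucket names are only examples):

InfluxQL (1.x):
SELECT mean("usage_idle") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)

Flux (2.x):
from(bucket: "telegraf/autogen")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
  |> aggregateWindow(every: 5m, fn: mean)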


That explains why my dashboards were all empty when I tried it with 2.x, then :)

 

Ah, the joys of upgrading database containers for docker apps. This is why I typically take an "if it ain't broke, don't fix it" approach to stuff at home...

Edited by WiteWulf

Well, I've set up a new influxdb container (2.0.8), migrated the 1.7.7 data to it and it's not crashed so far! Just got to get Grafana talking to it again now. I'm not going to update this thread any more as I don't think it's relevant to xpenology from this point.
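For anyone attempting the same migration, the 2.x releases ship an upgrade tool that can ingest a 1.x data directory. Run through Docker it looks roughly like this (the paths are placeholders for wherever the old and new containers keep their data, and the tool prompts for an initial user/org/bucket):

docker run --rm -it \
  -v /volume1/docker/influxdb:/var/lib/influxdb \
  -v /volume1/docker/influxdb2:/var/lib/influxdb2 \
  influxdb:2.0.8 influxd upgrade \
    --v1-dir /var/lib/influxdb \
    --engine-path /var/lib/influxdb2/engine \
    --bolt-path /var/lib/influxdb2/influxd.bolt

If memory serves, the official 2.x image can also run the same upgrade automatically on first start via its DOCKER_INFLUXDB_INIT_MODE=upgrade environment variable.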

 

*edit*

 

Spoiler alert: it kernel panic'd again 🙄

 

I give up...

 

*edit2*

 

I forgot to mention that I also upgraded DSM to 7.0.1-RC1 on redpill while I was working on this, and it obviously didn't fix the problem.

Edited by WiteWulf

  • 2 weeks later...

Quick update as no one in the main redpill thread seemed to want to move the discussion here :)

 

As more people move to running redpill on their systems there have been additional reports of docker either kernel panic'ing on baremetal installs, or locking up the VM on Proxmox and ESXi installs.

 

A commonality seems to be that those running ds3615xs images are seeing crashes, while those running ds918+ images are stable. One user has run both 918+ and 3615xs images on the same hardware and observed crashes with the 3615xs image but not the 918+.

 

Reports from owners of genuine Synology hardware suggest that docker is very stable under DSM 7.x.

 

The captured output from a docker-related kernel panic was observed to be very similar to that in this Red Hat issue:

https://access.redhat.com/solutions/1354963

 

Red Hat indicates the problem was fixed in later kernel versions, although the kernel in DSM 7.0.1-RC1 appears to be newer than the one referenced by RH.
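(For comparison, the running kernel build can be checked over SSH; the panics above were on 3.10.105:)

uname -r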

 

I think I've got enough to log an issue against redpill-lkm on GitHub now, so I'll do that and report back with any updates.


  • 2 weeks later...
3 hours ago, matyyy said:

It's not only docker causing this.

 

[screenshot of the crash attached]

searching in Control Panel causes it too... :( (7.0.1 RC)

 

Yes, synoelastid (Universal Search) is also one of the things causing soft lockups and system crashes. Check the issues on GitHub for more details.

Edited by pocopico

With the recent commits to redpill-lkm it looks like the soft lockups on virtualised systems are now solved, but hard lockups on baremetal are still a problem.

 

TTG theorise that this could be down to "an over-reactive kernel protection mechanism", namely the NMI watchdog. The NMI watchdog exists (among other things) to automatically reboot a Linux machine if it gets hung up for some reason. Unfortunately it appears to be falsely identifying kernel hangs on baremetal with the DS3615xs image.

 

A workaround is to disable the NMI watchdog, but this is not recommended and you should understand the implications of taking this step before applying it to your system. Having said that, I've been running my system for a few days now with it disabled and it has been very stable. YMMV!!!
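For anyone who does decide to try it with eyes open, it's the standard Linux sysctl knob, nothing redpill-specific; a sketch (the change doesn't survive a reboot unless you script it, or add nmi_watchdog=0 to the kernel command line):

cat /proc/sys/kernel/nmi_watchdog        # 1 = watchdog enabled, 0 = disabled
echo 0 > /proc/sys/kernel/nmi_watchdog   # disable it for the running session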


In the past (before the fixes) I did try to (a) increase the watchdog threshold to 60 and (b) disable the NMI watchdog, and I was still getting lockups. Sometimes hard, sometimes soft. It's not a viable long-term solution. I guess disabling the NMI watchdog was just to increase the team's confidence in a possible fix. Let's wait and see.


Yes, I agree that it's not a viable long-term solution. I also tried increasing the watchdog threshold to a higher value and, while the kernel panics took longer to occur (matching the increased timer value), they still happened eventually. I hadn't tried disabling it altogether before, though.
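(For reference, the usual knob for that threshold is the kernel.watchdog_thresh sysctl; it's in seconds and defaults to 10. A sketch:)

sysctl -w kernel.watchdog_thresh=60      # stretch the lockup detection window to 60s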

 

As you say, let's wait and see. Hopefully these developments will lead to a more sustainable solution.


  • 2 weeks later...

@Orphée @nemesis122 @pocopico @erkify @dodo-dk

 

Hi folks, I asked you previously what CPU you had in your Gen8 machines when experiencing the kernel panic and soft lockup issues, and thank you for your responses. In the main thread @Kouill has demonstrated a Gen8 running the influx-test docker stress test without it crashing their machine since they moved to using the onboard tg3 NIC rather than a PCIe NIC.

 

I've had less success on my system trying the same thing (it's more stable, but still kernel panics under load), so I'm curious to know what NIC you were using on your systems?

 

If a particular PCIe NIC turns out to be common to the kernel panics, that could be significant.
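If you're not sure which driver your NIC is using under DSM, something like this over SSH will show it (the interface name is an example; pick whichever one carries your traffic):

ethtool -i eth0
# the "driver:" line reports tg3 for the Gen8's onboard ports, or e.g. e1000e for a common Intel PCIe card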


@Kouill was kind enough to give me a copy of their boot img so I could attempt to replicate their influx-test without crashes.

 

Unfortunately the influx-test kernel panic'd my system within a few seconds.

 

NB. there were the following differences:

- Kouill boots their system from the internal microSD slot, whereas mine was on a USB stick in the internal port

- in the grub.cfg I changed the serial number to match mine

- in the grub.cfg I changed mac1 to match mine

- in the grub.cfg I added mac2 with my MAC address (illustrative fragment below)
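For anyone comparing notes, those are the usual loader arguments that end up on the kernel command line; the relevant fragment of the generated grub.cfg is something along these lines (values are obviously placeholders, and the exact layout depends on how the image was built):

sn=XXXX12345678 mac1=001122AABBCC mac2=001122AABBCD netif_num=2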


Just now, Orphée said:

Well, you already know it, but I'm using ESXi with an LSI SAS HBA card.

So I don't have a spare PCIe slot. I'm using the internal NIC, but with ESXi the VM is set up with E1000e, or VMXNet3 with an additional driver.

Yeah, you've got something in your PCIe slot, though. That may be significant.


In a virtual machine?

Do you really think so?

The SAS card was not passed through to this VM. Before the latest patch I was able to make the VM crash with the default virtual SATA disk (no real HDD with HBA passthrough).

 

I was running:

- main prod Xpenology on Jun's loader with HBA card passthrough

- a new redpill VM with a virtual SATA drive only.

