docker causing kernel panics after move to 6.2.4 on redpill



Morning all, I'm posting this as a new topic because, although it follows an installation of a redpill bootloader, I don't think it's entirely relevant to the main thread. I hope that was the right decision.

 

I recently migrated my baremetal HP Gen8 Microserver (more details in my sig) from Jun's bootloader and 6.2.3 to redpill and 6.2.4. While the system is stable and fully functional once booted, I observed that it can sometimes be unstable immediately after booting into DSM and have seen it spontaneously reboot on a few occasions. I figured out how to get at the serial console through iLO yesterday and have tracked the problem down to docker causing kernel panics.

 

Below are three captures from the console:
 

[  103.818908] IPv6: ADDRCONF(NETDEV_CHANGE): dockerc0d00b9: link becomes ready
[  103.854064] docker0: port 1(dockerc0d00b9) entered forwarding state
[  103.885457] docker0: port 1(dockerc0d00b9) entered forwarding state
[  103.923967] IPv6: ADDRCONF(NETDEV_CHANGE): docker3a8cc71: link becomes ready
[  103.958558] docker0: port 2(docker3a8cc71) entered forwarding state
[  103.990058] docker0: port 2(docker3a8cc71) entered forwarding state
[  105.796257] aufs au_opts_verify:1571:dockerd[17060]: dirperm1 breaks the protection by the permission bits on the lower branch
[  118.904833] docker0: port 1(dockerc0d00b9) entered forwarding state
[  119.032982] docker0: port 2(docker3a8cc71) entered forwarding state
[  138.152539] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7
[  138.188407] CPU: 7 PID: 19491 Comm: influxd Tainted: PF          O 3.10.105 #25556
[  138.225421] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  138.259973]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109bc8d58
[  138.296392]  ffff880109bc8cf0 0000000000000000 0000000000000007 000000000000001b
[  138.333751]  0000000000000007 ffffffff80000001 0000000000000010 ffff8801038d8c00
[  138.371395] Call Trace:
[  138.383636]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[  138.412498]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[  138.436526]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[  138.469406]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[  138.499832]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[  138.532041]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[  138.563004]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[  138.593865]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[  138.618725]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[  138.647390]  <<EOE>> 
[  138.658238] Rebooting in 3 seconds..

...and again:

[  120.861792] IPv6: ADDRCONF(NETDEV_CHANGE): docker81429d6: link becomes ready
[  120.895824] docker0: port 2(docker81429d6) entered forwarding state
[  120.926580] docker0: port 2(docker81429d6) entered forwarding state
[  120.997947] IPv6: ADDRCONF(NETDEV_CHANGE): docker28a2e17: link becomes ready
[  121.032061] docker0: port 1(docker28a2e17) entered forwarding state
[  121.063120] docker0: port 1(docker28a2e17) entered forwarding state
[  136.106729] docker0: port 1(docker28a2e17) entered forwarding state
[  136.518407] docker0: port 2(docker81429d6) entered forwarding state
[  191.452302] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[  191.487637] CPU: 3 PID: 19775 Comm: containerd-shim Tainted: PF          O 3.10.105 #25556
[  191.528112] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  191.562597]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109ac8d58
[  191.599118]  ffff880109ac8cf0 0000000000000000 0000000000000003 000000000000002c
[  191.634943]  0000000000000003 ffffffff80000001 0000000000000010 ffff880103817c00
[  191.670604] Call Trace:
[  191.682506]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[  191.710494]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[  191.735108]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[  191.768203]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[  191.799789]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[  191.834349]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[  191.865505]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[  191.897683]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[  191.922372]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[  191.950899]  <<EOE>> 
[  191.961095] Rebooting in 3 seconds..

After completing the reboot it did it again:

[  140.355745] aufs au_opts_verify:1571:dockerd[18688]: dirperm1 breaks the protection by the permission bits on the lower branch
[  145.666134] docker0: port 3(docker9ee2ff2) entered forwarding state
[  150.217495] aufs au_opts_verify:1571:dockerd[18688]: dirperm1 breaks the protection by the permission bits on the lower branch
[  150.278812] device dockerbc899ad entered promiscuous mode
[  150.305436] IPv6: ADDRCONF(NETDEV_UP): dockerbc899ad: link is not ready
[  152.805689] IPv6: ADDRCONF(NETDEV_CHANGE): dockerbc899ad: link becomes ready
[  152.840264] docker0: port 5(dockerbc899ad) entered forwarding state
[  152.870670] docker0: port 5(dockerbc899ad) entered forwarding state
[  154.476203] docker0: port 4(docker07f3e3e) entered forwarding state
[  167.931582] docker0: port 5(dockerbc899ad) entered forwarding state
[  194.017549] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
[  194.052575] CPU: 2 PID: 19580 Comm: containerd-shim Tainted: PF          O 3.10.105 #25556
[  194.094270] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  194.128400]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109a88d58
[  194.164811]  ffff880109a88cf0 0000000000000000 0000000000000002 000000000000002b
[  194.201332]  0000000000000002 ffffffff80000001 0000000000000010 ffff880103ee5c00
[  194.238138] Call Trace:
[  194.250471]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[  194.279225]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[  194.304100]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[  194.337400]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[  194.368795]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[  194.401338]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[  194.432957]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[  194.464708]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[  194.488902]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[  194.517219]  <<EOE>> 
[  195.556746] Shutting down cpus with NMI
[  195.576047] Rebooting in 3 seconds..

 

Following the third reboot I logged into the DSM web UI as quickly as I could after booting and stopped all the running containers. It's been stable since then.
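For anyone in the same boat who can't get to the web UI fast enough, the same thing can be done from an SSH session. A minimal sketch (guarded so it's a harmless no-op on a machine without docker; run as root/with sudo on DSM):

```shell
# Stop every running container in one go.
if command -v docker >/dev/null 2>&1; then
  # `docker ps -q` lists the IDs of running containers;
  # `xargs -r` skips the stop call when there are none.
  docker ps -q | xargs -r docker stop
  status="stopped"
else
  status="no-docker"
fi
echo "$status"
```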

 

In each of these three examples the kernel panic was related to docker: the offending process was either containerd-shim or influxd (a process running in an InfluxDB container).

 

The InfluxDB container is running off a pretty ancient image version (1.7.7), so I'm going to try updating the image and see if it's any more stable. I'll update with results.


I updated Grafana (which accesses the InfluxDB instance) to 6.7.6 and InfluxDB to 1.9.3, and that seemed to go well without crashing, but then when I stopped InfluxDB it kernel panic'd again.

 

1.9.3 is the latest version of the 1.x train, so I'm going to try upgrading to a later major release and see if that helps matters...

 

...

 

Further experimentation with updating the version of InfluxDB suggests that the kernel panics are always related either to the influxd process itself or to containerd-shim, and seem to occur when starting or stopping the InfluxDB container. If the container starts up without precipitating a kernel panic, it appears to continue running without error indefinitely.

Edited by WiteWulf
1 hour ago, WiteWulf said:

1.9.3 is the latest version of the 1.x train, so I'm going to try upgrading to later major version release and see if that helps matters...

 

Last time I checked, InfluxDB 2.x used a different query language, which will require you to update all the queries in your dashboards... though that was two or three months ago; maybe they've finally ported the old query language to InfluxDB 2.x as well.
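For illustration, here's the same (hypothetical) CPU query written both ways; measurement and bucket names are just examples, not anything from the thread:

```
-- InfluxQL (1.x):
SELECT mean("usage_idle") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)

// Flux (2.x):
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
  |> aggregateWindow(every: 5m, fn: mean)
```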


That explains why my dashboards were all empty when I tried it with 2.x, then :)

 

Ah, the joys of upgrading database containers for docker apps. This is why I typically take an "if it ain't broke, don't fix it" approach to stuff at home...


I've definitely tied this down to the influxdb container. Everything else runs quite happily now when set to auto-start on boot. Guess I need to do the database migration properly and see how that turns out.


Well, I've set up a new influxdb container (2.0.8), migrated the 1.7.7 data to it and it's not crashed so far! Just got to get Grafana talking to it again now. I'm not going to update this thread any more as I don't think it's relevant to xpenology from this point.
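For reference, the 1.x-to-2.x migration can be done with the `influxd upgrade` subcommand that ships with 2.x. This is only a sketch: the paths are invented examples, not my actual layout, and it's guarded so it's a no-op where influxd isn't installed.

```shell
# Sketch of the built-in 1.x -> 2.x data migration (example paths).
if command -v influxd >/dev/null 2>&1; then
  influxd upgrade \
    --v1-dir /volume1/docker/influxdb \
    --bolt-path /volume1/docker/influxdb2/influxd.bolt \
    --engine-path /volume1/docker/influxdb2/engine
  result="upgrade-attempted"
else
  result="no-influxd"
fi
echo "$result"
```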

 

*edit*

 

Spoiler alert: it kernel panic'd again 🙄

 

I give up...

 

*edit2*

 

I forgot to mention that I also upgraded DSM to 7.0.1-RC1 on redpill while I was working on this, and it obviously didn't fix the problem.

2 weeks later...

Quick update as no one in the main redpill thread seemed to want to move the discussion here :)

 

As more people move to running redpill on their systems, there have been additional reports of docker either kernel panic'ing on baremetal installs or locking up the VM on Proxmox and ESXi installs.

 

A commonality seems to be that those running ds3615xs images are seeing crashes, while those running ds918+ images are stable. One user has run both images on the same hardware and observed crashes with the ds3615xs image but not the ds918+.

 

Reports from owners of genuine Synology hardware suggest that docker is very stable under DSM 7.x.

 

The captured output from a docker-related kernel panic was observed to be very similar to that in this Red Hat issue:

https://access.redhat.com/solutions/1354963

 

Red Hat indicates the problem was fixed in later kernel versions, although the kernel in DSM 7.0.1-RC1 appears to be later than the one referenced in that article.
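For anyone wanting to compare their own kernel against the one Red Hat references, it can be checked over SSH:

```shell
# Print the running kernel release. On my 6.2.4 install this matched
# the "3.10.105" shown in the Tainted lines of the panics above.
kernel="$(uname -r)"
echo "$kernel"
```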

 

I think I've got enough to log an issue against redpill-lkm on GitHub now, so I'll do that and report back with any updates.

