WiteWulf · September 8, 2021 · #1

Morning all, I'm posting this as a new topic as, although it relates to and follows an installation of the redpill bootloader, I don't think it's entirely relevant to the main thread. I hope that was the right decision.

I recently migrated my baremetal HP Gen8 MicroServer (more details in my sig) from Jun's bootloader and 6.2.3 to redpill and 6.2.4. While the system is stable and fully functional once booted, it can sometimes be unstable immediately after booting into DSM, and I've seen it spontaneously reboot on a few occasions. I figured out how to get at the serial console through iLO yesterday and have tracked the problem down to docker causing kernel panics. Below are three captures from the console:

[ 103.818908] IPv6: ADDRCONF(NETDEV_CHANGE): dockerc0d00b9: link becomes ready
[ 103.854064] docker0: port 1(dockerc0d00b9) entered forwarding state
[ 103.885457] docker0: port 1(dockerc0d00b9) entered forwarding state
[ 103.923967] IPv6: ADDRCONF(NETDEV_CHANGE): docker3a8cc71: link becomes ready
[ 103.958558] docker0: port 2(docker3a8cc71) entered forwarding state
[ 103.990058] docker0: port 2(docker3a8cc71) entered forwarding state
[ 105.796257] aufs au_opts_verify:1571:dockerd[17060]: dirperm1 breaks the protection by the permission bits on the lower branch
[ 118.904833] docker0: port 1(dockerc0d00b9) entered forwarding state
[ 119.032982] docker0: port 2(docker3a8cc71) entered forwarding state
[ 138.152539] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7
[ 138.188407] CPU: 7 PID: 19491 Comm: influxd Tainted: PF O 3.10.105 #25556
[ 138.225421] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[ 138.259973] ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109bc8d58
[ 138.296392] ffff880109bc8cf0 0000000000000000 0000000000000007 000000000000001b
[ 138.333751] 0000000000000007 ffffffff80000001 0000000000000010 ffff8801038d8c00
[ 138.371395] Call Trace:
[ 138.383636] <NMI> [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[ 138.412498] [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[ 138.436526] [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[ 138.469406] [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[ 138.499832] [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[ 138.532041] [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[ 138.563004] [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[ 138.593865] [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[ 138.618725] [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[ 138.647390] <<EOE>>
[ 138.658238] Rebooting in 3 seconds..

...and again:

[ 120.861792] IPv6: ADDRCONF(NETDEV_CHANGE): docker81429d6: link becomes ready
[ 120.895824] docker0: port 2(docker81429d6) entered forwarding state
[ 120.926580] docker0: port 2(docker81429d6) entered forwarding state
[ 120.997947] IPv6: ADDRCONF(NETDEV_CHANGE): docker28a2e17: link becomes ready
[ 121.032061] docker0: port 1(docker28a2e17) entered forwarding state
[ 121.063120] docker0: port 1(docker28a2e17) entered forwarding state
[ 136.106729] docker0: port 1(docker28a2e17) entered forwarding state
[ 136.518407] docker0: port 2(docker81429d6) entered forwarding state
[ 191.452302] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[ 191.487637] CPU: 3 PID: 19775 Comm: containerd-shim Tainted: PF O 3.10.105 #25556
[ 191.528112] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[ 191.562597] ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109ac8d58
[ 191.599118] ffff880109ac8cf0 0000000000000000 0000000000000003 000000000000002c
[ 191.634943] 0000000000000003 ffffffff80000001 0000000000000010 ffff880103817c00
[ 191.670604] Call Trace:
[ 191.682506] <NMI> [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[ 191.710494] [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[ 191.735108] [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[ 191.768203] [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[ 191.799789] [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[ 191.834349] [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[ 191.865505] [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[ 191.897683] [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[ 191.922372] [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[ 191.950899] <<EOE>>
[ 191.961095] Rebooting in 3 seconds..

After completing the reboot it did it again:

[ 140.355745] aufs au_opts_verify:1571:dockerd[18688]: dirperm1 breaks the protection by the permission bits on the lower branch
[ 145.666134] docker0: port 3(docker9ee2ff2) entered forwarding state
[ 150.217495] aufs au_opts_verify:1571:dockerd[18688]: dirperm1 breaks the protection by the permission bits on the lower branch
[ 150.278812] device dockerbc899ad entered promiscuous mode
[ 150.305436] IPv6: ADDRCONF(NETDEV_UP): dockerbc899ad: link is not ready
[ 152.805689] IPv6: ADDRCONF(NETDEV_CHANGE): dockerbc899ad: link becomes ready
[ 152.840264] docker0: port 5(dockerbc899ad) entered forwarding state
[ 152.870670] docker0: port 5(dockerbc899ad) entered forwarding state
[ 154.476203] docker0: port 4(docker07f3e3e) entered forwarding state
[ 167.931582] docker0: port 5(dockerbc899ad) entered forwarding state
[ 194.017549] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
[ 194.052575] CPU: 2 PID: 19580 Comm: containerd-shim Tainted: PF O 3.10.105 #25556
[ 194.094270] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[ 194.128400] ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109a88d58
[ 194.164811] ffff880109a88cf0 0000000000000000 0000000000000002 000000000000002b
[ 194.201332] 0000000000000002 ffffffff80000001 0000000000000010 ffff880103ee5c00
[ 194.238138] Call Trace:
[ 194.250471] <NMI> [<ffffffff814c904d>] ? dump_stack+0xc/0x15
[ 194.279225] [<ffffffff814c8121>] ? panic+0xbb/0x1ce
[ 194.304100] [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0
[ 194.337400] [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240
[ 194.368795] [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0
[ 194.401338] [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360
[ 194.432957] [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40
[ 194.464708] [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0
[ 194.488902] [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e
[ 194.517219] <<EOE>>
[ 195.556746] Shutting down cpus with NMI
[ 195.576047] Rebooting in 3 seconds..

Following the third reboot I logged into the DSM web UI as quickly as I could after booting and stopped all the running containers. It's been stable since then. Each time it kernel panicked in these three examples it was something to do with docker: either containerd-shim or influxd (a process in an InfluxDB container). The InfluxDB container is running off a pretty ancient image version (1.7.7), so I'm going to try updating the image and see if it's any more stable. I'll update with results.
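For anyone else racing the panic at boot: the "stop everything" step can also be done from a root SSH shell rather than the web UI. A minimal sketch, assuming the docker CLI is on the PATH (as it is on DSM with the Docker package installed); the container name in the last line is illustrative:

```
# Stop every running container. xargs -r skips the stop call
# entirely when nothing is running, so this is safe to re-run.
docker ps -q | xargs -r docker stop

# Optionally keep a suspect container from auto-starting on boot:
# docker update --restart=no influxdb
```

The `--restart=no` trick is handy while bisecting which container is the trigger, since it survives reboots where a manual stop does not.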
WiteWulf · September 8, 2021 · #2 (edited)

I updated Grafana (which reads from the InfluxDB) to 6.7.6 and InfluxDB to 1.9.3, and that seemed to go well without crashing, but then when I stopped InfluxDB it kernel panicked again. 1.9.3 is the latest version of the 1.x train, so I'm going to try upgrading to a later major-version release and see if that helps matters...

Further experimentation with updating the InfluxDB version suggests that the kernel panics are always related either to the influxd process itself or to containerd-shim, and seem to occur when starting or stopping the InfluxDB container. If the container starts up without precipitating a kernel panic, it appears to continue running without error indefinitely.
haydibe · September 8, 2021 · #3

1 hour ago, WiteWulf said: "1.9.3 is the latest version of the 1.x train, so I'm going to try upgrading to a later major-version release and see if that helps matters..."

Last time I checked, InfluxDB 2.x used a different query language, which will require you to update all the queries in your dashboards... Though this was two or three months ago; maybe they've finally ported the old query language to InfluxDB 2.x as well.
WiteWulf · September 8, 2021 · #4 (edited)

That explains why my dashboards were all empty when I tried it with 2.x, then. Ah, the joys of upgrading database containers for docker apps. This is why I typically take an "if it ain't broke, don't fix it" approach to stuff at home...
WiteWulf · September 9, 2021 · #5

I've definitely narrowed this down to the InfluxDB container. Everything else now runs quite happily when set to auto-start on boot. Guess I need to do the database migration properly and see how that turns out.
WiteWulf · September 9, 2021 · #6 (edited)

Well, I've set up a new InfluxDB container (2.0.8) and migrated the 1.7.7 data to it, and it's not crashed so far! Just got to get Grafana talking to it again now. I'm not going to update this thread any more as I don't think it's relevant to xpenology from this point.

*edit* Spoiler alert: it kernel panicked again 🙄 I give up...

*edit2* I forgot to mention that I also upgraded DSM to 7.0.1-RC1 on redpill while I was working on this, and it obviously didn't fix the problem.
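For reference, the 1.x-to-2.x data migration described above is what InfluxDB's own `influxd upgrade` subcommand automates. A sketch of running it via docker; the volume paths and image tag here are illustrative, not taken from the thread:

```
# One-shot upgrade of a 1.x data directory into a fresh 2.x layout.
# Point --v1-dir at (a copy of) the old data; the new engine and
# bolt metadata land under the second mount.
docker run --rm \
  -v /volume1/docker/influxdb:/var/lib/influxdb \
  -v /volume1/docker/influxdb2:/var/lib/influxdb2 \
  influxdb:2.0 influxd upgrade \
    --v1-dir /var/lib/influxdb \
    --engine-path /var/lib/influxdb2/engine \
    --bolt-path /var/lib/influxdb2/influxd.bolt
```

Run it against a copy of the 1.x directory, not the live one, so a failed upgrade leaves the original data untouched.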
WiteWulf · September 17, 2021 · #7

Quick update, as no one in the main redpill thread seemed to want to move the discussion here.

As more people move to running redpill on their systems, there have been additional reports of docker either kernel panicking on baremetal installs or locking up the VM on Proxmox and ESXi installs. A commonality seems to be that those running DS3615xs images are seeing crashes, while those running DS918+ images are stable. One user has run both 918+ and 3615xs images on the same hardware and observed crashes with the 3615xs image but not the 918+. Reports from owners of genuine Synology hardware suggest that docker is very stable under DSM 7.x.

The captured output from a docker-related kernel panic was observed to be very similar to that in this Red Hat open issue: https://access.redhat.com/solutions/1354963 Red Hat indicates the problem was fixed in later kernel versions, although the kernel in DSM 7.0.1-RC1 appears to be later than the one referenced by RH.

I think I've got enough to log an issue against redpill-lkm on GitHub now, so I'll do that and report back with any updates.
WiteWulf Posted September 17, 2021 Author Share #8 Posted September 17, 2021 https://github.com/RedPill-TTG/redpill-lkm/issues/21 1 Quote Link to comment Share on other sites More sharing options...
matyyy · October 1, 2021 · #9

It's not only docker causing this: searching in Control Panel triggers it too... (7.0.1-RC)
pocopico · October 1, 2021 · #10 (edited)

3 hours ago, matyyy said: "It's not only docker causing this: searching in Control Panel triggers it too... (7.0.1-RC)"

Yes, synoelasticd (Universal Search) is also one of the things causing soft lockups and system crashes. Check the issues on GitHub for more details.
WiteWulf · October 2, 2021 · #11 (edited)

There's a potential fix now on the GitHub issue, but I've not had a chance to test it yet.

Update: been doing some testing, and it's better, but not fixed.
Orphée · October 2, 2021 · #12

Running 6.2.4 with the fix deployed. Played with a lot of pictures in Synology Moments... high CPU usage while uploading, but it did not crash. Running InfluxDB seems to be stable too. I will try the usual reboot.
Orphée · October 2, 2021 · #13 (edited)

Reboot OK, it seems stable.
WiteWulf · October 2, 2021 · #14

Looks like we're getting somewhere: https://github.com/RedPill-TTG/redpill-lkm/issues/21#issuecomment-932808905
WiteWulf · October 4, 2021 · #15

With the recent commits to redpill-lkm it looks like the soft lockups on virtualised systems are now solved, but hard lockups on baremetal are still a problem. TTG theorise that this could be down to "an over-reactive kernel protection mechanism", namely the NMI watchdog. The NMI watchdog exists (among other things) to automatically reboot a Linux machine if it gets hung up for some reason. Unfortunately it appears to be falsely identifying kernel hangs on baremetal with the DS3615xs image. A workaround is to disable the NMI watchdog, but this is not recommended, and you should understand the implications of that step before applying it to your system. Having said that, I've been running my system for a few days now with it disabled and it has been very stable. YMMV!!!
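For completeness, the two usual ways to turn the NMI watchdog off on a Linux box are a runtime sysctl and a kernel boot parameter. A sketch (run as root, and note the caveat above: this removes a safety net):

```
# Runtime (lost on reboot):
echo 0 > /proc/sys/kernel/nmi_watchdog
# or equivalently:
sysctl -w kernel.nmi_watchdog=0

# Persistent: append nmi_watchdog=0 to the kernel command line in the
# loader's grub.cfg, alongside the options already set there.
```

You can confirm the current state at any time with `cat /proc/sys/kernel/nmi_watchdog` (1 = enabled, 0 = disabled).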
pocopico · October 4, 2021 · #16

I did try (a) increasing the watchdog threshold to 60 and (b) disabling the NMI watchdog in the past (before the fixes), and I was still getting lockups, sometimes hard, sometimes soft. It's not a viable long-term solution. I guess disabling the NMI watchdog was just to increase the team's confidence in a possible fix. Let's wait and see.
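For reference, the threshold mentioned here is the kernel's lockup-detector timeout, exposed as a sysctl. Raising it to 60 seconds would look like this (run as root; the default on most kernels is 10 seconds):

```
# Give the hard/soft lockup detectors 60s before firing.
sysctl -w kernel.watchdog_thresh=60
```

As the post notes, this only delays the panic if the lockup is real (or persistently misdetected); it doesn't prevent it.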
WiteWulf · October 4, 2021 · #17

Yes, I agree that it's not a viable long-term solution. I also tried increasing the watchdog threshold to a higher value and, while the kernel panics took longer to occur (matching the increased timer value), they still happened eventually. I hadn't tried disabling it altogether before, though. As you say, let's wait and see. Hopefully these developments will lead to a more sustainable solution.
Orphée · October 4, 2021 · #18 (edited)

I have a baremetal HP N54L currently running Jun's 6.2.3 (a remote-access-only backup Syno). I'm curious to see how it will handle this once I migrate it.
BettyBrookni · October 13, 2021 · #19 (edited)

Honestly, I don't think you can have a problem with that. But as usual, stuff happens. 😏
WiteWulf · October 13, 2021 · #20

@Orphée @nemesis122 @pocopico @erkify @dodo-dk

Hi folks, I previously asked what CPU you had in your Gen8 machines when experiencing the kernel panic and soft lockup issues; thank you for your responses. In the main thread, @Kouill has demonstrated a Gen8 running the influx-test docker stress test without it crashing their machine since they moved to using the onboard tg3 NIC rather than a PCIe NIC. I've had less success trying the same thing on my system (it's more stable, but still kernel panics under load), but I'm curious to know which NIC you were using on your systems. If a particular PCIe NIC correlates with the kernel panics, that could be significant.
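If you want to report which NIC and driver your install is actually using, the driver behind each interface can be read from sysfs or ethtool over SSH. A sketch; the interface name `eth0` is an example:

```
# Which kernel module drives eth0? (tg3 = the Gen8's onboard Broadcom)
basename "$(readlink /sys/class/net/eth0/device/driver/module)"

# Or, where ethtool is available:
ethtool -i eth0 | grep '^driver:'
```

Either output makes it easy to compare notes on whether the crashing systems share a driver.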
Orphée · October 13, 2021 · #21

Well, you already know it, but I'm using ESXi with an LSI SAS HBA card, so I don't have a spare PCIe slot. I'm using the internal NIC, but under ESXi the VM is set to E1000e, or VMXNet3 with an additional driver.
WiteWulf · October 13, 2021 · #22

@Kouill was kind enough to give me a copy of their boot img so I could attempt to replicate their influx-test without crashes. Unfortunately the influx-test kernel panicked my system within a few seconds. NB: there were the following differences:

- Kouill boots their system from the internal microSD slot, whereas mine was on a USB stick in the internal port
- in the grub.cfg I changed the serial number to match mine
- in the grub.cfg I changed mac1 to match mine
- in the grub.cfg I added mac2 with my MAC address
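For anyone repeating this comparison, the values in question live in the loader's grub.cfg. An illustrative fragment only; the values are placeholders and the exact set of variables depends on the redpill loader build you use:

```
set sn=1234ABC123456      # serial number, must match your own
set mac1=001122334455     # MAC of the first NIC, hex with no separators
set mac2=001122334456     # second NIC, added here to match the hardware
```

Changing sn/mac while keeping the rest of a known-good image identical, as described above, is a reasonable way to isolate whether the image itself is the variable.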
WiteWulf · October 13, 2021 · #23

Just now, Orphée said: "Well, you already know it, but I'm using ESXi with an LSI SAS HBA card, so I don't have a spare PCIe slot. I'm using the internal NIC, but under ESXi the VM is set to E1000e, or VMXNet3 with an additional driver."

Yeah, you've got something in your PCIe slot, though. That may be significant.
Orphée · October 13, 2021 · #24

In a virtual machine? Do you really think so? The SAS card was not passed through to that VM. Before the latest patch I was able to make the VM crash with a default virtual SATA disk (no real HDD with HBA passthrough). I was running:

- main prod Xpenology on Jun's loader with HBA card passthrough
- a new Redpill VM with a virtual SATA drive only
WiteWulf · October 13, 2021 · #25

From what TTG said, the hard lockups (kernel panics) and soft lockups were quite different, though. I guess my question may only be relevant to those running on baremetal.