DSM sometimes not starting on ProLiant Gen8 G1610T



Hello guys,

 

I am getting crazy with this one.

 

I have an HP ProLiant Gen8 G1610T with Jun's 1.02b loader (ds3615xs) on it. Everything runs fine, except that sometimes when I reboot the server or power it off and on, DSM does not start!

The USB drive is loading, because I see the “Booting Kernel” or “uncompressing kernel” message on the CLI console, but DSM does not start. I don’t see it in Synology Assistant and it does not respond to ping over the network.

 

How do I get around it? I manually reset the server, either with a reset from iLO or with a power off / power on.

 

This issue is completely RANDOM. Sometimes it happens, sometimes not, and this is what is killing me. I have gone days of reboots without the issue, and at other times I hit it every day...

 

Should I test with ds3617xs? I searched the forum but didn’t see anyone else reporting this issue...

 

Help please.


check your network cables and switch connections

also check your router/dhcp server if you have set a static IP on the NAS or an address reservation on the router - only set one (I would recommend a router reservation)

if you run a network sniffer you should be able to see the DHCP request/response traffic

Are you using the native nic MAC address or a different one? - use the native MAC.

It sounds to me like there is some sort of problem with the IP address being allocated, and the above might help you trace it
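For anyone following the sniffer suggestion, here is a minimal sketch that decodes the fields of a raw DHCP payload (the UDP payload on ports 67/68) that matter when tracing address allocation: the message type (DISCOVER, OFFER, REQUEST, ACK), the client MAC and the offered address. It follows the standard BOOTP/DHCP layout from RFC 2131. The function name `parse_dhcp` and the dictionary keys are illustrative choices, not part of any tool mentioned in this thread, and capturing the packets is left to whatever sniffer is at hand.

```python
import struct

# Fixed BOOTP header is 236 bytes, followed by this 4-byte magic cookie.
DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"

# DHCP option 53 values (RFC 2132).
MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def parse_dhcp(payload: bytes) -> dict:
    """Decode a raw DHCP payload into a small summary dictionary."""
    if len(payload) < 240 or payload[236:240] != DHCP_MAGIC_COOKIE:
        raise ValueError("not a DHCP payload")

    hlen = min(payload[2], 16)                    # hardware address length
    xid = struct.unpack("!I", payload[4:8])[0]    # transaction ID
    your_ip = ".".join(str(b) for b in payload[16:20])          # yiaddr
    client_mac = ":".join(f"{b:02x}" for b in payload[28:28 + hlen])

    # Walk the options (code, length, value) looking for option 53.
    message_type = None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 255:        # end option
            break
        if code == 0:          # pad option, single byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break              # truncated option, stop parsing
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            message_type = MESSAGE_TYPES.get(value[0], f"UNKNOWN({value[0]})")
        i += 2 + length

    return {
        "message_type": message_type,
        "xid": xid,
        "client_mac": client_mac,
        "your_ip": your_ip,
    }
```

Comparing the `client_mac` seen in the DISCOVER/REQUEST messages with the MAC set in the loader is one way to confirm that the NAS is asking for an address at all on a boot that appears stuck.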


Hello @sbv3000,

 

Thank you for your response. It’s funny: I am a network engineer, so I never questioned the network in this issue... especially because I had the same setup before with “openmediavault” and it was working perfectly. I am using a static IP address on the server, outside the scope of the DHCP server. I’m going to try a reservation instead. That’s easy, and we will see...

 

I am using the native NIC MAC address. I set that when creating the pendrive.

 

The DHCP server is a Fortinet firewall, so I should be able to debug and sniff the DHCP DORA process... I can try this as well...

 

Are you sure there is no possible “loader” error or something?


@sbv3000,

 

Same behavior with a reservation on the DHCP server. I had some hope, as the first 4 or 5 reboots were OK, but then it failed... It sticks at “booting kernel” and nothing happens...

I am really not sure this is a network issue... any other ideas?

 

@Dfds, I already tried this, same issue...


Not sure if your model has the feature, but can you access the ilo ssh/serial port and watch the 'full boot' process? You might see a kernel panic/driver issue or other error.

Another test could be to use a spare hdd and create a clean install of DSM (disconnect your raid drives of course :) ) and see if you can replicate.

  • 1 month later...
  • 1 month later...

Same here with Jun 1.02b ds3615xs. I don't have a screen plugged in, so I cannot see the status. However, I have switched to DHCP (not static or allocated on the router) and so far it seems to be OK.

 

Any thoughts on why changing to DS3617XS would fix this, and where you got the idea?

 

 

  • 2 years later...

Hey guys,

 

I am bringing back my own 2018 topic again.

I know I told you I fixed this issue by updating to DS3617xs, but the fact is that once I had to move to DSM 6.2.1, I purchased an Intel network card because of the compatibility issues, disabled the integrated NICs, and had to go with DS3615xs, as DS3617xs was not working in that setup.

I decided to keep the server up and running 24/7 to avoid any "not booting" issue. But now I want to start keeping the server off for some time and power it on only on demand. I am again facing this issue where the server sometimes does not boot and I am forced to reset it.

It's painful....

 

Just wondering if anyone has faced this before and found a workaround?


i'd suggest to update to 6.2.3 and check for crashed drivers in dmesg

with 6.2.1 and the original extra.lzma i'd expect some drivers to crash on boot (that's the reason your onboard broadcom failed) and that can result in restart and shutdown not working

when using 6.2.3 the old original extra.lzma will work properly again (synology reverted the changes from 6.2.1/6.2.2) and your onboard nic should be working again
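As a rough illustration of the "check for crashed drivers in dmesg" step, the sketch below scans saved dmesg output for lines that commonly accompany a driver crash. The patterns are generic Linux kernel log fragments (call traces, oopses, panics), not a DSM-specific or exhaustive list, and the function name `find_suspicious_lines` is made up for this example.

```python
import re

# Generic kernel log fragments that often show up when a driver crashes.
# This list is an assumption for illustration, not an official catalogue.
SUSPICIOUS = re.compile(
    r"call trace|\bBUG:|\boops\b|kernel panic|general protection fault|segfault",
    re.IGNORECASE,
)


def find_suspicious_lines(dmesg_text: str) -> list:
    """Return (line_number, line) pairs for lines matching a crash pattern."""
    hits = []
    for number, line in enumerate(dmesg_text.splitlines(), start=1):
        if SUSPICIOUS.search(line):
            hits.append((number, line.strip()))
    return hits
```

One way to use it, assuming SSH access to the box: save the log with `dmesg > dmesg.txt`, copy the file to a PC, then call `find_suspicious_lines(open("dmesg.txt").read())` and look at the module names in the lines around each hit.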

 

if disk hibernation does not work with 6.2.3 then its because of log activity, fix it with this

https://xpenology.com/forum/topic/32861-hdd-fail-to-hibernate-after-upgrade-from-622-to-623/?do=findComment&comment=179577

 

 

1 hour ago, IG-88 said:

i'd suggest to update to 6.2.3 and check for crashed drivers in dmesg

with 6.2.1 and the original extra.lzma i'd expect some drivers to crash on boot (that's the reason your onboard broadcom failed) and that can result in restart and shutdown not working

when using 6.2.3 the old original extra.lzma will work properly again (synology reverted the changes from 6.2.1/6.2.2) and your onboard nic should be working again

 

if disk hibernation does not work with 6.2.3 then its because of log activity, fix it with this

https://xpenology.com/forum/topic/32861-hdd-fail-to-hibernate-after-upgrade-from-622-to-623/?do=findComment&comment=179577

 

 

 

Thanks for your response.

I might not have been clear enough in my last post. I have been applying the updates since 6.1 and I am now running the latest version of 6.2.3.

I have been suffering from this issue for as long as I can remember. The thing is... I am not sure about the drivers. If it were related only to the NIC driver, the system would boot up anyway, right? But when I am forced to reset because the system is not booting, after DSM is running again, I do not get any message saying that it was shut down unexpectedly.

 

What are your thoughts on this? 

10 minutes ago, siulman said:

But when I am forced to reset because the system is not booting, after DSM is running again,

no, if it's not booting and then boots after pressing the reset button then i have no idea besides a bios update

if it's a microserver gen8 then a lot of people have this hardware and don't have such problems, so you might search the forum for the bios version they are using

 

i do remember that there was a problem with a older microserver that lost bios settings after shutdown

look here

https://xpenology.com/forum/topic/1909-annoying-bios-reset-issue-is-really-bugging-me/

On 11/23/2020 at 2:42 PM, IG-88 said:

no, if it's not booting and then boots after pressing the reset button then i have no idea besides a bios update

if it's a microserver gen8 then a lot of people have this hardware and don't have such problems, so you might search the forum for the bios version they are using

 

i do remember that there was a problem with a older microserver that lost bios settings after shutdown

look here

https://xpenology.com/forum/topic/1909-annoying-bios-reset-issue-is-really-bugging-me/

 

I've tried updating the firmware, which I think also updates the BIOS?

I don't know what to do anymore... it is very frustrating. It can boot fine three times and then get stuck on the fourth. It is random.

 

tempsnip.png

6 minutes ago, flyride said:

I just have to point out that it's the same hardware and network and you've experienced this problem across three different DSM versions.

 

This is not a normal behavior.  Something in your local environment is causing it.

 

 

Well... when it comes to the local environment (LAN & network) I am pretty comfortable, as it's my day-to-day work, so if you have any ideas please share... because I don't think it's the network.

I have a DHCP reservation to make sure DHCP provides the same IP to my MAC address (before that I used a static IP, but changed it just to see if anything would improve...). And also, how is it possible that the issue is random?

When it does not boot, it's not related to the NIC, as the system is not booting at all. If it were, it wouldn't respond to ping of course, but I would be receiving a notification from DSM saying it was shut down unexpectedly due to my "reset" to make it work, because it would have gone live!

 

So, let's brainstorm together... what can make the system boot sometimes and not other times...?

 

Before 6.1 I found a workaround, which was changing to 3617xs. I didn't have this issue on that one. But when upgrading to 6.2.1 and onwards, I realized 3617xs was not working anymore and went back to 3615xs. Unfortunately, 3615 has this issue for me...

I've of course tested different pendrives, but the result is the same...

 

If there is any information I can provide to help you guide me, let me know... like the BIOS config or anything else...

 

Thanks in advance!

 

 

43 minutes ago, siulman said:

When it does not boot, it's not related to the NIC, as the system is not booting at all. If it were, it wouldn't respond to ping of course, but I would be receiving a notification from DSM saying it was shut down unexpectedly due to my "reset" to make it work, because it would have gone live!

 

I don't really understand this statement, and don't think your logic is correct here.  You posted a boot screen when you cited that it was not booting at all.  That is all that is ever displayed from the loader, so we know the boot loader has executed.  DSM does not post a notification to the VGA screen during a reset, and if the network wasn't working there would be no notification via any other means.  So you probably don't really know the state of things when you do not have a network connection.

 

The only way to properly troubleshoot this is via serial, and you will be able to monitor DSM's state without any network connection.  How to set up and use a serial connection is well chronicled on this forum....

 

Also if you are trying to minimize all variability, you should reconfigure back to static IP.

28 minutes ago, flyride said:

 

I don't really understand this statement, and don't think your logic is correct here.  You posted a boot screen when you cited that it was not booting at all.  That is all that is ever displayed from the loader, so we know the boot loader has executed.  DSM does not post a notification to the VGA screen during a reset, and if the network wasn't working there would be no notification via any other means.  So you probably don't really know the state of things when you do not have a network connection.

 

The only way to properly troubleshoot this is via serial, and you will be able to monitor DSM's state without any network connection.  How to set up and use a serial connection is well chronicled on this forum....

 

Also if you are trying to minimize all variability, you should reconfigure back to static IP.

 

 

All right, let me explain again.

When your DSM is up and running and you force a hard reset from the physical button (or the logical one from the ILO interface), the server reboots. Once DSM is up and running again you get a message saying that your system was unexpectedly stopped right?

Ok, so I was reacting to a previous statement that said it might be the NIC driver failing. If that were the case, and DSM did not respond (no ping, no web access), then the system would still be up and running, right? I just wouldn’t have network connectivity… But after forcing a reset through the button, the next time it starts I should get the message saying that it was not properly stopped. Please correct me if I am wrong.

That being said, when you say troubleshooting via serial you mean with a VGA screen connected to it? I have ILO access. Am I not supposed to see the same through it? I will try to look for this procedure in this forum…

Just now, siulman said:

When your DSM is up and running and you force a hard reset from the physical button (or the logical one from the ILO interface), the server reboots. Once DSM is up and running again you get a message saying that your system was unexpectedly stopped right?

 

Not guaranteed.  The system may be booting but not in a state where it can generate that log.

 

1 minute ago, siulman said:

That being said, when you say troubleshooting via serial you mean with a VGA screen connected to it? I have ILO access. Am I not supposed to see the same through it? I will try to look for this procedure in this forum…

 

No, I mean connecting to a COM port.  The DSM console (all the boot messages and direct Linux access) is only accessible via serial, not the hardware console.

17 minutes ago, flyride said:

 

Not guaranteed.  The system may be booting but not in a state where it can generate that log.

 

 

No, I mean connecting to a COM port.  The DSM console (all the boot messages and direct Linux access) is only accessible via serial, not the hardware console.

 

I am afraid I won't be able to get a console cable right now... I would need a USB/console adapter plus the console cable.

Any other ideas in the meantime?

I've tested putting the static IP back but it's the same...


The only thing I can think to do is have you set a static IP on a PC and then remove all other devices (including firewall) except the NAS and the PC from the network switch.  If it worked reliably then at least you would know it was related to one of the removed devices.

 

If it continues to fail, there are still many failure possibilities - DSM, NIC, server hardware itself.  I'll also point out that you switched NICs during this process and the problem persisted, so that doesn't suggest intermittent NIC to me.  But on 6.2.3 you should be able to use your onboard NIC okay so you could swap the Intel CT for the onboard as it seems you want to try most everything you can.
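To judge whether the isolated setup "works reliably", it helps to record each power-on attempt rather than rely on memory. The sketch below polls a TCP port on the NAS after power-on and reports how long it took to answer, or gives up after a deadline. It assumes DSM's default HTTP port 5000; the address in the usage note and the function names are placeholders for this example. Keep in mind that a timeout only shows DSM was unreachable over the network, not that it failed to boot.

```python
import socket
import time
from typing import Optional


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def wait_for_port(host: str, port: int,
                  deadline_s: float = 300.0,
                  interval_s: float = 5.0) -> Optional[float]:
    """Poll until the port answers; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while True:
        if port_is_open(host, port):
            return time.monotonic() - start
        if time.monotonic() - start >= deadline_s:
            return None
        time.sleep(interval_s)
```

For example, running `wait_for_port("192.168.1.50", 5000)` from the directly connected PC right after each power-on (with the NAS address substituted) and noting the result gives a simple success/failure log across many boots.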

2 minutes ago, flyride said:

The only thing I can think to do is have you set a static IP on a PC and then remove all other devices (including firewall) except the NAS and the PC from the network switch.  If it worked reliably then at least you would know it was related to one of the removed devices.

 

If it continues to fail, there are still many failure possibilities - DSM, NIC, server hardware itself.  I'll also point out that you switched NICs during this process and the problem persisted, so that doesn't suggest intermittent network to me.  But on 6.2.3 you should be able to use your onboard NIC okay so you could swap the Intel CT for the onboard as it seems you want to try most everything you can.

 

I know what you mean, and I'll do that with a laptop directly connected in the same subnet, but I don't believe it will change anything.

Concerning the motherboard NIC, can I go for it without changing anything on the pendrive itself?

Am I able to move to 3617xs now too? It was the only setup that "fixed" this issue for me.
 

Thanks.

1 minute ago, siulman said:

I know what you mean and I'll do that with a laptop directly connected in the same subnet but I don't believe it will change anything.

Concerning the motherboard NIC, I can go for it without changing anything on the pendrive itself?

Am I able to move to 3617xs now too? It was the only setup that "fixed" this issue for me.

 

You can activate the mobo NIC without adjusting the loader.

 

I'm not sure what your last question means.  You always were able to install whatever DSM dialect you wanted that was supported by your hardware.  "Going" to 3617xs (presumably from 3615xs) involves a reinstallation of DSM, but I think you know this. You'll need to burn a new, clean loader in order to do it though.  However, if I may borrow your quote, "I don't believe it will change anything."

16 hours ago, flyride said:

 

You can activate the mobo NIC without adjusting the loader.

 

I'm not sure what your last question means.  You always were able to install whatever DSM dialect you wanted that was supported by your hardware.  "Going" to 3617xs (presumably from 3615xs) involves a reinstallation of DSM, but I think you know this. You'll need to burn a new, clean loader in order to do it though.  However, if I may borrow your quote, "I don't believe it will change anything."

 

Hello,

 

I have some updates. Here is what I've tested:

 

1) Removed the Intel NIC from the PCI slot and enabled the motherboard NIC. You were right, the Intel card is not required anymore. The system boots, but with the same random boot failures.

2) Decided to test a microSD instead of a pendrive and went for DS3617xs (because I knew this solved my issue in the past, but for some reason I wrongly thought it was no longer compatible with my server). FYI, it does not involve a DSM reinstallation, as you can just choose the "Migrate" option. This works for 3615xs <--> 3617xs in either direction, and it is very quick and smooth. This did the job. After multiple reboots and power off / power on cycles, I can tell you this does not happen on 3617xs. I don’t know why, but it has solved it again.

That being said, and because I am using a microSD, it now shows up as an “external drive” in my system… I’ve never seen that before with a pendrive. It has also created two shared folders that can’t be deleted, as they are said to be in use by the system.

I can even browse the SD card… (grub, etc...)

Is that normal?

 

But I have a big issue: when migrating DSM I chose the option of “downloading the latest version”, which used to work. I already did this migration in the past. It installed 6.2.3-25426. When trying to update to 6.2.3-25426 Update 2, it says “Failed to install file. The file is probably corrupted”.

I will try to investigate this issue… but at least I am happy that 3617xs always boots reliably.

 

see snapshots below.

 

Capture.jpg

 

 

 

Capture2.jpg

 

 

