yale-xpo

New DSM ESXi Build Crashing When Idle - LSI 9211-8i

I'm out of ideas for how to debug this, but perhaps someone else knows what's going on. This is probably not an Xpenology-specific issue, but I'm sure others here have experience using the LSI 9211-8i and ESXi together.

 

I recently did a new build for my NAS with the following hardware:

 

  • Ryzen 5 1600
  • ASRock B350 Pro 4 Mobo
  • WD Green m.2 SATA SSD (used as VM datastore only)
  • LSI 9211-8i SAS Controller

 

I am running ESXi 6.5u1 on this setup, which was perfectly stable until I added the LSI 9211-8i SAS Controller. Since adding this card, the system crashes if left idle for more than 30 minutes.

 

Timeline of events:

 

  • Initially using a LSI 3081E-R controller in pass-thru mode (with test drives since I didn't realise >2TB drives were not supported when I picked up the card for $30.)
  • Online - Several weeks without issue until my LSI 9211-8i controller arrived.
  • Added LSI 9211-8i controller to DSM VM in pass-thru mode with 4x 4TB WD RED HDDs
  • Ran a disk parity check on the new SHR array. Took about 8 hours.
  • Offline - Less than an hour after parity check completed the system 'hard locks'.
  • Thought it might be a one-off incident. Rebooted the system, did another parity check to make sure everything was ok.
  • Offline - Again, less than an hour after parity check completed the system 'hard locks'.
  • Updated the firmware on the LSI 9211-8i to v20 in IT mode.
  • Thinking the problem was solved now that I was using ESXi-supported firmware, I started transferring 8TB of data. This took about 18 hours and the system stayed online the entire time without a hitch. Once the transfer finished I felt pretty confident the firmware upgrade had worked, so I spent a few hours running benchmarks and setting things up. The system was online for over 24 hours.
  • Offline - An hour or so after I stop 'doing stuff' with the system, it's hard locked again.
  • Thinking the issue might be IOMMU pass-thru support, I disabled pass-thru mode, installed the official driver for the LSI 9211-8i from the VMware site and mounted the disks into the DSM VM using ESXi Raw Disk Mappings.
  • Offline - An hour or so after I stop 'doing stuff' with the system, it's hard locked again.
  • Realised it's only crashing when the system is 'idle'. Looked into power saving modes and disabled "C-State" and "Cool 'n' Quiet" in the host BIOS. Also disabled all power saving features in ESXi.
  • Offline - An hour or so after I stop 'doing stuff' with the system, it's hard locked again.
  • This morning I hooked up a screen to the host to see if there was a PSOD or any message before it locks. There was not, just a black screen. System not responsive to a short press of the power button or to the keyboard.
  • Offline - less than 40 minutes after sitting idle from boot, it's hard locked again.

 

In addition to the steps above, I have looked for a reason for the lock-ups in the ESXi logs, but there is seemingly nothing: one minute it's working fine, then nothing until it's booting up again after I power cycle the system. There are no core dumps, and each time nothing was displayed on the host's screen.
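
Since the logs just stop with no error, one rough way to pin down when each lock-up happens is a heartbeat file written from a guest VM (or any machine running on the affected hardware). The sketch below is only an illustration, assuming Python 3 is available there; the file name `heartbeat.log`, the 10-second interval and the function names are all made up for the example and are not part of ESXi or DSM.

```python
import os
import time
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical location; it should live on storage that survives a power cycle.
HEARTBEAT_FILE = Path("heartbeat.log")


def write_heartbeat(path: Path, now: datetime) -> None:
    """Append one ISO-8601 timestamp and push it to disk before returning."""
    with path.open("a") as f:
        f.write(now.isoformat() + "\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to flush, so a hard lock loses as little as possible


def largest_gap(path: Path) -> Optional[Tuple[datetime, datetime, timedelta]]:
    """Return (last_seen, next_seen, gap) for the longest silence in the log."""
    stamps = [
        datetime.fromisoformat(line.strip())
        for line in path.read_text().splitlines()
        if line.strip()
    ]
    if len(stamps) < 2:
        return None
    before, after = max(zip(stamps, stamps[1:]), key=lambda pair: pair[1] - pair[0])
    return before, after, after - before


def run_heartbeat(path: Path, interval_seconds: float, beats: int) -> None:
    """Write `beats` heartbeats, sleeping `interval_seconds` between them."""
    for _ in range(beats):
        write_heartbeat(path, datetime.now())
        time.sleep(interval_seconds)
```

Calling something like `run_heartbeat(HEARTBEAT_FILE, 10, 100000)` and leaving the box idle, then calling `largest_gap(HEARTBEAT_FILE)` after the next power cycle, should show the last timestamp written before the lock and the first one after reboot. That gives the time of the hang to within one interval, which can then be compared against how long the system had been idle.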

 

It just seems odd the system would crash only when sitting idle - it would make more sense if it was crashing under load.

 

Now these issues only started to occur once I added the LSI 9211-8i to the mix, the LSI 3081E-R did not cause these same issues.

 

Do I just have a dodgy card? I don't want to buy another LSI 9211-8i if I'm going to have the same issues.

Is there another card I should get instead?

Are there any other settings I should try to make this system stable?

Edited by yale-xpo


Could you set up a bare metal rig to take ESXi out of the equation? It's a lot of work and you will need spare drives etc., but at least you eliminate one 'layer' of technology.


Seems like it's actually just a Ryzen + Linux issue. I have the same issue running Ubuntu Server on bare metal (tested up to the 4.14-rc1 kernel).


Looking at when the system locks, it seems to be when 'idle' for a time.

I've not used the LSI card, but are there BIOS settings for disk spin-down or hibernation in there that might 'conflict' with DSM or other OS power saving? Does the mobo have any BIOS settings for power save/HDD hibernation to adjust? Maybe turn off all 'green/power saving' features and see what happens.


Thanks for your suggestions. I tried removing the LSI card and the issue still persisted, so I think in the end it had nothing to do with it. I don't know why it didn't show up until recently; I may have kept some other VMs powered on, which kept the CPU active enough not to crash.

 

I'm going to try and return the Ryzen CPU and motherboard under warranty (should get a full refund due to my state's consumer protections) and replace them with these Intel parts:

  • Intel Core i7-7700 3.6GHz Quad-Core Processor
  • ASRock - H270M Pro4 Motherboard
Edited by yale-xpo
