flyride

NVMe optimization & baremetal to ESXi report


Just thought to share some of the performance I'm seeing after converting from baremetal to ESXi in order to use NVMe SSDs.

My hardware: SuperMicro X11SSH-F with E3-1230V6, 32GB RAM, Mellanox 10GbE NIC, 8-bay hotplug chassis, with 2x WD Red 2TB (sda/sdb) in RAID1 as /volume1 and 6x WD Red 4TB (sdc-sdh) in RAID10 as /volume2.

 

I run a lot of Docker apps installed on /volume1. This worked the 2TB Reds (which are not very fast) pretty hard, so I decided to replace them with SSDs. I ambitiously acquired NVMe drives (Intel P3500 2TB) and tried many tactics to get them working in DSM on the baremetal configuration. Ultimately, the only way was to virtualize them and present them as SCSI devices.

 

After converting to ESXi, /volume1 is on a pair of VMDKs (one on each NVMe drive) in the same RAID1 configuration. This was much faster, but I noticed that Docker generates a lot of DSM system writes, which span all the drives (since Synology replicates the system and swap partitions across every device). 32GB of RAM is enough to avoid swap activity, so swap wasn't a concern; to isolate the remaining DSM system I/O to the NVMe drives, I dropped the RAID10 disks out of the system partition array:

mdadm /dev/md0 -f /dev/sdc1 .../dev/sdh1

then

mdadm --grow -n 2 /dev/md0

then repair from DSM Storage Manager (which converts the "failed" drives to hotspares)
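
If you want to sanity-check the array before and after the Repair, a couple of quick reads are enough (just a sketch; md0 and the drive letters match my layout above):

# Members of each array, with failed drives marked (F)
cat /proc/mdstat
# Active member count and state of the system partition array
mdadm --detail /dev/md0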

 

After this, no Docker or DSM system I/O ever touches a spinning disk. Results: the DSM VM now boots in about 15 seconds. Docker used to take a minute or more to start and launch all the containers; now it takes about 5 seconds. Copying to/from the NVMe volume maxes out the 10GbE interface (1 gigaBYTE per second), and that isn't just the DSM system cache absorbing the burst; the NVMe disks can sustain the write rate indefinitely. This is some serious performance, and a system configuration only possible because of XPEnology!
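
If you want to check whether your own array can sustain that kind of rate, a quick local write test is enough (a sketch; the test file path is just an example, and your dd build needs oflag=direct support so the page cache is bypassed):

# Write 10 GiB directly to the NVMe-backed volume, then clean up
dd if=/dev/zero of=/volume1/ddtest.bin bs=1M count=10240 oflag=direct
rm /volume1/ddtest.bin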

 

Just to point out what is possible with Jun's boot loader: I was able to move DSM directly from baremetal to ESXi, without reinstalling, by passing the SATA controller and the 10GbE NIC through to the VM. I was also able to switch back and forth between USB boot (using the baremetal bootloader menu option) and the ESXi boot image (using the ESXi bootloader menu option). Without the correct VM settings this will result in hangs, crashes and corruption, but it can be done.

 

I did have to shrink /volume1 to move it onto the NVMe drives (because some space is lost by virtualizing them), but I was ultimately able to retain every aspect of the system configuration and many months of btrfs snapshots while converting from baremetal to ESXi. For anyone contemplating such a conversion, it helps to have a mirror copy to fall back on, because it took many iterations to learn the ideal ESXi configuration.

On 3/20/2018 at 9:01 AM, flyride said:

 


mdadm /dev/md0 -f /dev/sdc1 .../dev/sdh1

then


mdadm --grow -n 2 /dev/md0

then repair from DSM Storage Manager (which converts the "failed" drives to hotspares)

Interesting move, not just for VM use.

How does it look in Storage Manager? The drives are spares and also part of a data volume (volume2 in your case, I guess, since volume1 is on NVMe together with Docker and the packages).


Just to clarify the layout:

 

/dev/md0 is the system partition, /dev/sda.../dev/sdn. This is an n-disk RAID1

/dev/md1 is the swap partition, /dev/sda.../dev/sdn. This is an n-disk RAID1

/dev/md2 is /volume1 (on my system /dev/sda and /dev/sdb RAID1)

/dev/md3 is /volume2 (on my system /dev/sdc.../dev/sdh RAID10)

and so forth
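
To see this mapping on a live system, two quick reads are enough (a sketch):

# Partition list for every disk
cat /proc/partitions
# Which partitions belong to which md array
cat /proc/mdstat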

 

Manually failing RAID members in /dev/md0 will initially cause DSM to report the system partition as crashed, as long as the drives are present and hot-plugged. But the array is still functional and there is no risk to the system, unless of course you fail all the drives.

 

At that point, cat /proc/mdstat will show the failed drives with an (F) flag, but the number of members in the array will still be n.

 

mdadm --grow -n 2 /dev/md0 forces the system RAID down to the first two available devices (in my case, /dev/sda and /dev/sdb). Failed drives will continue to be flagged (F), but the member count will be two.

 

After this, DSM will still show a crashed system partition, but if you click Repair, the failed drives change to hotspares and the system partition state returns to Normal. If one of the /dev/md0 RAID members fails, DSM will instantly activate one of the hotspares, so you always have two copies of DSM.

 

FYI, this does NOT work with the swap partition. No harm in trying, but DSM will always report it as crashed, and repairing it will restore swap across all available devices.

 

It's probably worth mentioning that I do not use LVM (no SHR), but I think this works fine on systems with SHR arrays.

 


Yes, I got the mdadm part (-n 2), but if DSM sees the whole drive as a spare it might try to use the "whole" drive as a spare, like using a drive that's part of volume2 as a spare for volume1 (when one of the NVMe's or the VMware VMFS fails). But I guess if DSM tries to use the partition that is still part of md3 as a spare for md2, it will hit a wall and fail to do anything harmful to md3.

Nice solution for the "disks keep spinning" problem, I guess. I will keep that in mind when people complain about that or want SSD-only system partition(s).

Pity I don't have any SATA ports left; that method would justify two cheap small SSDs. Maybe at the end of the year, when I might replace my 4TB disks with 12TB disks, I will have SATA ports open.

 


This doesn't get any space back; it just avoids disk access. The system partition structure remains intact on all drives even after the adjustment, so if DSM "reclaims" a drive via hotspare activity or otherwise, it only operates within the preallocated system partition. There is no possibility of damage to any other existing RAID partition on the drives.

 

If the system or swap partitions are deleted on any disk, DSM will call the drive Not Initialized. Any activity that initializes a drive will recreate them, no exceptions.


Share this post


Link to post
Share on other sites

UPDATE: (tl;dr)

  1. Heretofore, XPEnology DSM under ESXi using virtual disks has been unable to retrieve SMART information from those disks. Disks connected to passthrough controllers work, however.

  2. NVMe SSDs are now verified to work with XPEnology using ESXi physical Raw Device Mapping (RDM). pRDM allows the guest to directly read/write to the device, while still virtualizing the controller.

  3. NVMe SSDs configured with pRDM are about 10% faster than as a VMDK, and the full capacity of the device is accessible.

  4. Configuring the pRDM on ESXi's native virtual SCSI controller, set specifically to the "LSI Logic SAS" dialect, causes DSM to generate the correct smartctl commands for SCSI drives. SMART temperature, life remaining, etc. are then properly displayed in DSM, /var/log/messages is not filled with spurious errors, and drive hibernation should now be possible.

Like many other posters, I was unhappy with ESXi filling the log files with SMART errors every few seconds, mostly because it made the logs very hard to use for anything else. Apparently this also prevents hibernation from working. I found posts about using ESXi and physical RDM to enable SMART functionality on other platforms, but this didn't seem to work with DSM, which apparently tries to query all drives as ATA devices. This is also confirmed by synodisk --read_temp /dev/sdn returning "-1".

 

I also didn't believe that pRDM would work with NVMe, but in hindsight I should have known better, as pRDM is frequently used to access SAN LUNs, and it is always presented as SCSI to the ESXi guests.  Here's how pRDM is configured for a local device: https://kb.vmware.com/s/article/1017530  If you try this, understand that pRDM presents the whole drive to the guest - you must have a separate datastore to store your virtual machine and the pointer files to the pRDM disk!  By comparison, a VMDK and the VM that uses it can coexist on one datastore.  The good news is that none of the disk capacity is lost to ESXi, like it is with a VMDK.
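
In short, the KB procedure boils down to creating a physical-compatibility RDM pointer file with vmkfstools and then attaching that pointer to the DSM VM as an existing disk. A rough sketch from the ESXi shell (the device identifier and paths below are examples, not literal values to copy):

# Find the local NVMe device identifier
ls /vmfs/devices/disks/ | grep -i nvme
# Create the pRDM pointer file on a separate datastore from the disk itself
vmkfstools -z /vmfs/devices/disks/t10.NVMe____INTEL_SSDPE2MX020T4_example /vmfs/volumes/datastore1/DSM/nvme0-prdm.vmdk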

 

Once configured as a pRDM, the NVMe drive showed up with its native naming and was accessible normally.  Now, the smartctl --device=sat,auto -a /dev/sda syntax worked fine!  Using smartctl --device=test, I found that the pRDM devices were being SMART-detected as SCSI, but as expected, DSM would not query them.
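
For reference, here are the two probes mentioned above as runnable commands (substitute whichever /dev/sdX is your pRDM disk):

# Show which device type smartctl detects for this disk
smartctl --device=test /dev/sda
# Query SMART, using SAT if a SAT layer is present and falling back to plain SCSI otherwise
smartctl --device=sat,auto -a /dev/sda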

 

NVMe device performance received about a 10% boost, which was unexpected based on VMware's documentation. Here are the results of the mirroring operation:

root@nas:/proc/sys/dev/raid# echo 1500000 >speed_limit_min
root@nas:/proc/sys/dev/raid# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
<snip>
md2 : active raid1 sdb3[2] sda3[1]
      1874226176 blocks super 1.2 [3/1] [__U]
      [==>..................]  recovery = 11.6% (217817280/1874226176) finish=20.8min speed=1238152K/sec
<snip>
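
(The echo above just raises the kernel's minimum resync speed so the rebuild isn't throttled. Once the resync finishes it can be put back to the stock default, which is normally 1000 KB/s; a sketch:)

echo 1000 > /proc/sys/dev/raid/speed_limit_min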

Once the pRDM drive had mirrored and been fully tested, I connected the other drive to my test VM to try a few device combinations. Creating a second ESXi SATA controller has never tested well for me, but I configured it anyway to see if I could get DSM to use SMART correctly. I tried every possible permutation, and the last one was the "LSI Logic SAS" dialect of the virtual SCSI controller... and it worked! DSM correctly identified the pRDM drive as a SCSI device, and both smartctl and synodisk worked!

 

root@testnas:/dev# smartctl -a /dev/sdb
smartctl 6.5 (build date Jan  2 2018) [x86_64-linux-3.10.102] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               NVMe
Product:              INTEL SSDPE2MX02
Revision:             01H0
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        Solid State Device
Logical Unit id:      error: bad structure
Serial number:        CVPD6114003E2P0TGN
Device type:          disk
Local Time is:        Wed Mar 28 19:00:29 2018 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     26 C
Drive Trip Temperature:        85 C
<snip>
root@testnas:/dev# synodisk --read_temp /dev/sdb
disk /dev/sdb temp is 26

And, in Storage Manager:

 

[screenshot: xp6.jpg]

 

Finally, /var/log/messages is now quiet. There is also a strong likelihood that drive hibernation is now possible, although I can't really test that with NVMe SSDs.
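
For anyone who wants to reproduce this, the key VM setting is simply a virtual SCSI controller of type "LSI Logic SAS" with the pRDM pointer disk attached to it. In .vmx terms it looks roughly like this (a sketch; the file name and slot number are examples):

scsi0.present = "TRUE"
scsi0.virtualDev = "lsisas1068"
scsi0:1.present = "TRUE"
scsi0:1.fileName = "nvme0-prdm.vmdk"

scsi0.virtualDev = "lsisas1068" is what the vSphere UI lists as "LSI Logic SAS"; that setting is what makes DSM issue SCSI-style SMART queries to the disk.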

