
DS918+ 1.04b - Strange issue with "disappearing" hard drives on shutdown.


NooL


15 minutes ago, flyride said:

Really, I don't think adding a single drive will get you too much. If we take the example of your SHR, which is 7 data spindles for the first 28TB and then 3 for the remaining space, and expand it with another disk, 50% of the new data blocks are still on the 8TB-only part of the array. So you would get 4TB of theoretically improved performance, but not for the last 4TB, and not when accessing any of the files already on the 8TB-only part of the array.

 

Hmm, I must have misunderstood you then. My logic was based on you saying that it was most likely due to more than 28TB of my ~36TB volume being used; adding an extra 8TB would decrease the overall space used, or so I understood your earlier post. :)

If it were the CPU limiting it, wouldn't I expect to see higher utilization or I/O wait numbers during transfers?


[Screenshot: CPU utilization during a transfer]

 

 


3 hours ago, IG-88 said:

there are some exceptions on boards like Apollo Lake and Gemini Lake: usually 2 real onboard (SoC) ports, and when 2 more are added with an ASM1061 it's one PCIe 2.0 lane, limiting those 2 ports to ~500MB/s combined - not usable for SSDs

Technically true for some low-end boards, but OP has a B365 board with six chipset-based SATA ports.


8 hours ago, NooL said:

 

Hmm, I must have misunderstood you then. My logic was based on you saying that it was most likely due to more than 28TB of my ~36TB volume being used; adding an extra 8TB would decrease the overall space used, or so I understood your earlier post. :)

I think it's worth explaining how this works a little more.

 

You currently have an SHR2 array of 4x 4TB drives and 5x 8TB drives.

 

There is a 4TB array across all 9 drives.  Subtract two for parity (RAID6) and you have 7 data members with 4TB each.  I/O is split across all those spindles so your performance generally should be at maximum unless there is something wrong with one of the members.  7 x 4TB = 28TB so once that 28TB is filled up, there is no more space on the 7-spindle data array.

 

There is a second 4TB array across the five 8TB drives, because SHR is trying to maximize your storage.  Subtract two for RAID6 parity and you have 3 data members with 4TB each.  Any I/O that goes here (because the 28TB is filled up) is limited to only 3 data spindles.  That won't be as fast as the other part, both because there are fewer spindles and because, when the first array is busy, the two arrays compete with each other for the 8TB drives.

 

If you add an 8TB drive, both arrays will add a member and gain 4TB more space each.  If you add new files, eventually the 4TB of space in the 10-drive array will get used but you don't have much control over where the filesystem puts it, so it might get spread across both arrays since there are already files there.  If you modify a file in the second array it will probably stay there. Remember this is all transparent to you - all you see is a unified filesystem.

 

So your benefits will be sporadic and unpredictable, at a minimum.  You can't really fix the problem completely until you get rid of the SHR.  And I'm not completely convinced it matters that much.
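If you want to see this layout from the shell, here is a hedged sketch using the standard mdadm/LVM tools (the md numbers are my guess and may differ on your system; on an SHR system the volume is LVM stitched on top of these arrays):

cat /proc/mdstat                  # every md array and its member partitions
mdadm --detail /dev/md2           # e.g. the RAID6 built from the 4TB slice of all 9 drives
mdadm --detail /dev/md3           # e.g. the RAID6 built from the remaining 4TB of the 8TB drives
lvdisplay                         # the LVM volume that joins the arrays into one filesystem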

 

Quote

If it were the CPU limiting it, wouldn't I expect to see higher utilization or I/O wait numbers during transfers?

[Screenshot: CPU utilization during a transfer]

 

This doesn't tell us very much.  You need a lot of monitoring over time to get a sense of what is going on (iostat -m 2 is a good command to see it in real time).  Let me just say that I have no trouble driving my 4C/8T Skylake 3.5GHz CPU to 40% utilization on 6xRAID5 writes when running at 10GbE wire speed.  At 1GbE it doesn't matter.  But with a total of 14 array members (9 across all disks, 5 extra across the 8TB disks) and the additional overhead of RAID6, I think it is entirely possible for your 2C/4T CPU to hit thread availability/latency limits without overall utilization climbing very high.
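For example, something like this over SSH during a big transfer gives a much better picture (hedged: exact columns depend on the sysstat version DSM ships):

iostat -m 2                       # CPU %iowait plus per-device MB_read/s and MB_wrtn/s every 2 seconds
iostat -x 2                       # extended view: await and %util will expose a single slow member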

 

One thing we haven't tried is bumping up your stripe cache, which actually might be a pretty big benefit given your system configuration. But it is a hit to RAM. How much do you have installed?

 

Post the results of the following:

ls /sys/block/md*/md/stripe_cache_size

cat /sys/block/md*/md/stripe_cache_size
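And if it comes back low, here is a hedged sketch of how it can be bumped up for testing (run as root; the value is in 4KiB pages per member disk, so 32768 costs roughly 128MiB of RAM per member, and it does not persist across a reboot):

for f in /sys/block/md*/md/stripe_cache_size; do
    echo 32768 > "$f"             # applies to every RAID5/6 array; write the old value back to revert
done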

Edited by flyride

@flyride Ahh, that makes sense in regards to the SHR2.

 

My stripe cache was at 4096; I tried raising it to 32768, but without any noticeable result except a bit higher RAM usage (I have 16GB).

 

I'm starting to think I have one or two shady disks. I keep seeing higher utilization on 2 disks (drive B and drive K) along with noticeably lower reads/s (this is during a btrfs scrub).

 

[Screenshot: per-disk utilization during the btrfs scrub]

 

 

Looks kinda off, doesn't it? It seems to be persistent on those 2 drives in high-activity scenarios.
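For what it's worth, two hedged checks from the shell that might help pin it down (sdb/sdk is only my guess at how drives B and K map to Linux device names, and sdc just stands in for any known-good member):

dmesg | grep -iE "sd[bk]|ata.*(error|reset)"    # any link resets or read errors logged for the suspects?
iostat -x sdb sdk sdc 5                         # compare await/%util of the suspects against a known-good disk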

 

I've run extended SMART diagnostics on all the drives and all came back healthy, but drive B (Drive 2) has a Disk Shift value which I find odd, although Google says not to worry necessarily.

[Screenshot: SMART attributes for drive B (Drive 2)]

 

Disk K (Drive 11) has a Raw_Read_Error_Rate of 1 but no reallocated sectors.

 

[Screenshot: SMART attributes for disk K (Drive 11)]
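For completeness, this is roughly how the raw attributes can be pulled over SSH (again, sdb/sdk are only my guess at how drive B and disk K map to device names):

smartctl -A /dev/sdb              # attribute table for drive B: Disk Shift, Reallocated_Sector_Ct, etc.
smartctl -A /dev/sdk              # attribute table for disk K: Raw_Read_Error_Rate, etc.
smartctl -l selftest /dev/sdb     # log of the extended self-tests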

 

Whatcha reckon? Cause for concern or am I reading too much into it?

 

 

Edited by NooL

Well the performance is cause enough for you to be concerned, so if you can come up with a spare drive or two, I would swap them in and see if your performance improves.  One slow drive will drag down an array for sure.

 

Another thing that would be interesting to try is to swap drives into those slots to see if the performance markers follow the drives.  As long as the arrays are healthy when you shut down, there will be no issue with the system recognizing and accommodating them in their new positions.


  • 2 weeks later...

There hasn't been much of an update from me, as I haven't gotten around to doing much yet. I have a new Inter-Tech 3U case coming soon, along with a new HBA card and a new PSU.

 

I haven't exactly laid out a plan yet, but I think I'll go ahead and split up the volume so that I have one volume with the WD Reds and one with the Toshibas; this way I can run pure RAID5 or RAID6. The only problem is that I have to temporarily find a place to dump 30TB of data, so I'll probably purchase some external drives as a temporary location. I'll probably also move as many drives as I can onto the onboard SATA ports.

 

The new hardware: 

https://www.inter-tech.de/en/products/ipc/storage-cases/3u-3416

 

https://www.newegg.com/seasonic-ssn-7522g-2-x-750w/p/N82E16817151135

 

 

On a positive note, though, I figured out the transcoding issue - I had enabled forced subtitles thinking it would use my external SRT subs, but instead it uses CPU power to hardcode them into the movie on the fly. With this disabled, everything seems to be working perfectly :)

 

 


  • 8 months later...

I just had the same mystery disk missing problem. A Google search landed me here.

 

Here is my situation.

 

I have 8x 2TB SAS drives + 2x SATA drives. Nine of them form an SHR2 pool and one is used as a hot spare. The HBA is an LSI 9207-8i (FW 20.00.07.00), running on ESXi with PCIe passthrough. Between the HBA and the HDDs I used an IBM SAS expander so the HBA can accept more than 8 drives.

 

After a power loss, 2x SAS drives went missing. I thought it might be disk failure. Since I had one hot spare, I added another 2TB SATA drive to repair the pool.

 

After a few days, another 2x SAS drives went missing. Fortunately, I have SHR2, so I didn't have data loss and the pool was still usable.

 

After several reboots, all 4x SAS drives came back with an error message like "cannot access system partition". Then another reboot made them disappear again.

 

Then I removed the IBM SAS expander from the system and added an LSI 9211-8i (FW 20.00.07.00). I connected the 8x SAS drives to the LSI 9207-8i and the 3x SATA drives to the LSI 9211-8i.

 

First, I set up PCIe passthrough of both HBAs. Same problem: those 4 missing drives came up and down randomly.

 

Then I tried RDM without PCIe passthrough (letting ESXi manage the HDDs and passing through each individual HDD). With RDM through a virtual SCSI controller (virtual LSI SAS), same issue. But as soon as I changed the setting to present the RDMs through a virtual SATA controller, the problem was resolved. All 4 missing drives are back, and they run stable no matter how many times I reboot.

 

If I remember correctly, @IG-88 mentioned in his extension thread that the DS918+ image cannot handle mpt2sas very well. Both the official driver and @IG-88's compiled driver have some weird issues. So I'm guessing the problem may be caused by the mpt2sas driver in DSM.

 

For me, the solution is to let DSM think they are all SATA HDDs (disable passthrough of the entire HBA and use RDM instead, with a virtual SATA controller).
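For anyone who wants to replicate this, the RDM part is roughly the following from the ESXi shell (sketch only; the naa identifier and datastore path are placeholders, and each resulting *-rdm.vmdk then gets attached to the DSM VM as an existing disk on a virtual SATA controller):

ls -l /vmfs/devices/disks/        # find the naa.* identifier of each physical drive
vmkfstools -r /vmfs/devices/disks/naa.XXXXXXXXXXXX /vmfs/volumes/datastore1/dsm/sas1-rdm.vmdk    # virtual-mode RDM pointer (-z would create a physical-mode one)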

 

 

 

An additional FYI (maybe off-topic): I found that the lsi_msgpt2 driver in ESXi 6.7u3 has a performance issue. The disk access speed was good for only 5 seconds; after that I got ~20MB/s read/write. The solution is to disable lsi_msgpt2 and use mpt2sas (you need to download the 20.00.00.00 version from the VMware website).
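For reference, the module swap can be done from the ESXi shell roughly like this (install the downloaded mpt2sas VIB first, then reboot the host afterwards):

esxcli system module set --enabled=false --module=lsi_msgpt2    # stop ESXi from claiming the HBA with lsi_msgpt2
esxcli system module set --enabled=true --module=mpt2sas        # let the legacy mpt2sas driver take over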


5 hours ago, snailium said:

I just had the same mystery disk missing problem. [...]

For me, the solution is to let DSM think they are all SATA HDDs (disable passthrough of the entire HBA and use RDM instead, with a virtual SATA controller).

I agree, this sounds nearly identical to my situation.

 

One thing I am curious about: on the DS918+ loader, grub.cfg has the arguments:

syno_hdd_powerup_seq=1

HddHotplug=0

syno_hdd_detect=0

 

While on the 3615xs (not sure about the 3617), these arguments are different: syno_hdd_powerup_seq has the value 0, and syno_hdd_detect seems to be missing.

I haven't been able to find any information about what these arguments actually do.
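For comparison, the arguments each loader actually passes to the kernel can be checked on a running system with:

cat /proc/cmdline                 # shows the live boot arguments, including syno_hdd_powerup_seq and friends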

 

But I do find the situation a bit strange, and not very comforting :P - I still get a bit anxious whenever I have to turn off or reboot the server.

 

