System Partition Failed on WD RED



Hi everybody.
I've been looking everywhere on the forum but didn't find anything similar to the issue I'm facing.
I recently built a new unit with an ASRock J4005B-ITX, 8 GB RAM, a Dell PERC H200 (flashed to IT mode, firmware 20.00.07.00) and 6 x 2 TB drives configured as RAID 5 with a hot spare:
2 x ST2000DL003
2 x WD20EFAX
2 x WD20EFRX
Jun's Loader 1.04b and DSM 6.2.3-25426 U3

Installation worked just fine; within a few minutes I was able to find the new NAS over the network and perform the basic installation steps.

The issue arises once I create the storage pool.

As soon as the parity consistency check finishes I get the "system partition failed" warning on all the WD drives.

I've tried everything I could think of, including wiping the partitions on all the drives and reinstalling the system from scratch, but no luck: I create the storage pool and get the failed system partition message as soon as the parity consistency check ends.

All the drives themselves are showing as healthy.

I came across a couple of articles on the web stating that WD has been using SMR technology on the WD RED series as well, and that Synology has recently flagged these drives as not compatible with their systems.
I would like to know if there's anything else I might try, whether it's worth losing more time trying to find a fix or a workaround, or whether I should simply not bother and keep running the system the way it is (though I'm not keen on that: the thought of having 4 drives with a broken system partition is more than an itch...).

The other solution is returning the WD drives and getting some Seagates instead, to get rid of this issue.

Any comment or suggestion would be highly appreciated.
 

drives.gif

system.gif


Your WD20EFAX drives are SMR; they should work, but will perform substantially worse than the other drives.

 

The System Partition error is simply stating that the OS RAID1 is not consistent across all the drives.

 

It's odd that it is happening, but have you tried repairing the system partition from the Storage Manager?  There should be a link on the Overview page to correct it.

 



Hi.
thanks for the reply.

I already tried the "Repair" option from Storage Manager, but I keep getting the "system partition failed" right after the consistency check.
I know the EFAX are SMR and already flagged as not compatible... these were two drives I already had in an old QNAP NAS and just wanted to reuse with the new hardware.

What about the EFRX? Those two are brand new drives and listed as compatible with the system.


Run an SSH session and post the results of cat /proc/mdstat.
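For anyone following along, here's a sketch of how to read that output. The mdstat excerpt below is made-up sample data for illustration, not taken from this machine:

```shell
# Sample /proc/mdstat excerpt (illustrative only). On DSM, md0 is the
# system partition: a RAID1 mirrored across the first partition of
# every drive, while md2 holds the data volume.
mdstat='md2 : active raid5 sda3[0] sdb3[1] sdc3[2] sdd3[3] sde3[4] [5/5] [UUUUU]
md0 : active raid1 sda1[0] sdb1[1] sdc1[2] sdd1[3] [6/4] [UUUU__]'

# A healthy array reports [n/n] and all U's; each underscore marks a
# member that has dropped out of the mirror.
echo "$mdstat" | grep -o '\[[0-9]*/[0-9]*\] \[[U_]*\]'
```

So an md0 line ending in something like `[6/4] [UUUU__]` would mean two of the six system partitions have fallen out of the mirror, which is exactly what the DSM warning reports.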

 

All of your drives should physically work. The EFAX units are going to be slower, but that is not the cause of your System Partition error - something else is doing that.


Here's the result of the cat command.
As far as I can see, the system partition is only mounted on 2 drives.
Anyway, the parity consistency check finished during the night and again I got the "system partition failed" after it.
I hit the "Repair" option, restarted the NAS, and everything came back to "normal".
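For what it's worth, the Storage Manager "Repair" is, roughly speaking, re-adding the dropped system partitions to the md0 mirror. Here's a hedged sketch of that idea on sample data (device names are hypothetical, and DSM's exact repair logic isn't documented here - don't run blind mdadm commands on a live array):

```shell
# Sketch only: given the md0 members reported by /proc/mdstat versus the
# system partitions that should exist (one per installed drive), print
# the mdadm command that would re-add each missing member.
present='sda1 sdb1'                     # members currently listed for md0
all='sda1 sdb1 sdc1 sdd1 sde1 sdf1'     # one system partition per drive

for p in $all; do
  case " $present " in
    *" $p "*) ;;                                        # already in md0
    *) echo "mdadm --manage /dev/md0 --add /dev/$p" ;;  # would be re-added
  esac
done
```

The interesting question in this thread is why those members keep dropping out again after being re-added.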
 

cat.png

drives1.gif

 

Update.

I issued the cat command after repairing and rebooting the NAS and here's the output.
And obviously DSM sees all drives in good condition.

 

cat1.png

drives2.gif

Edited by FaberCaster

Update #2.
After 3 hours the error appeared again.
All the WD drives are presenting the error. The fourth one is actually configured as a hot spare, but it behaves the same way when part of the RAID.
Output of the cat command is the same as the first one I posted before.
Here's the current situation.
It's quite frustrating at this point, as it seems something is really wrong with the WD drives.

I had two spare old Seagates, 750 GB each, that I tested not long ago in this configuration without any issues.
Everything started when I first mounted the first pair of WD drives (the older EFAX).
The EFRX were bought brand new last week.

 

drives3.gif

drives4.gif

cat2.png


I'd be looking at dmesg and cat /var/log/messages to see why the drives are going offline.  This is definitely odd.
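To narrow that down, these are the kinds of events to grep for. The log lines below are made up for illustration; on the box itself you'd run the same grep against the real `dmesg` output:

```shell
# Sample kernel messages (illustrative, not from this NAS). Link resets
# or timeouts immediately before md0 drops a member point at cabling,
# power, or the controller rather than the drive media itself.
log='[1201.1] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0xe frozen
[1201.2] ata3: hard resetting link
[1450.7] md/raid1:md0: Disk failure on sdc1, disabling device'

# On the real system: dmesg | grep -Ei "reset|timeout|disk failure"
echo "$log" | grep -Ei 'reset|timeout|disk failure'
```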

 

Not super likely, but this could be a cabling, controller or port problem.

 

I only recommend doing this when the arrays are clean, but it would be interesting to swap a "good" port with a "bad" one (e.g. Drive 4 and Drive 5) and see whether the issue follows the drive or stays on the port.

 

EDIT: the fact that the main array (md2) is unaffected is perplexing as well

Edited by flyride

Hi Flyride.
I tend to rule out those guesses, because I've already tried everything you mentioned in your last post.
Original drive configuration was the following:

2 x 2 TB Seagate

2 x 750 GB Seagate

Both attached to a 4-port SATA PCIe controller based on a Marvell 9215 chip.

Once I backed up all the data from my old NAS, I plugged the 2 x 2 TB WD EFAX into the motherboard's SATA connectors.

And I got the system partition failure only on the WD drives.

In the meantime I got the Dell PERC H200, which I flashed to IT mode, and the other 2 WD EFRX.
I wiped all the partitions on the 2 TB drives, plugged in the flashed PERC and connected the 2 TB drives (the old 750 GB Seagates are parked on my desk) using the SAS/SATA cable the PERC came with.

So bad cables are, I'd say, excluded.

Swapping the drives across the SATA ports has been done as well.

I had a look at the dmesg and /var/log/messages output, but it's extremely large.


What I noticed is a message saying "invalid raid superblock magic".
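That message means the kernel read the metadata area of a member partition and did not find the md magic number (0xa92b4efc) where it expected it, so it rejected that member. A tiny simulation on a scratch file shows what the kernel is looking for; for v1.x metadata the value is stored little-endian (never write to real devices like this):

```shell
# Write the md superblock magic 0xa92b4efc to a scratch file the way it
# sits on disk for v1.x metadata (little-endian byte order: fc 4e 2b a9,
# given here as octal escapes for portability), then hex-dump it.
printf '\374\116\053\251' > /tmp/md_magic_demo
od -An -tx1 /tmp/md_magic_demo | tr -s ' '
```

Seeing a valid superblock on some partitions and garbage on others would line up with md0 repeatedly kicking the same members out.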

dmesg_1.gif

 

Frankly speaking, I'm running out of options and have already initiated the return procedure for the newest WD drives (the EFRX).
I'll swap them with two Seagates and see if that fixes the problem.
 

 

dmesg_2.gif

dmesg_0.gif


Have you tried using WD drives only, with no others?
Have you checked whether there is a newer BIOS for your motherboard/SATA controller?
Have you tried disabling the cache on these drives in DSM's drive manager?

BTW, I have a somewhat similar RAID5 configuration, but even more mixed:

1x ST2000VM003
1x WD20EFRX
2x HCS5C2020ALA632

No problems at all, so in your case it may well be some kind of controller issue.
Maybe a SATA mode unsupported by the WD drives?
In my N54L server I set SATA to 3 Gb/s or something similar (I was following a guide, so I don't know exactly why each setting is set the way it is).
But if it's the controller's fault, why is only md0 affected? Maybe the point is that md0 isn't actually RAID5: as the system partition it's a mirror, i.e. RAID1. Maybe RAID5 works with your controller/drive combination but RAID1 doesn't?
I would say disabling the cache may help, but it will probably decrease performance on the data partition.
 

Edited by amikot

@amikot

Thanks for the reply.
Here are my replies to your questions.

 

"Have you tried to not use other drives but WD only?"

Yes, done that. No different results.
"Have you checked if there is newer BIOS for your motherboard/SATA controller?"

The BIOS is the latest available; besides, I'm using the PERC controller flashed to IT mode, not the SATA ports on the motherboard.
"Have you tried to disable cache on these drives in Drives Manager of DSM?"
Cache is already disabled on all drives.


Update #3.

So, I removed the WD EFAX drives.

Wiped the partitions on the other 4 drives.

Reinstalled everything from scratch.

Swapped SATA and power wires on the drives.

Loader 1.04b DS918+ 6.2.3-25426 Update 3.

Created the array (Raid 5) and the storage pool.

Once the parity consistency check finished, everything seemed normal. But a couple of hours later I got the same error (failed to access the system partition).

And always on the WD drives 🤬

I hit the repair link... parity consistency check rerun.

Once done, I powered the system off and swapped the cables again (note that the controller is SAS/SATA; it has two ports for a total of 8 drives max).
So the WD drives originally on port #2 were moved to port #1, while the Seagate drives went to port #2.

The system was OK for about an hour or so, then same old s##t again.

I'll order two Seagates on Monday and kick the WD drives off the system for good.

Let's see if that helps...


Have you tried using the onboard controller for at least one of the WD drives? I think the problem may be related to your controller card's features - maybe there is a conflict between the software RAID1 of md0 and the hardware RAID1 capabilities of your controller card that only causes problems with the WD disks?
Maybe there is an option to switch the hardware RAID off completely?
 

