aol

btrfs causing crashed drives?


I've run a mixture of WD Red drives in a Synology DS410 and an Intel SSE4200 enclosure running Xpenology for years with very few drive issues. Recently I repurposed an Intel box (CPU/RAM/mobo) I'd built a few years ago that was just sitting there, and successfully set it up with 4x 3TB WD Red drives running Xpenology. When given the choice, I created a btrfs RAID 5 volume.

 

But in the five or so months I've been running this NAS, three drives have crashed and started reporting a bunch of bad sectors. These drives have less than 1,000 hours on them, practically new. Fortunately they're under warranty at some level. Still, I'm wondering: could this be btrfs? I'm no file system expert. Light research suggests that while btrfs has been around for several years and is of course a supported option in Synology, some feel it isn't ready for prime time. I'm at a loss to explain why three WD Red drives, manufactured on different dates and with less than 1,000 hours on them, are failing so catastrophically. I understand btrfs and bad sectors aren't really in the same problem zone; software shouldn't be able to cause hardware faults. I considered heat, but these drives are rated to 65°C and they aren't going above 38°C or so. If it matters, a failing drive always reports its problems at boot-up. In fact, since the volume is now degraded with the loss of yet another drive, I'm just not turning the system off until I get a replacement in there; one of the remaining drives failed to start up properly in the past week.

 

My final consideration is that this is a build-a-box using a Gigabyte motherboard with four drives on the SATA bus in AHCI mode. Could some random hardware issue in this system be causing bad sectors to be reported on the drives? Seems unlikely. Has anyone ever heard of the Synology OS reporting bad sectors when there weren't actually bad sectors?

 

Anyone have any thoughts on this? Should I go back to ext4? This is mainly a Plex media server.


totenkopf4

Sorry to hear of the troubles, but I wanted to comment on your problem.

Trust me, you're not alone with these types of problems.

 

WD drives come in many flavors (Black, Blue, Red, Gold, Green, etc.), and each one serves a purpose. In your case the Reds are marketed as NAS drives, whereas the Golds are for data center use. What's the difference? Not much, but that's not the question. The issues you're seeing require some questions to be asked first.

For instance: in the SMART tables each drive records, what are the pertinent values besides power-on hours?

When I look at SMART tables I check several things: start/stop count, spin-up time, raw read error rate, reallocated sector count, power cycle count, and, most importantly, load cycle count. These are a few of the data values I look at when evaluating HDD issues.
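If you want to pull those attributes out programmatically, here's a minimal sketch. It assumes smartmontools' `smartctl -A /dev/sdX` output format and its standard attribute names; treat the exact raw-value parsing as a simplification, since a few attributes report composite raw values.

```python
# Sketch: extract a few key SMART attributes from `smartctl -A` output.
# Attribute names below are smartmontools' standard spellings.
KEY_ATTRS = {
    "Start_Stop_Count",
    "Spin_Up_Time",
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Power_Cycle_Count",
    "Load_Cycle_Count",
}

def parse_smart_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for the attributes of interest.

    Attribute rows look like:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    Only the first token of RAW_VALUE is taken; some attributes append
    extra text there (e.g. Spin_Up_Time averages), which this sketch ignores.
    """
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in KEY_ATTRS:
            attrs[fields[1]] = int(fields[9])
    return attrs
```

Run `smartctl -A` yourself (or via the Synology UI's SMART page) and feed the text in; the raw values, not the normalized ones, are what matter for the counters discussed below.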

As you may know, almost all HDD manufacturers build into the firmware the ability for the drive to power down at its own discretion. That's fine sometimes, but it can be problematic in situations like yours, or under access patterns like those btrfs produces. So if your start/stop count is very high, it could mean the drive's firmware is stopping the motor or parking the heads; in those cases the drive needs extra time to get ready. Server-class controllers can regulate this behavior, but PCs cannot.

Unrelated to that function: several years ago WD added a feature called "IntelliPark" to its then-new line of Green HDDs, and it was carried over into the Red line as well.

 

"With IntelliPark, the disks can position the read/write heads in a parked position, unloaded. This was meant to reduce power consumption, however it was originally set in a very aggressive way (with a timeout of just 8 seconds), and under certain circumstances the disks would constantly switch in and out of that idle state. That is not uncommon on a NAS, especially if based on Linux, where the system wakes the disks every 10-20 seconds to write logs, to collect offline SMART status (which triggers the heads out of their parked position) or for other reasons." 

The paragraph above is taken from this article, which deals with your problem (potentially):

https://withblue.ink/2016/07/15/what-i-learnt-from-using-wd-red-disks-to-build-a-home-nas.html

 

That link also explains how to stretch out the IntelliPark timeout to reduce the frequency of head parking.
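For reference, the IntelliPark timer is a single raw byte that tools like the open-source idle3-tools utility (`idle3ctl`) read and write. My understanding of the encoding, taken from that project's documentation, is sketched below; treat it as an assumption and verify with `idle3ctl -g /dev/sdX` on your own drive before changing anything.

```python
def idle3_timeout_seconds(raw):
    """Convert a WD idle3 (IntelliPark) timer raw byte to seconds.

    Encoding per the idle3-tools documentation (assumption -- verify
    against your own drive):
      0        -> timer disabled (no automatic head parking)
      1..128   -> raw * 0.1 seconds
      129..255 -> (raw - 128) * 30 seconds
    """
    if raw == 0:
        return None  # parking disabled
    if raw <= 128:
        return raw * 0.1
    return (raw - 128) * 30

# The factory default of 8 seconds corresponds to raw value 80;
# a commonly suggested gentler setting is raw 138, i.e. 300 seconds.
```

Under this encoding you can see why the default is so aggressive: 8 seconds of idle is shorter than the interval at which a Linux-based NAS typically touches its disks for logging and SMART polling.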

 

Be advised that the site also mentions another tweak that I would steer clear of. I'm not sure the Synology OS supports it, and we don't want to induce more problems than we already have.

 

If the HDDs have been in service for a while (years), they may simply be due for replacement because of this situation, i.e. on the fringe of wearing out (excessive load cycle count).

But only an extensive offline test procedure (on another computer) can determine that: run the WD diagnostics or other long-duration testing software, and compare the SMART tables at the start of the test, during it, and at the end.

Hope this helps.

 

On 8/26/2017 at 11:52 AM, totenkopf4 said:

Sorry to hear of the troubles, but I wanted to comment on your problem. Trust me, you're not alone with these types of problems.

 

totenkopf4, thank you so much for replying, and for the detail. I'm aware of the SMART tests and I do use them, just infrequently.

 

On my DS410 I have 4x 2TB WD Reds with tens of thousands of hours on them, each, with no issues. On this Xpenology system I have 4x 3TB WD Reds, most of them nearly brand new with under 1,000 hours.

 

So, for example, I recently had an issue with disk 3: a couple of weeks ago it reported I/O errors, but it now seems normal. The drives I replaced were reporting bad sectors. On disk 3, the load cycle count reads as follows:

value: 200

worst: 200

threshold: 0

status: OK

raw data: 67

 

I have no idea how to interpret that, but I'll start googling.
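For anyone else puzzling over those fields, the usual SMART convention is that the normalized value counts *down* from a vendor-chosen starting point toward the threshold, and a threshold of 0 marks the attribute as informational only. A small sketch of that interpretation (my reading of the general convention, not anything Synology-specific):

```python
def smart_attr_status(value, worst, threshold):
    """Interpret a normalized SMART attribute triple.

    Convention (general SMART practice, not vendor-specific):
    the normalized value decays toward the threshold as the drive wears;
    a threshold of 0 means the attribute is purely informational.
    """
    if threshold == 0:
        return "informational"       # never flagged as failed
    if value <= threshold:
        return "FAILING now"
    if worst <= threshold:
        return "failed in the past"  # recovered, but crossed the line before
    return "OK"

# Disk 3's load cycle count (200/200/0) is informational: with a
# threshold of 0 the drive never flags it, so the interesting number
# is the raw count (67 here), not the normalized 200.
```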

 

On disk 1, which is brand new, the load cycle count reads 200/200/0/OK/21, so the only difference is the raw data. On disk 4, my "oldest" drive in this system, I see 200/200/0/OK/845; again, the only difference is in the raw data column. I need to research this.
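To put those raw numbers in context, a rough back-of-the-envelope check is to compare the load-cycle rate against the drive's rated load/unload limit. The 600,000-cycle figure below is my assumption based on WD Red spec sheets; check your exact model's datasheet, and the power-on hours are whatever your SMART table reports.

```python
def load_cycle_outlook(raw_load_cycles, power_on_hours, rated_cycles=600_000):
    """Return (cycles_per_hour, projected_hours_until_rated_limit).

    rated_cycles defaults to 600,000, a figure commonly quoted in WD Red
    datasheets (assumption -- verify for your model).
    """
    rate = raw_load_cycles / power_on_hours
    if rate == 0:
        return 0.0, float("inf")
    remaining_hours = (rated_cycles - raw_load_cycles) / rate
    return rate, remaining_hours

# e.g. the "oldest" disk above: 845 cycles in roughly 1,000 power-on hours
# is under one cycle per hour -- nothing like IntelliPark thrashing, which
# can rack up hundreds of cycles per day on an 8-second timer.
```

A healthy drive idling under IntelliPark's default timer can burn through its rated cycles in a couple of years, which is why the raw counter, divided by hours, is the number worth watching.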

 

I certainly understand that drives have firmware that regulates their behavior; should I look for ways to "tune" it? I would have thought Synology had figured that part out.

 

I had 3TB Reds in my other Xpenology box with thousands of hours apiece on them, no issues. I moved them to this box, reinitialized from an ext4 RAID 5 to a btrfs RAID 5, and the drives keep failing. Weird.

 

I'll look at that link and run the WD diagnostics. I understand what you say about btrfs and timeouts, but as far as I know bad sectors aren't a timeout-related thing? Very frustrated by this recent experience.

 

Thanks again.

 

