NeoID | September 7, 2017 | #1
A bad cable crashed my BTRFS volume the other day. I found the issue after a while and started the rebuild process after replacing the bad SATA cable. The rebuild took a few days to complete, but once it did, it restarted from 0% again. I have not received any other emails or notifications from DSM that anything is abnormal. Is it normal that the rebuild process needs multiple passes to complete? I'm on ESXi, and the boot drive was mounted as read-only once everything was up and running smoothly. At first I thought the rebuild restarted because the state couldn't be written back to the boot drive (I don't think it does that), but it has been read-only since before the crash, so I don't think it's related to this problem. Has anyone been in this situation before and knows how many passes the rebuild process needs? I would never have believed that it requires more than one, but who knows...
aol | September 7, 2017 | #2
Can't really answer your question, but on a related note: I've had 3 WD Red 3TB drives fail with bad sectors in the last 6 months, in a btrfs RAID 5 array. It seems impossible to believe that 3 drives went bad; they all have < 1000 hours on them. I've RMAed them back to WD, so that's good. Meanwhile, I'm wondering if there is some issue with btrfs that's causing the bad sectors. I know, it seems unlikely, but I never had issues with ext4 RAID 5 arrays. Anyway, you say a bad cable crashed your btrfs volume, so now I'm wondering if I have a bad SATA cable. How would one tell? The drives that have failed have been across the array (disk 2 most recently, disk 1 before that), so that would seem to eliminate a single bad SATA cable as the culprit. But tell me, how did you diagnose a bad SATA cable?
NeoID | September 8, 2017 | #3 (edited)
13 hours ago, aol said: [quoted post #2 above]
I also use 3TB WD Red drives (10x) in RAID 6. First of all, I powered down my system and tested the drive on another machine. Since it was fine, I didn't think the crash was caused by the drive. To make sure, I replaced it with another WD Red 3TB I had lying around. That drive also "crashed" according to DSM after rebuilding for a day, and I started noticing a pattern: every time I started from scratch, it was the same disk bay that DSM complained about. The reason I didn't notice this sooner is that DSM sometimes changes the order of the disks (I don't remember if it happens on each boot or on each install of DSM). However, it was always the same bay causing issues. I finally removed the drive from the bay and connected it directly by SATA. For now everything seems to rebuild just fine (even though I don't see why it's currently rebuilding for the second time without any errors in between). I have also noticed that the HBA became very hot when rebuilding (and under heavy load), so that may also have something to do with this issue.
So I've 3D printed a bracket and mounted a 120mm fan on top of the card. For now everything works smoothly. I just hope it doesn't restart the rebuild a third time, or I'll think there's something wrong with DSM not being able to save the state of the volume somehow...
Edited September 8, 2017 by NeoID
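For anyone else trying to tell a flaky cable or bay apart from a genuinely failing drive, SMART counters usually settle it: UDMA_CRC_Error_Count climbs on link/cable problems, while Reallocated_Sector_Ct points at the platters. A minimal sketch of that check, assuming smartmontools is installed; `classify_smart` is my own hypothetical helper, not a real tool:

```shell
# classify_smart: read `smartctl -A` output on stdin and guess whether
# errors look like a cable/bay problem or a failing drive.
# (Hypothetical helper for illustration, not part of smartmontools.)
classify_smart() {
  awk '
    /UDMA_CRC_Error_Count/  { crc = $NF }      # link-level CRC errors
    /Reallocated_Sector_Ct/ { realloc = $NF }  # remapped bad sectors
    END {
      if (crc > 0 && realloc == 0)
        print "CRC errors, no reallocated sectors: suspect cable/backplane/bay"
      else if (realloc > 0)
        print "Reallocated sectors present: suspect the drive itself"
      else
        print "No obvious cable or media errors"
    }'
}

# Real use (as root): smartctl -A /dev/sda | classify_smart
# Simulated smartctl attribute lines, to show the idea:
printf '  5 Reallocated_Sector_Ct   0x0033   200   200   140   Pre-fail  Always  -  0\n199 UDMA_CRC_Error_Count     0x0032   200   200   000   Old_age   Always  -  12\n' | classify_smart
# -> CRC errors, no reallocated sectors: suspect cable/backplane/bay
```

A rising CRC count with zero reallocated sectors matches NeoID's case exactly: the drive tested fine in another machine, but the bay/cable kept corrupting the link.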
aol | September 8, 2017 | #4
Thanks @NeoID, I understand how you troubleshot the issue. Appreciate the follow-up. Every time I've had a drive go, I shut down, replace, boot up, repair, and it takes a single pass. I have experience with both ext4 RAID 5 and btrfs RAID 5. I still don't know what to make of 3 WD Reds going bad with < 1000 hours each on them. If I lose another drive, I'll probably wipe the array and rebuild it as an ext4 RAID 5. I'm on a build-a-box with a Gigabyte mobo, Intel CPU, 6GB of good RAM, and the 4x3TB Reds on the SATA ports, so all good-quality stuff. I'm not sure if there's some sort of weird bit swapping that could be causing bad sectors on the drives. I just don't know what to make of it, but to me the biggest culprit is the relatively new btrfs. Anyway. Peace.
NeoID | September 8, 2017 | #5
1 hour ago, aol said: [quoted post #4 above]
I'm afraid you're right about the rebuild taking only a single pass and something is wrong with my array... -_- Do you use a hypervisor or run bare metal? I have had these issues even on ext4, so I doubt it's related to the file system.
aol | September 8, 2017 | #6
I run bare metal: build-a-box Gigabyte mobo, Intel CPU, RAM, and the WD Reds. (There is a GPU plugged into the PCIe slot, but it's not powered; my understanding is this mobo is a desktop model and requires a GPU, but fortunately it boots without having to actually output graphics, and of course the CPU has a GPU built in.) I don't know what a hypervisor is, so I don't know if I use one. If that implies a VM, then no, I don't use a hypervisor.
IG-88 | September 9, 2017 | #7
Hi, you can look into the logs (/var/log/) to find information about the second rebuild. You can also have a peek at the state of your RAID with cat /proc/mdstat.
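When a rebuild is running, /proc/mdstat shows a recovery/resync line with a progress bar, percentage, and ETA. A small sketch that pulls out just the percentage; `rebuild_progress` is my own name, and the printf line simulates the mdstat format rather than reading a live array:

```shell
# rebuild_progress: read mdstat-style text on stdin and print the
# percentage from the first recovery/resync line found.
rebuild_progress() {
  awk '($0 ~ /recovery|resync/) && /%/ { print $4; exit }'
}

# Real use: rebuild_progress < /proc/mdstat
# Simulated mdstat recovery line (format as printed by the md driver):
printf '      [==>..................]  recovery = 12.6%% (370000000/2925544512) finish=213.0min speed=199892K/sec\n' | rebuild_progress
# -> 12.6%
```

Logging this periodically (e.g. from cron) would have made it obvious whether DSM's rebuild actually restarted at 0% or whether md was doing a second, distinct resync pass.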
NeoID | September 9, 2017 | #8
I've been looking at dmesg and /proc/mdstat for the last couple of days. Today the rebuild did a second pass, and now everything seems to be clean and working again! I guess my conclusion is: make sure all cables are properly inserted and cool your HBA card.
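For future readers: the dmesg symptoms of a flaky SATA link are fairly recognizable (libata messages like "hard resetting link", "SError", "failed command", "exception Emask"). A hedged sketch of a filter for them; `sata_errors` is my own name, and the printf lines are simulated log output, not captured from a real failure:

```shell
# sata_errors: filter kernel-log lines on stdin that typically
# accompany a bad SATA cable, backplane, or overheating HBA.
sata_errors() {
  grep -E 'ata[0-9]+(\.[0-9]+)?: (SError|hard resetting link|failed command|exception Emask)'
}

# Real use: dmesg | sata_errors
# Simulated dmesg lines; only the libata error line should pass through:
printf 'ata3.00: failed command: READ FPDMA QUEUED\nusb 1-2: new high-speed USB device\n' | sata_errors
# -> ata3.00: failed command: READ FPDMA QUEUED
```

If lines like these cluster on one ata port no matter which physical drive sits there, the port/cable/bay is the suspect rather than the drive, which is exactly the pattern NeoID eventually spotted.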