SHR2/BTRFS array degraded after adding a disk

Darkened · November 12, 2017

Hey,

I'm running XPEnology DSM 6.0.2-8451 Update 11 on a self-built computer.

I started out with 4 x 1Tb older Samsung drives (HD103UJ & HD103SJ). These are in an SHR2/BTRFS array (SHR enabled for DS3615xs). This setup hasn't had any issues. I originally intended to expand the array with more 1Tb drives, but decided to go with bigger drives since I had the chance. So I added a 3Tb WD Red and started expanding the volume. The main goal was to replace the 1Tb drives one by one with 3Tb drives and have 5 x 3Tb WD Reds in the end.

The expansion went ok and so did the consistency check. Then, for some unknown reason, the newly added disk was restarted. The swap system volume was degraded, then the volume itself, the disk was "inserted" and "removed" (although I didn't do anything), and finally the root system volume on the disk was degraded. I tried repairing the volume, but it didn't help.

I shut the server down and no new data has been written on the array since this issue.

Yesterday I finally had time to do something about this, so I removed the disk, emptied everything on the disk and re-inserted it in the XPEnology server.

I also changed the SATA-cable and power cable for the disk.

Repair was successful like the array expansion before and so was the consistency check. After this had finished, I started the RAID scrub.

Even the scrub went through just fine at 3:28:01 and then it did just about the same as when I was expanding the array. Disk restarted due to "unknown error", volumes degraded and the disk 5 was inserted and removed from the array.

[Screenshot: Screenshot-2017-11-12(3).png]

This is the situation now along with:

[Screenshot: Screenshot-2017-11-12(4).png]

The next step is of course to run diagnostics on that WD Red, but for some reason I don't think it's the disk that is causing this issue. I also have a few other WD Reds which I could try out, but I'd need to empty them first.

If you have any inkling of what could be causing this, it'd be appreciated.

Best regards,

Darkened aka. Janne

IG-88 · November 12, 2017

if the disk turns out to be ok it might be a software flaw in dsm, possibly fixed in dsm 6.1

if you write what hardware you use i can tell you if it's compatible (as there are fewer drivers available for 6.1 than for 6.0.2)

btw. you can also try to check the "real" log files in /var/log/ when using putty/ssh, maybe you will see more information about the "unknown" error of disk5

Darkened · November 12, 2017

Hey IG-88,

I'm running the Asus E35M1-M without any add-ons. Other than that, 4Gb of G.Skill DDR3 ram and Corsair PSU.

The mobo is running the latest available bios.

DLG is running as I'm writing this and at least the short test went through without an issue. I'll update the result once the test completes.

I'll also check the proper log files after the test is done. I can't get the NAS on the table at the moment.

Janne

IG-88 · November 12, 2017

amd cpu can be a problem, in hp servers people have to deactivate C1E in bios to get it working, so may work or not, you have to try

according to this the A50 chipset storage should be supported by the default AHCI driver (it has to be set to ahci mode in bios)

https://ata.wiki.kernel.org/index.php/SATA_hardware_features

and the NIC is a realtek 8111E, also supported by jun loader 1.02b

so the only problem seems to be the cpu, and if it's not working you will instantly know, as the system will not boot the loader properly and you will not find the system with synology assistant when booted from a usb flash drive with jun loader 1.02b

if you want to check the complete install, just disconnect all disks, add an empty disk (no partitions on it) and install dsm 6.1.3. if that works and all your plugins are available, you can do the same with the original disks and migrate/upgrade the system; all/most settings and plugins are usually working after that

but you should read the howto for update to 6.1 before trying this

Darkened · November 15, 2017

Hey IG-88,

Sorry about the delay. I couldn't test your suggestions before today, but here goes.

The WD Red ran through the DLG tests without any issues. This was done a few days ago.

For C1E state, I don't think it exists in the bios of this mobo. C6 is enabled, but C1E is nowhere to be found. Although I must say that the server works just fine, so if the C1E issue is more with getting the Xpenology to boot, then it's not the issue here. I've been using the server with 4 x 1Tb drives a while now without any problems.

The hard drives are and have been in AHCI-mode from the beginning. And I haven't had any issues connecting to the server, so I don't think there's anything wrong with the NIC.

I'll boot up the server next and try to go through the logs via Putty. I'll get back to you with the result from those.

Janne

Darkened · November 15, 2017

Hey again,

I do think I found the issue from the log (disk_log.xml for future reference).

This was the first try on 2017-10-01 when I tried expanding the array:

<kernel time="2017/10/01 03:58:04" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="590338" show="0">RecovComm Persist PHYRdyChg 10B8B </kernel>
  <kernel time="2017/10/01 04:18:48" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <hotplug time="2017/10/01 04:39:55" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugout</hotplug>
  <hotplug time="2017/10/01 04:39:55" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugin</hotplug>
  <hotplug time="2017/10/01 05:01:59" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugout</hotplug>
  <hotplug time="2017/10/01 05:01:59" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugin</hotplug>
  <kernel time="2017/10/01 05:31:08" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66050" show="0">RecovComm Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 05:52:41" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 06:13:44" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 06:36:21" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 06:57:24" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 07:18:03" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 07:39:06" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 07:59:42" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 08:20:22" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <hotplug time="2017/10/01 08:41:23" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugout</hotplug>
  <hotplug time="2017/10/01 08:41:23" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugin</hotplug>
  <kernel time="2017/10/01 09:02:25" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66050" show="0">RecovComm Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 09:23:28" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 09:44:07" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 10:06:22" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/10/01 10:27:25" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>

And the second time on 2017-11-12:

<kernel time="2017/11/12 03:50:24" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="590338" show="0">RecovComm Persist PHYRdyChg 10B8B </kernel>
  <hotplug time="2017/11/12 04:11:05" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugout</hotplug>
  <hotplug time="2017/11/12 04:11:05" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugin</hotplug>
  <kernel time="2017/11/12 04:32:05" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66050" show="0">RecovComm Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 05:06:43" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 05:27:45" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 05:48:49" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 06:09:25" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 06:30:13" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 06:56:13" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 07:17:42" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 07:38:44" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 08:00:24" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 08:34:54" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 08:55:56" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 09:16:37" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 09:38:24" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 09:59:00" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 10:19:43" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 10:40:28" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
  <kernel time="2017/11/12 11:02:24" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>

The same issue repeated, but without the several "plugout/plugin" events.
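For anyone who wants to look at the timing of these entries rather than read them by eye, here is a minimal Python sketch. It assumes the entries look like the excerpts in this post; the function names `parse_entries` and `gaps_in_seconds` are made up for illustration, and the regular expression is not a full parser for disk_log.xml. The sample is three entries copied from the 2017-10-01 excerpt.

```python
import re
from datetime import datetime

# Three entries copied from the 2017-10-01 excerpt in this post.
SAMPLE = """\
<kernel time="2017/10/01 03:58:04" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="590338" show="0">RecovComm Persist PHYRdyChg 10B8B </kernel>
<kernel time="2017/10/01 04:18:48" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" type="serror" raw="66048" show="0">Persist PHYRdyChg </kernel>
<hotplug time="2017/10/01 04:39:55" path="/dev/sde" model="WD30EFRX-68EUZN0" SN="WD-WMC4N1830023" show="1">plugout</hotplug>
"""

# Matches the opening tag, its time and path attributes, and the element text.
# It stops at "</", so a closing tag broken by terminal wrapping still matches.
ENTRY = re.compile(r'<(kernel|hotplug) time="([^"]+)" path="([^"]+)"[^>]*>([^<]*)</')


def parse_entries(text):
    """Return (timestamp, tag, device path, message) tuples (hypothetical helper)."""
    entries = []
    for tag, stamp, path, message in ENTRY.findall(text):
        when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        entries.append((when, tag, path, message.strip()))
    return entries


def gaps_in_seconds(entries):
    """Seconds between distinct consecutive timestamps (hypothetical helper)."""
    stamps = sorted({when for when, _, _, _ in entries})
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


entries = parse_entries(SAMPLE)
print(gaps_in_seconds(entries))  # [1244, 1267]
```

For these three entries the gaps are 1244 and 1267 seconds, so roughly 21 minutes apart. Running the same thing over the full excerpts would show whether that spacing holds throughout.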

After this revelation I found some reference from the unRaid wiki.

So it seems that the PHYRdyChg / 10B8B errors are most probably due to a bad connection at either the SATA cable or the SATA power cable. Between the two tries I changed the SATA cable to another one, and I didn't get any CRC errors either time. This points to the SATA power cable/plug or the HD connector, since I didn't change the power cable between the two tries.

There are still two other possibilities. One is too many hard drives on one lead/rail (although the WD Red was closest to the PSU). The other is HD hibernation, which I had enabled in DSM and which could mimic the hotplug events if the drive doesn't spin up fast enough etc. This could well be the culprit here, since the drive was ok until the expansion / repair was done, which took several hours both times.
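The raw="…" numbers in the log entries appear to be the SATA SError register value in decimal, with the element text listing the names of the bits that are set. Below is a minimal Python sketch that decodes them. The bit positions and short names follow what the Linux libata driver prints; that DSM's disk_log.xml uses the same scheme is an assumption, though the three raw values from the log above do decode to exactly the text logged next to them. `decode_serror` is a made-up helper name.

```python
# SError bit positions with the short names printed by the Linux libata driver.
# Assumption: disk_log.xml stores the same register value, in decimal, in raw="...".
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(raw):
    """Return the names of the bits set in a decimal SError value (hypothetical helper)."""
    value = int(raw)
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


# The three distinct raw values that appear in the log excerpts.
for raw in ("590338", "66050", "66048"):
    print(raw, hex(int(raw)), " ".join(decode_serror(raw)))
```

None of the three values has the BadCRC bit (bit 21) set, which fits the observation that no CRC errors showed up on either try.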

So according to this research:

Faulty SATA cable (changed, no difference)
Faulty SATA power cable / plug (to be tested)
Faulty SATA connector on drive (visual inspection ok)
Too many drives on one lead / rail
Software issue with WD Red drives and hibernation

For now I've disabled HD hibernation and I'll plug the drive to another slot / cable and proceed with the repair. I'll report back after the server has run the repair overnight.

Janne

Darkened · November 16, 2017

Morning update:

The array repair was successful and no weird behavior has occurred after the repair finished.

Just now I started raid scrubbing, which was suggested by DSM and after that it will run the file system check.

I'll report back after those are done, but I'm cautiously optimistic about this now.

Janne

Darkened · November 17, 2017

Final update (for now).

The array repair, RAID scrubbing and the file system de-fragmentation went through without a hitch.

Sadly I did two things at once: I moved the drive to a different bay (and thus onto a different SATA power cable) and disabled the HD hibernation feature in DSM. One of those fixed the issue, but I just didn't want to mess around with a "production" server by testing these things one by one.

Next up I'll expand the server by swapping one of the 1Tb drives with a 3Tb WD Red. Hopefully everything goes well with that.

Big thanks to IG-88 for pointing me in the right direction!

Janne

Polanskiman · November 18, 2017

The question(s) in this topic have been answered and/or the topic author has resolved their issue. This topic is now closed. If you have other questions, please open a new topic.
