Degraded volume, can't replace disk -> volume crashed (ext4, SHR)



So, I'm a bit stuck. Over the years, whenever a disk crashed, I'd pop the crashed disk out, put in a fresh one, and repair the volume. No sweat. Over time I've migrated from a mix of 1TB and 2TB disks to all 3TB ones, and earlier this year I received a 6TB drive from WD as an RMA replacement for a 3TB one. So I was running:

  • Disk 1: 3TB WD - Crashed
  • Disk 2: 3TB WD
  • Disk 3: 6TB WD
  • Disk 4: 3TB WD

 

So I bought a shiny new 6TB WD to replace disk 1, but that did not work out well. With the setup above, the volume is degraded but accessible. With the new drive in, the volume is crashed and the data is inaccessible, and I'm not able to select the repair option, only 'Create'. When I put the crashed disk back and reboot, the volume is accessible again (degraded). I did a (quick) SMART test on the new drive and it seems OK.

The only thing that's changed since the previous disk replacement is that I did an upgrade to 6.2.3-25426. Could that be the problem?

However, before we proceed (and I waste your time): I've got backups. Of everything. So deleting the volume and creating a new one (or rebuilding the entire system) would not be that big of a deal; it would only cost me a lot of time. But that seems like the easy way out. I've got a Hyper-Backup of the system, all the apps, and a few of the shares on an external disk; the bigger shares are backed up to several other external disks. If I delete the volume, create a new one, restore the last Hyper-Backup, and copy back the other disks, am I golden? Will the Hyper-Backup contain the original shares?

 

In case that's not the best/simplest solution, I've attached some screenshots and some command output so we know what we're talking about.

 

Thanks!

 




rowdy@prime-ds:/$ sudo cat /proc/mdstat
Password:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

md2 : active raid5 sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [12/3] [_UUU________]

unused devices: <none>


rowdy@prime-ds:/$ sudo vgdisplay -v
    Using volume group(s) on command line.
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               8.17 TiB
  PE Size               4.00 MiB
  Total PE              2142645
  Alloc PE / Size       2142645 / 8.17 TiB
  Free  PE / Size       0 / 0
  VG UUID               DXrwth-QERd-WcGx-B2aL-kgfR-TcEM-VcNKfe

  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

  --- Physical volumes ---
  PV Name               /dev/md2
  PV UUID               W53sd5-d2wI-jQLB-0qZF-Ncz9-Ex7Y-QLScaw
  PV Status             allocatable
  Total PE / Free PE    1427258 / 0

  PV Name               /dev/md3
  PV UUID               AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd
  PV Status             allocatable
  Total PE / Free PE    715387 / 0



rowdy@prime-ds:/$ sudo lvdisplay -v
    Using logical volume(s) on command line.
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0






rowdy@prime-ds:/$ sudo lvm pvscan
  PV /dev/md2   VG vg1000   lvm2 [5.44 TiB / 0    free]
  PV /dev/md3   VG vg1000   lvm2 [2.73 TiB / 0    free]
  Total: 2 [8.17 TiB] / in use: 2 [8.17 TiB] / in no VG: 0 [0   ]
  
  
  
rowdy@prime-ds:/$ sudo cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/vg1000/lv /volume1 ext4 usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,synoacl,relatime,relatime_period=30 0 0

 

 



rowdy@prime-ds:/$ sudo cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/2] [__UU]

md2 : active raid5 sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1]
      2097088 blocks [16/3] [_UUU____________]

md0 : active raid1 sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [12/3] [_UUU________]

unused devices: <none>


rowdy@prime-ds:/$ sudo mdadm --detail /dev/md2
Password:
/dev/md2:
        Version : 1.2
  Creation Time : Thu Sep 27 21:57:37 2018
     Raid Level : raid5
     Array Size : 5846050368 (5575.23 GiB 5986.36 GB)
  Used Dev Size : 1948683456 (1858.41 GiB 1995.45 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Mon Dec 13 10:31:12 2021
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : prime-ds:2  (local to host prime-ds)
           UUID : 8b09d0a0:b176c408:82b17a97:a84d1b39
         Events : 4579924

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       21        1      active sync   /dev/sdb5
       5       8       53        2      active sync   /dev/sdd5
       4       8       37        3      active sync   /dev/sdc5



rowdy@prime-ds:/$ sudo vgdisplay -v
    Using volume group(s) on command line.
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
    Wiping cache of LVM-capable devices
    /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554160128: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554217472: Input/output error
    /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 4096: Input/output error
    /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  Couldn't find device with uuid AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd.
    There are 1 physical volumes missing.
    There are 1 physical volumes missing.
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                1
  VG Size               8.17 TiB
  PE Size               4.00 MiB
  Total PE              2142645
  Alloc PE / Size       2142645 / 8.17 TiB
  Free  PE / Size       0 / 0
  VG UUID               DXrwth-QERd-WcGx-B2aL-kgfR-TcEM-VcNKfe

  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

  --- Physical volumes ---
  PV Name               /dev/md2
  PV UUID               W53sd5-d2wI-jQLB-0qZF-Ncz9-Ex7Y-QLScaw
  PV Status             allocatable
  Total PE / Free PE    1427258 / 0

  PV Name               unknown device
  PV UUID               AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd
  PV Status             allocatable
  Total PE / Free PE    715387 / 0



rowdy@prime-ds:/$ sudo lvdisplay -v
    Using logical volume(s) on command line.
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
    Wiping cache of LVM-capable devices
    /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554160128: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554217472: Input/output error
    /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 4096: Input/output error
    /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  Couldn't find device with uuid AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd.
    There are 1 physical volumes missing.
    There are 1 physical volumes missing.
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0
  


rowdy@prime-ds:/$ sudo lvm pvscan
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554160128: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554217472: Input/output error
  /dev/md3: read failed after 0 of 4096 at 4096: Input/output error
  Couldn't find device with uuid AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd.
  PV /dev/md2         VG vg1000   lvm2 [5.44 TiB / 0    free]
  PV unknown device   VG vg1000   lvm2 [2.73 TiB / 0    free]
  Total: 2 [8.17 TiB] / in use: 2 [8.17 TiB] / in no VG: 0 [0   ]



rowdy@prime-ds:/$ sudo cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/vg1000/lv /volume1 ext4 usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,synoacl,relatime,relatime_period=30 0 0

 



 

 

1-OLD_SITUATION_DEGRADED_STORAGE_POOL.png

2-OLD_SITUATION_DEGRADED_HDD-SDD.png

3-NEW_SITUATION_CRASHED_VOLUME.png

4-NEW_SITUATION_CRASHED_STORAGE_POOL.png

5-NEW_SITUATION_CRASHED_HDD-SDD.png


There are three different crashed entities on your system right now.

 

1. The "Disk 1" physical drive: DSM is reporting it as crashed, but it is still working, at least to an extent.

2. The /dev/md3 array within your SHR, which is missing its member on logical drive 1 (/dev/sda6).

3. The /dev/md2 array within your SHR, which is missing its member on logical drive 2 (/dev/sdb5).

 

The #1 issue is what you think you need to fix by replacing a bad drive.

The problem is that /dev/md2 is critical because of a loss of redundancy on another drive, not the one you are trying to replace. SHR needs both /dev/md2 and /dev/md3 working in order to present your volume. When you try to replace the disk, /dev/md2 cannot start.
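The same conclusion can be read straight off mdstat. Below is a rough sketch that flags arrays with missing members; the here-doc (sample taken from the crashed state above) and the /tmp paths are stand-ins for a live `cat /proc/mdstat`:

```shell
#!/bin/sh
# Spot degraded arrays in mdstat-style output. In a status pair like
# "[4/3] [_UUU]", 3 of 4 members are active and each "_" marks a slot
# that has dropped out. The here-doc stands in for `cat /proc/mdstat`.
cat <<'EOF' > /tmp/mdstat_crashed.sample
md3 : active raid5 sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/2] [__UU]
md2 : active raid5 sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
EOF

awk '
  /^md[0-9]/   { array = $1 }              # remember which array this block is
  /\[[U_]+\]$/ {                           # member-status line, e.g. [_UUU]
      n = gsub(/_/, "_", $NF)              # count the "_" (missing) slots
      if (n > 0) printf "%s: %d member(s) missing %s\n", array, n, $NF
  }' /tmp/mdstat_crashed.sample | tee /tmp/degraded.out
```

With this sample it reports both data arrays as incomplete, which matches what the GUI calls a crashed volume.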

 

You need to attempt to repair the /dev/md2 array with the disk you have, prior to replacing it. DSM may not let you do this via the GUI because it thinks the physical disk is crashed. You can try to manually resync /dev/md2 with mdadm as root:

 

# mdadm /dev/md2 --manage --add /dev/sdb5

 

EDIT: the add target above should have been /dev/sda5


Thanks! I've tried that, with the old drive, new drive, no drive, and reboots, but they all say the same:

 

rowdy@prime-ds:/$ sudo mdadm /dev/md2 --manage --add /dev/sdb5
Password:
mdadm: Cannot open /dev/sdb5: Device or resource busy


How do I find what's keeping the drive busy?
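For anyone landing here: "Device or resource busy" on `--add` usually means the partition is still a member of a running array, which is visible in /proc/mdstat (on a live box, `ls /sys/class/block/sdb5/holders` should also list the md device holding it). A rough sketch; the here-doc and /tmp paths are stand-ins for the live file:

```shell
#!/bin/sh
# Find which md array (if any) currently claims a given partition.
# A partition that appears on an "mdX : active ..." line is busy and
# cannot be --add'ed elsewhere until it is failed/removed from that
# array first. The here-doc stands in for the real /proc/mdstat.
PART=sdb5
cat <<'EOF' > /tmp/mdstat_busy.sample
md2 : active raid5 sdb5[1] sdc5[4] sdd5[5]
md1 : active raid1 sdd2[3] sdc2[2] sdb2[1]
md0 : active raid1 sdb1[1] sdc1[3] sdd1[2]
EOF

# A member shows up as "<part>[slot]", so match "sdb5[" literally.
awk -v p="$PART" '$0 ~ p"\\[" { print p " is held by " $1 }' \
    /tmp/mdstat_busy.sample | tee /tmp/busy.out
```

Here sdb5 turns out to be an active member of md2, which is exactly why mdadm refuses to add it anywhere.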


Unfortunately, I can't do anything there, I'm afraid. I can do health checks with both (OLD, NEW) drives in, and hit 'Configure' with the OLD drive in. I've also checked what I could do on drive 2, which should have some errors as per your first reply, but no joy, I'm afraid.


No problem, I was not able to deduce that there were two arrays myself ;)

Using the OLD drive, I could perform that command, and the storage manager seems to be checking parity now. Does that mean it's detecting the loss of redundancy and is going to correct it?

6-OLD-DISK-MDADM-ADDED-CHECKING-PARITY.png


Small update: after the check I could 'kick' the drive. However, I did the add/parity check while the volume was not accessible... 😒

So I could kick the drive, but could not add the new drive. So now I've booted with the old drive and an accessible volume, and restarted the add/parity check... More news probably late this evening...


Oh well. That was a bust. After the parity check the volume shows as degraded, but it's fine; I can access all files etc. I occasionally get a message that the system partition is damaged, asking if I want to repair it. I say yes, and it's fine. I don't have the option to deactivate drive 1 (I could deactivate drives 3 and 4, not a good plan ;) ), and if I yank it out, the volume is crashed. If I insert the new drive, it won't show me the option to repair, probably because the volume is crashed? After that, if I put the old drive back and run the command you gave me (mdadm /dev/md2 --manage --add /dev/sda5), it will start the check again, and after a while I'm fine again, with a busted drive. All the commands I've put in seem a lot better, no errors and such.

 

One thing that's curious, though: each time the parity check runs, it runs faster (the first time took 14 hours, the second 8, the last 6), but the bad sector count on drive 1 goes waaaay up...

 

Any suggestions left, or should I give up and recreate the whole thing?

 


rowdy@prime-ds:/$ sudo cat /proc/mdstat
Password:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

md2 : active raid5 sda5[6] sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [12/4] [UUUU________]

unused devices: <none>


rowdy@prime-ds:/$ sudo vgdisplay -v
    Using volume group(s) on command line.
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               8.17 TiB
  PE Size               4.00 MiB
  Total PE              2142645
  Alloc PE / Size       2142645 / 8.17 TiB
  Free  PE / Size       0 / 0
  VG UUID               DXrwth-QERd-WcGx-B2aL-kgfR-TcEM-VcNKfe

  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

  --- Physical volumes ---
  PV Name               /dev/md2
  PV UUID               W53sd5-d2wI-jQLB-0qZF-Ncz9-Ex7Y-QLScaw
  PV Status             allocatable
  Total PE / Free PE    1427258 / 0

  PV Name               /dev/md3
  PV UUID               AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd
  PV Status             allocatable
  Total PE / Free PE    715387 / 0
  
  
  
rowdy@prime-ds:/$ sudo lvdisplay -v
    Using logical volume(s) on command line.
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0
  
  
  
rowdy@prime-ds:/$ sudo lvm pvscan
  PV /dev/md2   VG vg1000   lvm2 [5.44 TiB / 0    free]
  PV /dev/md3   VG vg1000   lvm2 [2.73 TiB / 0    free]
  Total: 2 [8.17 TiB] / in use: 2 [8.17 TiB] / in no VG: 0 [0   ]
  
  
  
rowdy@prime-ds:/$ sudo cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/vg1000/lv /volume1 ext4 usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,synoacl,relatime,relatime_period=30 0 0

 

 


What you are doing is repeatedly breaking /dev/md2 and repairing it (which is no longer necessary). Your drive is failing, so it keeps reallocating bad sectors.

Let's stop doing that, because eventually you will run out of replacement sectors.

 

I'm concerned that we don't know which drive is actually experiencing the issue, and that we may be repairing the wrong array. If so, we can issue a different mdadm command to repair the /dev/md3 array and try again. Please verify this by running:

 

# smartctl -d sat --all /dev/sda | fgrep -i sector
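The two numbers to watch in that output are attribute 5 (Reallocated_Sector_Ct) and attribute 197 (Current_Pending_Sector); the raw count is the last column. A small sketch that pulls them out (the here-doc, with sample values from this thread, and the /tmp paths stand in for real smartctl output):

```shell
#!/bin/sh
# Extract the raw counts for the two sector-health SMART attributes:
# ID 5 (Reallocated_Sector_Ct) and ID 197 (Current_Pending_Sector).
# The raw value is the last column of each attribute row. The here-doc
# stands in for `smartctl -d sat --all /dev/sda | fgrep -i sector`.
cat <<'EOF' > /tmp/smart.sample
  5 Reallocated_Sector_Ct   0x0033   195   195   140    Pre-fail  Always       -       154
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
EOF

awk '$1 == 5 || $1 == 197 { printf "%s raw=%s\n", $2, $NF }' \
    /tmp/smart.sample | tee /tmp/smart.out
```

A rising raw value on attribute 5 across runs is the drive burning through spare sectors, which is what the later posts in this thread observe.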


I was aware of that, but I tried it three times because I thought I had screwed up, i.e. by repairing it while the volume wasn't accessible...
 

rowdy@prime-ds:/$ sudo smartctl -d sat --all /dev/sda | fgrep -i sector
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct                                            0x0033   195   195   140    Pre-fail  Always       -       154
197 Current_Pending_Sector                                           0x0032   200   200   000    Old_age   Always       -       0

 


I will. Parity consistency check is running. Just the below command I presume? :)

 

rowdy@prime-ds:/$ sudo cat /proc/mdstat




And now, looking back, I totally missed that there was something wrong... So this old output
 

md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

 
Should become something like this once the parity consistency check is done?
 

md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]


Trying to grasp what I'm doing, so next time I hopefully don't have to waste your time.. 😬 

51 minutes ago, Rowdy said:

I will. Parity consistency check is running. Just the below command I presume? :)

 



rowdy@prime-ds:/$ sudo cat /proc/mdstat

 

Yes, let's be totally certain of the array state prior to removing a disk (or disabling in the UI, or whatever).

 

51 minutes ago, Rowdy said:

Should become something like this once the parity consistency check is done?
 



md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]


Trying to grasp what I'm doing, so next time I hopefully don't have to waste your time.. 😬 

 

Yep.  All three remaining drives (sdb, sdc, sdd) need to be current and active ("U") on both arrays prior to removing Drive 1 (sda).

 

Is your drive bay hot-swappable, or are you powering down between disk operations?


Yes, my drives are hot-swappable. :) The parity consistency check is done, and the volume has entered a new status: warning. That's a new one! Mdstat also seems okay?

 

rowdy@prime-ds:/$ sudo cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sdb6[6] sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md2 : active raid5 sda5[6] sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [12/4] [UUUU________]

unused devices: <none>

 


All looks good from here.

 

If the volume is currently accessible from the network, I would go ahead and attempt the drive replacement.

 

If the volume is not currently accessible, do a soft reboot and verify that the volume is accessible after the reboot. Also check to make sure there is no change to the mdstat. If all is well at that point, then attempt the drive replacement.
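Before pulling sda, that all-clear can also be double-checked mechanically. A rough sketch that only inspects the two data arrays (DSM's md0/md1 system mirrors always show trailing `_` for empty bays, so they are deliberately skipped); the here-doc and /tmp paths stand in for the live /proc/mdstat:

```shell
#!/bin/sh
# Pre-replacement sanity check: both data arrays (md2/md3) must show a
# full complement, i.e. no "_" inside their [U...] status brackets.
# md0/md1 are skipped because their trailing "_" slots are just empty
# drive bays, not failures. The here-doc stands in for /proc/mdstat.
cat <<'EOF' > /tmp/mdstat_healthy.sample
md3 : active raid5 sdb6[6] sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid5 sda5[6] sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]
EOF

awk '
  /^md[0-9]/ { array = $1 }
  /\[[U_]+\]$/ && (array == "md2" || array == "md3") {
      if ($NF ~ /_/) { bad = 1; printf "%s still degraded: %s\n", array, $NF }
  }
  END {
      if (bad) print "NOT SAFE to pull the disk yet"
      else     print "OK: md2 and md3 are complete"
  }' /tmp/mdstat_healthy.sample | tee /tmp/safety.out
```

With the healthy sample above it prints the OK line; any `_` in an md2/md3 status would flip it to the NOT SAFE warning instead.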

