Rowdy Posted December 13, 2021 #1 (edited)

So, I'm a bit stuck. Over the last few years, whenever a disk crashed I'd pop the crashed disk out, put in a fresh one and repair the volume. No sweat. Over time I've migrated from a mix of 1 and 2 TB disks to all 3 TB ones, and earlier this year I received a 6 TB drive from WD as an RMA replacement for a 3 TB one. So I was running:

Disk 1: 3 TB WD - crashed
Disk 2: 3 TB WD
Disk 3: 6 TB WD
Disk 4: 3 TB WD

I bought a shiny new 6 TB WD to replace disk 1, but that did not work out well. With the setup above I have a degraded volume, but it is accessible. With the new drive in, the volume is crashed and the data is not accessible; I'm not able to select the repair option, only 'Create'. When I put the crashed disk back in and reboot, the volume is accessible again (degraded).

I did a (quick) SMART test on the new drive and it seems OK. The only thing that's changed since the previous disk replacement is that I upgraded to 6.2.3-25426. Could that be a problem?

However, before we proceed (and I waste your time): I've got backups. Of everything. So deleting the volume and creating a new one (or rebuilding the entire system) would not be that big of a deal; it would only cost me a lot of time. But that seems like the easy way out. I've got a Hyper Backup of the system, all the apps and a few of the shares on an external disk; the bigger shares are backed up to several other external disks. If I delete the volume, create a new one, restore the last Hyper Backup and copy back the other disks, am I golden? Will the Hyper Backup contain the original shares?

In case that isn't the best/simplest solution, I've attached some screenshots and some outputs so we know what we are talking about. Thanks!

First set of outputs (volume degraded but accessible):

rowdy@prime-ds:/$ sudo cat /proc/mdstat
Password:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
md2 : active raid5 sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]
md0 : active raid1 sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [12/3] [_UUU________]
unused devices: <none>

rowdy@prime-ds:/$ sudo vgdisplay -v
    Using volume group(s) on command line.
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               8.17 TiB
  PE Size               4.00 MiB
  Total PE              2142645
  Alloc PE / Size       2142645 / 8.17 TiB
  Free  PE / Size       0 / 0
  VG UUID               DXrwth-QERd-WcGx-B2aL-kgfR-TcEM-VcNKfe

  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

  --- Physical volumes ---
  PV Name               /dev/md2
  PV UUID               W53sd5-d2wI-jQLB-0qZF-Ncz9-Ex7Y-QLScaw
  PV Status             allocatable
  Total PE / Free PE    1427258 / 0

  PV Name               /dev/md3
  PV UUID               AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd
  PV Status             allocatable
  Total PE / Free PE    715387 / 0

rowdy@prime-ds:/$ sudo lvdisplay -v
    Using logical volume(s) on command line.
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

rowdy@prime-ds:/$ sudo lvm pvscan
  PV /dev/md2   VG vg1000   lvm2 [5.44 TiB / 0    free]
  PV /dev/md3   VG vg1000   lvm2 [2.73 TiB / 0    free]
  Total: 2 [8.17 TiB] / in use: 2 [8.17 TiB] / in no VG: 0 [0   ]

rowdy@prime-ds:/$ sudo cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/vg1000/lv /volume1 ext4 usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,synoacl,relatime,relatime_period=30 0 0

Second set of outputs (volume crashed):

rowdy@prime-ds:/$ sudo cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/2] [__UU]
md2 : active raid5 sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
md1 : active raid1 sdd2[3] sdc2[2] sdb2[1]
      2097088 blocks [16/3] [_UUU____________]
md0 : active raid1 sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [12/3] [_UUU________]
unused devices: <none>

rowdy@prime-ds:/$ sudo mdadm --detail /dev/md2
Password:
/dev/md2:
        Version : 1.2
  Creation Time : Thu Sep 27 21:57:37 2018
     Raid Level : raid5
     Array Size : 5846050368 (5575.23 GiB 5986.36 GB)
  Used Dev Size : 1948683456 (1858.41 GiB 1995.45 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent
    Update Time : Mon Dec 13 10:31:12 2021
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
           Name : prime-ds:2  (local to host prime-ds)
           UUID : 8b09d0a0:b176c408:82b17a97:a84d1b39
         Events : 4579924

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       21        1      active sync   /dev/sdb5
       5       8       53        2      active sync   /dev/sdd5
       4       8       37        3      active sync   /dev/sdc5

rowdy@prime-ds:/$ sudo vgdisplay -v
    Using volume group(s) on command line.
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  Wiping cache of LVM-capable devices
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554160128: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554217472: Input/output error
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 4096: Input/output error
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  Couldn't find device with uuid AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd.
  There are 1 physical volumes missing.
  There are 1 physical volumes missing.
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                1
  VG Size               8.17 TiB
  PE Size               4.00 MiB
  Total PE              2142645
  Alloc PE / Size       2142645 / 8.17 TiB
  Free  PE / Size       0 / 0
  VG UUID               DXrwth-QERd-WcGx-B2aL-kgfR-TcEM-VcNKfe

  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

  --- Physical volumes ---
  PV Name               /dev/md2
  PV UUID               W53sd5-d2wI-jQLB-0qZF-Ncz9-Ex7Y-QLScaw
  PV Status             allocatable
  Total PE / Free PE    1427258 / 0

  PV Name               unknown device
  PV UUID               AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd
  PV Status             allocatable
  Total PE / Free PE    715387 / 0

rowdy@prime-ds:/$ sudo lvdisplay -v
    Using logical volume(s) on command line.
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  Wiping cache of LVM-capable devices
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554160128: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554217472: Input/output error
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 4096: Input/output error
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  Couldn't find device with uuid AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd.
  There are 1 physical volumes missing.
  There are 1 physical volumes missing.
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

rowdy@prime-ds:/$ sudo lvm pvscan
  /dev/md3: read failed after 0 of 4096 at 0: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554160128: Input/output error
  /dev/md3: read failed after 0 of 4096 at 3000554217472: Input/output error
  /dev/md3: read failed after 0 of 4096 at 4096: Input/output error
  Couldn't find device with uuid AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd.
  PV /dev/md2          VG vg1000   lvm2 [5.44 TiB / 0    free]
  PV unknown device    VG vg1000   lvm2 [2.73 TiB / 0    free]
  Total: 2 [8.17 TiB] / in use: 2 [8.17 TiB] / in no VG: 0 [0   ]

rowdy@prime-ds:/$ sudo cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/vg1000/lv /volume1 ext4 usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,synoacl,relatime,relatime_period=30 0 0

Edited December 13, 2021 by Rowdy: typos, added detailed information about md2
flyride Posted December 13, 2021 #2 (edited)

There are three different crashed entities on your system right now:

1. The "Disk 1" physical drive - DSM is reporting it as crashed, but it is still working, at least to an extent.
2. The /dev/md3 array within your SHR - which is missing /dev/sda6 (logical drive 1).
3. The /dev/md2 array within your SHR - which is missing /dev/sdb5 (logical drive 2).

Issue #1 is what you think you need to fix by replacing a bad drive. The problem is that /dev/md2 is critical because of a loss of redundancy on another drive, not the one you are trying to replace. SHR needs both /dev/md2 and /dev/md3 to work in order to present your volume. When you try and replace the disk, /dev/md2 cannot start. You need to attempt to repair the /dev/md2 array with the disk you have prior to replacing it. DSM may not let you do this via the GUI because it thinks the physical disk is crashed. You can try and manually resync /dev/md2 with mdadm as root:

# mdadm /dev/md2 --manage --add /dev/sdb5

EDIT: the add target above should have been /dev/sda5.

Edited December 14, 2021 by flyride: wrong partition selected
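For anyone following along, the mapping above can be read straight from mdstat and mdadm. A quick way to see for yourself which partition is missing from which sub-array (a sketch using the device names from this thread; they will differ on other boxes):

# The per-array status map shows a "_" for each missing member
cat /proc/mdstat

# The detailed view marks the missing slot as "removed" and lists the surviving partitions
sudo mdadm --detail /dev/md2
sudo mdadm --detail /dev/md3

# Partitions map back to physical disks by letter: sda5/sda6 live on Disk 1, sdb5/sdb6 on Disk 2, and so on
sudo fdisk -l /dev/sda

An SHR storage pool is essentially a set of md arrays (md2, md3, ...) glued together by LVM, which is why both arrays have to be able to start before the volume can mount.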
Rowdy Posted December 13, 2021 Author #3

Thanks! I've tried that with the old drive, the new drive, no drive, and reboots in between, but they all say the same thing:

rowdy@prime-ds:/$ sudo mdadm /dev/md2 --manage --add /dev/sdb5
Password:
mdadm: Cannot open /dev/sdb5: Device or resource busy

How do I find out what's keeping the drive busy?
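For later readers: mdadm most likely reports "Device or resource busy" here because the partition is already an active member of an array (in the output above, /dev/sdb5 is still part of the running md2). A rough way to confirm that (a sketch, not commands from the original thread):

# Check whether the partition is already listed in a running array
grep sdb5 /proc/mdstat

# Its md superblock also records which array it thinks it belongs to
sudo mdadm --examine /dev/sdb5 | egrep 'Array UUID|Device Role|State'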
flyride Posted December 13, 2021 #4

Have you tried an in-place repair with the GUI? (Storage Pool screen)
Rowdy Posted December 14, 2021 Author #5

Unfortunately, I can't do anything there I'm afraid. I can run health checks with either drive (OLD or NEW) in, and hit 'Configure' with the OLD drive in. I've also checked what I could do with drive two, which should have some errors as per your first reply, but no joy there either I'm afraid.
flyride Posted December 14, 2021 #6

Sorry, I got the two arrays crossed up.

# mdadm /dev/md2 --manage --add /dev/sda5
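If you want to double-check that the right partition is going back into the right array before running the add, the partition's md superblock can be compared against the array itself (a sketch; this was not run in the thread):

# The Array UUID recorded on /dev/sda5 should match the UUID reported for /dev/md2
sudo mdadm --examine /dev/sda5 | grep 'Array UUID'
sudo mdadm --detail /dev/md2 | grep 'UUID'

# The Events counters show how far the stale member has fallen behind the running array
sudo mdadm --examine /dev/sda5 | grep 'Events'
sudo mdadm --detail /dev/md2 | grep 'Events'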
Rowdy Posted December 14, 2021 Author #7

No problem, I was not able to deduce that there were two arrays myself. With the OLD drive in, I could run that command, and the storage manager seems to be checking parity now. Does that mean it's finding the loss of redundancy and is going to correct it?
flyride Posted December 14, 2021 #8

Yes, it is trying to recreate parity from the "bad" drive for that specific sub-array. Once that is done, you should be able to replace the drive.
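If you'd rather watch the resync from the shell than from Storage Manager, something along these lines should work (a sketch; watch(1) may not be installed on DSM, hence the plain loop):

# /proc/mdstat shows a progress bar, percentage and ETA while a member is rebuilding
cat /proc/mdstat

# mdadm reports the same as "Rebuild Status : NN% complete" while the resync is running
sudo mdadm --detail /dev/md2 | egrep -i 'state|rebuild'

# Poll the md2 block once a minute
while true; do grep -A 2 '^md2' /proc/mdstat; sleep 60; done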
Rowdy Posted December 14, 2021 Author #9

It's at 88%... Fingers crossed! I'll keep you posted. Thanks for the help so far!
Rowdy Posted December 15, 2021 Author #10

Small update: after the check I could 'kick' the drive; however, I had done the add/parity check while the volume was not accessible. 😒 So I could kick the drive, but could not add the new one. I've now booted with the old drive and an accessible volume, and restarted the add/parity check. More news probably late this evening...
Rowdy Posted December 15, 2021 Author #11

Oh well. That was a bust.

After the parity check the volume shows as degraded, but otherwise fine; I can access all files and so on. I occasionally get a message that the system partition is damaged, asking if I want to repair it. I say yes, and it's fine. I don't have the option to deactivate drive 1 (I could deactivate 3 or 4, not a good plan), and if I yank it out the volume is crashed. If I insert the new drive, it won't show me the option to repair, probably because the volume is crashed? After that, if I put the old drive back and run the command you gave me (mdadm /dev/md2 --manage --add /dev/sda5), it will start the check again and after a while I'm fine again, with a busted drive.

The output of all the commands looks a lot better, no errors and such. One curious thing though: each time the parity check runs, it runs faster (the first run took 14 hours, the second 8, the last one 6), but the bad sector count on drive 1 goes waaaay up...

Any suggestions left, or should I give up and recreate the whole thing?

rowdy@prime-ds:/$ sudo cat /proc/mdstat
Password:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
md2 : active raid5 sda5[6] sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]
md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [12/4] [UUUU________]
unused devices: <none>

rowdy@prime-ds:/$ sudo vgdisplay -v
    Using volume group(s) on command line.
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               8.17 TiB
  PE Size               4.00 MiB
  Total PE              2142645
  Alloc PE / Size       2142645 / 8.17 TiB
  Free  PE / Size       0 / 0
  VG UUID               DXrwth-QERd-WcGx-B2aL-kgfR-TcEM-VcNKfe

  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

  --- Physical volumes ---
  PV Name               /dev/md2
  PV UUID               W53sd5-d2wI-jQLB-0qZF-Ncz9-Ex7Y-QLScaw
  PV Status             allocatable
  Total PE / Free PE    1427258 / 0

  PV Name               /dev/md3
  PV UUID               AxKbfA-Cffw-05S9-AccQ-W1NZ-Bwpn-7xGrfd
  PV Status             allocatable
  Total PE / Free PE    715387 / 0

rowdy@prime-ds:/$ sudo lvdisplay -v
    Using logical volume(s) on command line.
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                fw5xk2-x8ss-cUM1-LGKI-5ZR6-dz2Y-0ahnzW
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142645
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:0

rowdy@prime-ds:/$ sudo lvm pvscan
  PV /dev/md2   VG vg1000   lvm2 [5.44 TiB / 0    free]
  PV /dev/md3   VG vg1000   lvm2 [2.73 TiB / 0    free]
  Total: 2 [8.17 TiB] / in use: 2 [8.17 TiB] / in no VG: 0 [0   ]

rowdy@prime-ds:/$ sudo cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/vg1000/lv /volume1 ext4 usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,synoacl,relatime,relatime_period=30 0 0
flyride Posted December 15, 2021 #12 (edited)

What you are doing is repeatedly breaking /dev/md2 and repairing it (which is no longer necessary). Your drive is failing, so it keeps reallocating bad sectors. Let's stop doing that, because eventually you will run out of replacement sectors.

I'm concerned that we don't know which drive is actually experiencing the issue, and that we may be repairing the wrong array for the drive you want to replace. If so, we can issue a different mdadm command to repair the /dev/md3 array and try again. Please verify this by running:

# smartctl -d sat --all /dev/sda | fgrep -i sector

Edited December 15, 2021 by flyride
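To see at a glance which of the four disks is actually shedding sectors, the same SMART check can be looped over all of them (a sketch building on the command above; device names as used in this thread):

for d in sda sdb sdc sdd; do
    echo "=== /dev/$d ==="
    # -A prints only the vendor attribute table; these two attributes are the usual failure tells
    sudo smartctl -d sat -A /dev/$d | egrep -i 'Reallocated_Sector|Current_Pending_Sector'
done

A non-zero and climbing raw value (last column) for Reallocated_Sector_Ct means the drive is quietly remapping bad sectors; Current_Pending_Sector counts sectors it has not yet been able to read back.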
Rowdy Posted December 16, 2021 Author #13

I was aware of that, but I tried it three times because I thought I had screwed up, i.e. by repairing it while the volume wasn't accessible.

rowdy@prime-ds:/$ sudo smartctl -d sat --all /dev/sda | fgrep -i sector
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   195   195   140    Pre-fail  Always       -       154
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
flyride Posted December 16, 2021 #14

OK, let's fix the other array:

# mdadm /dev/md3 --manage --add /dev/sdb6

Before you attempt to replace a drive, please post an mdstat.
Rowdy Posted December 16, 2021 Author #15 (edited)

I will. The parity consistency check is running. Just the command below, I presume?

rowdy@prime-ds:/$ sudo cat /proc/mdstat

And now that I'm looking back, I totally missed that there was something wrong... So this old output:

md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

should become something like this once the parity consistency check is done?

md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

Trying to grasp what I'm doing, so next time I hopefully don't have to waste your time. 😬

Edited December 16, 2021 by Rowdy
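For anyone else trying to decode those status fields, a rough reading of the mdstat line (my interpretation; worth double-checking against the md/mdstat documentation):

# md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
#       ... [4/3] [U_UU]
#
# sda6[0] : partition sda6 occupies RaidDevice slot 0 of the array
# [4/3]   : the array is defined for 4 member devices, but only 3 are currently active
# [U_UU]  : per-slot map; "U" = up and in sync, "_" = missing or failed
#           (here slot 1 is the missing member, i.e. the Disk 2 partition sdb6)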
flyride Posted December 16, 2021 #16 (edited)

51 minutes ago, Rowdy said:
    I will. The parity consistency check is running. Just the command below, I presume?
    rowdy@prime-ds:/$ sudo cat /proc/mdstat

Yes, let's be totally certain of the array state prior to removing a disk (or disabling it in the UI, or whatever).

51 minutes ago, Rowdy said:
    should become something like this once the parity consistency check is done?
    md3 : active raid5 sda6[0] sdd6[4] sdc6[5]
          2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
    Trying to grasp what I'm doing, so next time I hopefully don't have to waste your time. 😬

Yep. All three remaining drives (sdb, sdc, sdd) need to be current and active ("U") on both arrays prior to removing Drive 1 (sda). Is your drive bay hot-swappable, or are you powering down between disk operations?

Edited December 16, 2021 by flyride
Rowdy Posted December 16, 2021 Author #17

Yes, my drives are hot-swappable. :) The parity consistency check is done, and the volume has entered a new status: 'warning'. That's a new one! Mdstat also seems okay?

rowdy@prime-ds:/$ sudo cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 sdb6[6] sda6[0] sdd6[4] sdc6[5]
      2930228736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md2 : active raid5 sda5[6] sdb5[1] sdc5[4] sdd5[5]
      5846050368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]
md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [12/4] [UUUU________]
unused devices: <none>
flyride Posted December 16, 2021 #18

All looks good from here. If the volume is currently accessible from the network, I would go ahead and attempt the drive replacement. If the volume is not currently accessible, do a soft reboot and verify that the volume is accessible after the reboot; also check to make sure there is no change to the mdstat. If all is well at that point, then attempt the drive replacement.
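Once the new drive is in and the repair has been started from Storage Manager, a couple of quick shell checks will show whether the rebuild onto the replacement disk has begun (a sketch; on this box the replacement should take over the /dev/sda slot, but verify the device name on your own system):

# Both sub-arrays should (eventually) list the new disk's partitions, with a recovery/resync progress line
cat /proc/mdstat

# It doesn't hurt to spot-check SMART on the replacement drive as well
sudo smartctl -d sat -A /dev/sda | egrep -i 'Reallocated_Sector|Current_Pending_Sector'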
Rowdy Posted December 17, 2021 Author #19

It works! It just works! And a green 'healthy' sign; I'm really happy. Thank you very, very much, and should you ever find yourself in the vicinity of Venlo, the Netherlands, swing by and I'll buy you a beer (or ten).