Help save my 55TB SHR1! Or mount it via Ubuntu :(

flyride · January 23, 2020

Your drives have reordered yet again. I know IG-88 said your controller deliberately presents them contiguously (which is problematic in itself) but if all drives are up and stable, I cannot see why that behavior would cause a reorder on reboot. I remain very wary of your hardware consistency.

Look through dmesg and see if you have any hardware problems since your power cycle boot.

Run another hotswap query and see if any drives have changed state since your power cycle boot.

Run another mdstat - is it still slow?
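A minimal sketch for spotting degraded arrays in the mdstat output, assuming the usual "[expected/active]" field on the blocks line. The function name degraded_arrays is made up for illustration and is not a DSM or mdadm tool.

```shell
# List md arrays whose active member count is below the expected count.
# Reads /proc/mdstat-style text on stdin and prints "array active/expected".
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array name
        match($0, /\[[0-9]+\/[0-9]+\]/) {     # e.g. [13/12] on the blocks line
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name, n[2] "/" n[1]
        }
    '
}
```

It would be run as: degraded_arrays < /proc/mdstat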

C-Fu · January 23, 2020

37 minutes ago, flyride said:

Run another mdstat - is it still slow?

Yeah it is. Slow, but working.

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1] 
md4 : active raid5 sdl6[0] sdn6[2] sdm6[1]
      11720987648 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/3] [UUU__]
      
md2 : active raid5 sdb5[0] sdk5[12] sdo5[11] sdq5[9] sdp5[8] sdn5[7] sdm5[6] sdl5[5] sdf5[4] sde5[3] sdd5[2] sdc5[1]
      35105225472 blocks super 1.2 level 5, 64k chunk, algorithm 2 [13/12] [UUUUUUUUUU_UU]
      
md5 : active raid1 sdo7[3]
      3905898432 blocks super 1.2 [2/0] [__]
      
md1 : active raid1 sdb2[0] sdc2[1] sdd2[2] sde2[3] sdf2[4] sdk2[5] sdl2[6] sdm2[7] sdn2[11] sdo2[8] sdp2[9] sdq2[10]
      2097088 blocks [24/12] [UUUUUUUUUUUU____________]
      
md0 : active raid1 sdb1[1] sdc1[2] sdd1[3] sdf1[5]
      2490176 blocks [12/4] [_UUU_U______]
      
unused devices: <none>

37 minutes ago, flyride said:

I remain very wary of your hardware consistency.

Just out of curiosity, does this mean that replacing the current SAS card with another one would fix it?
An IBM SAS expander has just arrived; would this plus my current 2-port SAS card help somehow? Or is it because of something else, like the motherboard?

I just did a notepad++ compare with Post ID 113 for fdisk, and seems like nothing has changed.

# fdisk -l /dev/sd?
Disk /dev/sda: 223.6 GiB, 240057409536 bytes, 468862128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x696935dc

Device     Boot Start       End   Sectors   Size Id Type
/dev/sda1        2048 468857024 468854977 223.6G fd Linux raid autodetect
Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 43C8C355-AE0A-42DC-97CC-508B0FB4EF37

Device       Start        End    Sectors  Size Type
/dev/sdb1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdb2  4982528    9176831    4194304    2G Linux RAID
/dev/sdb5  9453280 5860326239 5850872960  2.7T Linux RAID
Disk /dev/sdc: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 0600DFFC-A576-4242-976A-3ACAE5284C4C

Device       Start        End    Sectors  Size Type
/dev/sdc1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdc2  4982528    9176831    4194304    2G Linux RAID
/dev/sdc5  9453280 5860326239 5850872960  2.7T Linux RAID
Disk /dev/sdd: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 58B43CB1-1F03-41D3-A734-014F59DE34E8

Device       Start        End    Sectors  Size Type
/dev/sdd1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdd2  4982528    9176831    4194304    2G Linux RAID
/dev/sdd5  9453280 5860326239 5850872960  2.7T Linux RAID
Disk /dev/sde: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: E5FD9CDA-FE14-4F95-B776-B176E7130DEA

Device       Start        End    Sectors  Size Type
/dev/sde1     2048    4982527    4980480  2.4G Linux RAID
/dev/sde2  4982528    9176831    4194304    2G Linux RAID
/dev/sde5  9453280 5860326239 5850872960  2.7T Linux RAID
Disk /dev/sdf: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 48A13430-10A1-4050-BA78-723DB398CE87

Device       Start        End    Sectors  Size Type
/dev/sdf1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdf2  4982528    9176831    4194304    2G Linux RAID
/dev/sdf5  9453280 5860326239 5850872960  2.7T Linux RAID
Disk /dev/sdk: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: A3E39D34-4297-4BE9-B4FD-3A21EFC38071

Device       Start        End    Sectors  Size Type
/dev/sdk1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdk2  4982528    9176831    4194304    2G Linux RAID
/dev/sdk5  9453280 5860326239 5850872960  2.7T Linux RAID
Disk /dev/sdl: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 849E02B2-2734-496B-AB52-A572DF8FE63F

Device          Start         End    Sectors  Size Type
/dev/sdl1        2048     4982527    4980480  2.4G Linux RAID
/dev/sdl2     4982528     9176831    4194304    2G Linux RAID
/dev/sdl5     9453280  5860326239 5850872960  2.7T Linux RAID
/dev/sdl6  5860342336 11720838239 5860495904  2.7T Linux RAID
Disk /dev/sdm: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 423D33B4-90CE-4E34-9C40-6E06D1F50C0C

Device          Start         End    Sectors  Size Type
/dev/sdm1        2048     4982527    4980480  2.4G Linux RAID
/dev/sdm2     4982528     9176831    4194304    2G Linux RAID
/dev/sdm5     9453280  5860326239 5850872960  2.7T Linux RAID
/dev/sdm6  5860342336 11720838239 5860495904  2.7T Linux RAID
Disk /dev/sdn: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 09CB7303-C2E7-46F8-ADA0-D4853F25CB00

Device          Start         End    Sectors  Size Type
/dev/sdn1        2048     4982527    4980480  2.4G Linux RAID
/dev/sdn2     4982528     9176831    4194304    2G Linux RAID
/dev/sdn5     9453280  5860326239 5850872960  2.7T Linux RAID
/dev/sdn6  5860342336 11720838239 5860495904  2.7T Linux RAID
Disk /dev/sdo: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 1713E819-3B9A-4CE3-94E8-5A3DBF1D5983

Device           Start         End    Sectors  Size Type
/dev/sdo1         2048     4982527    4980480  2.4G Linux RAID
/dev/sdo2      4982528     9176831    4194304    2G Linux RAID
/dev/sdo5      9453280  5860326239 5850872960  2.7T Linux RAID
/dev/sdo6   5860342336 11720838239 5860495904  2.7T Linux RAID
/dev/sdo7  11720854336 19532653311 7811798976  3.7T Linux RAID
Disk /dev/sdp: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 1D5B8B09-8D4A-4729-B089-442620D3D507

Device       Start        End    Sectors  Size Type
/dev/sdp1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdp2  4982528    9176831    4194304    2G Linux RAID
/dev/sdp5  9453280 5860326239 5850872960  2.7T Linux RAID
Disk /dev/sdq: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 54D81C51-AB85-4DE2-AA16-263DF1C6BB8A

Device       Start        End    Sectors  Size Type
/dev/sdq1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdq2  4982528    9176831    4194304    2G Linux RAID
/dev/sdq5  9453280 5860326239 5850872960  2.7T Linux RAID

# dmesg | tail
[38440.957234]  --- wd:0 rd:2
[38440.957237] RAID1 conf printout:
[38440.957238]  --- wd:0 rd:2
[38440.957239]  disk 0, wo:1, o:1, dev:sdo7
[38440.957258] md: md5: set sdo7 to auto_remap [1]
[38440.957260] md: recovery of RAID array md5
[38440.957262] md: minimum _guaranteed_  speed: 600000 KB/sec/disk.
[38440.957263] md: using maximum available idle IO bandwidth (but not more than 800000 KB/sec) for recovery.
[38440.957264] md: using 128k window, over a total of 3905898432k.
[38440.957535] md: md5: set sdo7 to auto_remap [0]

This looks OK, right?

Oddly enough fgrep hotswap doesn't return anything. But the last few hundred lines of cat /var/log/disk.log are

2020-01-24T02:23:21+08:00 homelab kernel: [38410.974657] md: md5: set sdo7 to auto_remap [1]
2020-01-24T02:23:21+08:00 homelab kernel: [38410.974659] md: recovery of RAID array md5
2020-01-24T02:23:21+08:00 homelab kernel: [38410.974945] md: md5: set sdo7 to auto_remap [0]
2020-01-24T02:23:21+08:00 homelab kernel: [38411.005717] md: md5: set sdo7 to auto_remap [1]
2020-01-24T02:23:21+08:00 homelab kernel: [38411.005718] md: recovery of RAID array md5
2020-01-24T02:23:21+08:00 homelab kernel: [38411.005961] md: md5: set sdo7 to auto_remap [0]
2020-01-24T02:23:21+08:00 homelab kernel: [38411.038632] md: md5: set sdo7 to auto_remap [1]
2020-01-24T02:23:21+08:00 homelab kernel: [38411.038634] md: recovery of RAID array md5
2020-01-24T02:23:21+08:00 homelab kernel: [38411.038873] md: md5: set sdo7 to auto_remap [0]
2020-01-24T02:23:21+08:00 homelab kernel: [38411.074782] md: md5: set sdo7 to auto_remap [1]
2020-01-24T02:23:21+08:00 homelab kernel: [38411.074784] md: recovery of RAID array md5
2020-01-24T02:23:21+08:00 homelab kernel: [38411.074973] md: md5: set sdo7 to auto_remap [0]
2020-01-24T02:23:21+08:00 homelab kernel: [38411.106766] md: md5: set sdo7 to auto_remap [1]
2020-01-24T02:23:21+08:00 homelab kernel: [38411.106767] md: recovery of RAID array md5
2020-01-24T02:23:21+08:00 homelab kernel: [38411.106956] md: md5: set sdo7 to auto_remap [0]

And that's the current time.
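A rough way to quantify how often recovery is restarting, assuming the log lines keep the "md: recovery of RAID array mdX" wording shown above. The function name count_recovery_starts is hypothetical.

```shell
# Count "recovery of RAID array" kernel messages per array.
# Reads log text on stdin and prints "array count", one array per line.
count_recovery_starts() {
    awk '/md: recovery of RAID array/ { count[$NF]++ }   # $NF is the array name
         END { for (a in count) print a, count[a] }' | sort
}
```

It would be run as: count_recovery_starts < /var/log/disk.log. Several restarts within the same second, as in the log above, suggest the recovery is being aborted and retried rather than making progress.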

Edited January 23, 2020 by C-Fu

flyride · January 23, 2020

Everything is going up and down right now. You can see the changed drive assignments between the two last posted mdstats. We can't do anything with this until it's stable.
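One way to make the changed assignments easier to see is to reduce each saved mdstat to sorted "array member[slot]" pairs and diff the results. This is only a sketch; the function name mdstat_members and the snapshot file names below are made up for illustration.

```shell
# Reduce /proc/mdstat-style text on stdin to one "array member[slot]" pair
# per line, sorted so that two snapshots can be compared with diff.
mdstat_members() {
    awk '/^md[0-9]+ :/ {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\[[0-9]+\]/) print $1, $i   # keep only fields like sdl6[0]
    }' | sort
}
```

For example, after saving two snapshots as before.txt and after.txt: mdstat_members < before.txt > before.members, the same for after.txt, then diff before.members after.members. Any line in the diff is a member that appeared, disappeared, or moved to a different device name.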

C-Fu · January 23, 2020

8 minutes ago, flyride said:

You can see the changed drive assignments between the two last posted mdstats.

Damn. You're right.

Usually when something like this happens... is there a way to prevent the sas card from doing this? Like a setting or a bios update or something. Or does this mean that the card is dying?

If I take out, say, sda (the SSD) and put it back in, will the assignments change and then revert? Or the same with any other drive connected to the SAS card? Sorry, I'm just frustrated but still want to understand.

Edited January 23, 2020 by C-Fu

flyride · January 23, 2020

I can't really answer your question. Drives are going up and down. That can happen because the interface is unreliable, or the power is unreliable. A logic problem in the SAS card is far more likely to cause a total failure than an intermittent one.

If it were me, I would completely replace all your SATA cables and the power supply.

C-Fu · January 24, 2020

16 hours ago, flyride said:

If it were me, I would completely replace all your SATA cables and the power supply.

I just changed from a 750W PSU to a 1600W PSU that's fairly new (only a few days' use max), so I don't believe the PSU is the problem.

When I get back on monday, I'll see if I can replace the whole system (I have a few motherboards unused) and cables and whatnot and reuse the SAS card if that's not likely the issue, and maybe reinstall Xpenology. Would that be a good idea?

flyride · January 24, 2020

3 hours ago, C-Fu said:

I just changed from a 750W PSU to a 1600W PSU that's fairly new (only a few days' use max), so I don't believe the PSU is the problem.

When I get back on monday, I'll see if I can replace the whole system (I have a few motherboards unused) and cables and whatnot and reuse the SAS card if that's not likely the issue, and maybe reinstall Xpenology. Would that be a good idea?

If all your problems started after that power supply replacement, this further reinforces the idea that power stability is the issue. You seem reluctant to believe that a new power supply can be a problem (it can). For what it's worth, 13 drives x 5 W equals 65 W, so the load itself shouldn't be a factor.
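The arithmetic can be checked in the shell. The 5 W per drive is the rough steady-state figure used in this post, not a measured value, and the function name drive_power_watts is made up for illustration.

```shell
# Estimate total steady-state drive power.
# $1 = number of drives, $2 = watts per drive (integer, assumed figure)
drive_power_watts() {
    echo $(( $1 * $2 ))
}

drive_power_watts 13 5   # prints 65
```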

In any debugging and recovery operation, the objective should be to manage the change rate and therefore risk. Replacing the whole system would violate that strategy.

Do the drive connectivity failures implicate a SAS card problem? Maybe, but a much more plausible explanation is physical connectivity or power. If you have an identical SAS card, and it is passive (no intrinsic configuration required), replacing it is a low risk troubleshooting strategy.
Do failures implicate the motherboard? Maybe, if you are using on-board SATA ports, but the same plausibility test applies. However, there is more variability and risk (mobo model, BIOS settings, etc).
Do failures implicate DSM or loader stability? Not at all; DSM boots fine and is not crashing. And if you reinstall DSM, it's very likely your arrays will be destructively reconfigured. Please don't do this.

So I'll stand by (and extend) my previous statement - if this were my system, I would change your power and cables first. If that doesn't solve things, maybe the SAS card, and lastly the motherboard.

IG-88 · January 24, 2020

On 1/24/2020 at 5:18 PM, flyride said:

So I'll stand by (and extend) my previous statement - if this were my system, I would change your power and cables first. If that doesn't solve things, maybe the SAS card, and lastly the motherboard.

if the changes in order are related to the sas controller/driver, it's possible to use an 8-port sata/ahci controller instead

https://xpenology.com/forum/topic/19854-sata-controllers-not-recognized/?do=findComment&comment=122709

Edited January 25, 2020 by IG-88

flyride · January 24, 2020

That looks like a pretty nice low cost card, and with an onboard PCIe switch too.

IG-88 · January 25, 2020

On 1/24/2020 at 8:59 PM, flyride said:

That looks like a pretty nice low cost card, and with an onboard PCIe switch too.

it might not get the most out of an ssd-only system, but on the other hand people with that in mind would not use sata drives; m.2 nvme would be the solution for that

also there are usually still onboard sata connectors for using ssds at "full" 6Gb/s

IG-88 · February 2, 2020

@flyride

synology seems not to use a write-intent bitmap with its raid

in theory it could be a great help to have one when a drive drops out of the raid, as the drive could be re-synced in just a few minutes or seconds

we know by the event number that most of the data on the dropped drive is good, so we don't have to write multiple TB of data to get it into the raid again

any reason (besides performance) not to do a

mdadm --grow --bitmap=internal /dev/md2

"It uses space that the alignment requirements of the metadata assure us is otherwise unused. For v0.90, that is limited to 60K. For 1.x it is 3K. As this is unused disk space, bitmaps can be added to an existing md device without the risk to take away space from an existing filesystem on that device."

and if there were a performance problem, it could be removed at any time without impact

mdadm --grow --bitmap=none /dev/md2

https://raid.wiki.kernel.org/index.php/Write-intent_bitmap

https://raid.wiki.kernel.org/index.php/Mdstat#bitmap_line

it does not help to resolve hardware problems, but it might decrease the odds of another drive failing during the window of reduced redundancy by shortening that window to a minimum
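To see which arrays currently have a bitmap, the "bitmap:" line that mdstat prints under such arrays (see the Mdstat wiki link above) can be checked. A minimal sketch; the function name arrays_with_bitmap is made up for illustration.

```shell
# Print the names of md arrays that have a "bitmap:" line in
# /proc/mdstat-style text read from stdin.
arrays_with_bitmap() {
    awk '/^md[0-9]+ :/   { name = $1 }      # remember the current array name
         $1 == "bitmap:" { print name }'    # bitmap line belongs to that array
}
```

It would be run as: arrays_with_bitmap < /proc/mdstat. On the mdstat output posted earlier in this thread it would print nothing, since none of the arrays show a bitmap line.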
