Storage pool repair


21 answers to this question

Ugh. btrfs stores its superblocks in three different places and we just tried to look for all of them, but the btrfs binary keeps crashing (on #1 and #3, #2 returned unusable data).  For the sake of completeness, please post a dmesg to see if there is any kernel-related log information about the last crash.

 

Because of the btrfs crashes, we have not positively proven that all three superblocks are inaccessible. Really the only thing left to try now is install a new Linux system, connect the drives to it, and see if a new Linux kernel and latest btrfs utilities would be able to read anything useful without core dumping.  I suppose you could also try and reinstall DSM (maybe using the DS918 platform since it has the newest kernel) and see if that makes a difference, but I don't hold out much hope for that.

 

Barring that result, whatever happened to your drives has caused them to return data that are so corrupted that there is probably no filesystem recovery possible without forensic tools. However, we haven't written over the filesystem areas of the disk, so forensic recovery should still be possible. And, the new metadata we created for the array will help a forensic lab know the correct order of the disks, should you decide to go in that direction.
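If forensic recovery is on the table, the standard first step is to take a sector-level image of each drive and work only from the copies. The sketch below demonstrates the image-and-verify pattern on a throwaway file; on the real system the source would be the raw device (e.g. /dev/sdh, a placeholder here) and GNU ddrescue would be preferable to dd, since it logs and retries bad sectors.

```shell
#!/bin/sh
# Sketch: image a drive before any forensic attempt, then verify the copy
# with a checksum. All paths here are stand-ins so the sketch runs end to
# end; in real use the source is the raw device (e.g. /dev/sdh) and the
# preferred tool is ddrescue: ddrescue /dev/sdh sdh_image.img sdh.map
set -e

SRC=/tmp/fake_disk.img      # stands in for /dev/sdh
DST=/tmp/sdh_image.img

# Create a small stand-in "disk" so the sketch is runnable.
dd if=/dev/urandom of="$SRC" bs=1M count=4 2>/dev/null

# Copy it block by block, continuing past read errors.
dd if="$SRC" of="$DST" bs=1M conv=noerror,sync 2>/dev/null

# Verify the image matches the source before working from it.
md5sum "$SRC" "$DST"
```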

 

If you decide to abandon the array and remake it, test the two failed drives very carefully before putting production data on them again, because this could be the result of controller or drive failure (although two drives failing in this way at the same time seems unlikely).

 

We did everything reasonable to recover this data. I'm sorry that the results were not better.


admin@XP:/$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
md3 : active raid5 sdi3[1]
      11711401088 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/1] [_U_]

md4 : active raid1 sda1[0] sdb1[1]
      117216192 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdg3[0]
      3902196544 blocks super 1.2 [1/1] [U]

md1 : active raid1 sdg2[0] sdh2[1] sdi2[2] sdj2[3]
      2097088 blocks [12/4] [UUUU________]

md0 : active raid1 sdg1[0] sdh1[1] sdi1[2] sdj1[3]
      2490176 blocks [12/4] [UUUU________]

unused devices: <none>
 


Well, this does look a little odd. I would expect to see your /dev/sda and /dev/sdb in md0 and md1.  Is this a bare-metal install?

 

Post the output of the following commands.  You'll need to be root (sudo -i)

 

# mdadm --detail /dev/md3

# mdadm --examine /dev/sd[hij]3 | egrep 'Event|/dev/sd'
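The grep for 'Event' is there because each md member carries an event counter that increments with every array state change; healthy members show (nearly) identical counts, and a member far behind fell out of the array earlier. A small sketch of that comparison, using made-up sample output (not the figures from this thread):

```python
# Sketch: compare mdadm event counters across array members. A member whose
# count lags the others was dropped from the array first. The sample text
# below is illustrative only, not output from this system.
import re

examine_output = """\
/dev/sdh3:
         Events : 370
/dev/sdi3:
         Events : 376
/dev/sdj3:
         Events : 376
"""

events = {}
device = None
for line in examine_output.splitlines():
    if line.startswith("/dev/"):
        device = line.rstrip(":")
    m = re.search(r"Events\s*:\s*(\d+)", line)
    if m and device:
        events[device] = int(m.group(1))

# Members behind the highest counter are the ones that failed out first.
stale = [d for d, e in events.items() if e < max(events.values())]
print(events)   # {'/dev/sdh3': 370, '/dev/sdi3': 376, '/dev/sdj3': 376}
print(stale)    # ['/dev/sdh3']
```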

 


root@XP:~# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Sat Nov 16 12:10:31 2019
     Raid Level : raid5
     Array Size : 11711401088 (11168.86 GiB 11992.47 GB)
  Used Dev Size : 5855700544 (5584.43 GiB 5996.24 GB)
   Raid Devices : 3
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 19 22:13:04 2020
          State : clean, FAILED
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : XPEH:2
           UUID : 22a4b5c5:8103a815:1de617b2:3f23ee03
         Events : 376

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      131        1      active sync   /dev/sdi3
       -       0        0        2      removed
 

root@XP:~# mdadm --examine /dev/sd[hij]3 | egrep 'Event|/dev/sd'
mdadm: No md superblock detected on /dev/sdh3.
mdadm: No md superblock detected on /dev/sdj3.
/dev/sdi3:
         Events : 376
 

 

Thank you

 


So the two missing RAID5 disks exist but the partition looks badly damaged.  Is the partition structure even still present for the RAID5?

 

# fdisk -l /dev/sdh

# fdisk -l /dev/sdj

 


I think so (like on pic 1)

 

root@XP:~# fdisk -l /dev/sdh
Disk /dev/sdh: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 987F18E3-DCA2-431A-9174-AADC0F9C53EC

Device       Start         End     Sectors  Size Type
/dev/sdh1     2048     4982527     4980480  2.4G Linux RAID
/dev/sdh2  4982528     9176831     4194304    2G Linux RAID
/dev/sdh3  9437184 11720840351 11711403168  5.5T Linux RAID
 

root@XP:~# fdisk -l /dev/sdj
Disk /dev/sdj: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D6A7A561-92BA-4C7C-B9AA-7C6DA547F406

Device       Start         End     Sectors  Size Type
/dev/sdj1     2048     4982527     4980480  2.4G Linux RAID
/dev/sdj2  4982528     9176831     4194304    2G Linux RAID
/dev/sdj3  9437184 11720840351 11711403168  5.5T Linux RAID
 


Ok, here's where things are.  The system knows it has a RAID5 array and sees one valid array member. Something has happened to the other two drives that seems to have overwritten the MD superblocks that identify partition #3 as part of the RAID5 array.

 

There is no graceful way of recovering this.  But we can force the system to think that those disks are part of the array.  This is a non-reversible operation and may still not result in the recovery of any data.  For example, if the filesystem base structures were overwritten, it may be impossible to get the volume to mount.  Or, some of your files or directories may be corrupted or missing.  Needless to say, I hope you have this data backed up somewhere.

 

So before you decide to do any of that, let's get some answers to the following:

 

1. What happened before this, or what caused it?  You said "unknown reasons," but is there any information at all about the circumstances?

2. Was anything deliberately done to remove the /dev/md0 and /dev/md1 partition members for the SSDs?

3. Anything else you think is relevant should you decide to try to brute-force recover your array?


What happened before? I think the volume was overflowing. Web access to the storage returned a message like "the system cannot display the page" (Synology's error message, not a standard browser error).

After a reboot the system asked me to install DSM again. After installation I got this situation.

No additional actions were performed.

I agree that manually adding the disks back to the RAID is the only option with some chance of success. I would be grateful if you could tell me how to do this.

Thank you

8 hours ago, peterzil said:

Does someone have a technical manual for DSM?

 

DSM just uses the normal mdadm and LVM stuff from Linux.

 

https://wiki.archlinux.org/index.php/RAID

 

https://www.thomas-krenn.com/en/wiki/Mdadm_recovery_and_resync

https://www.thomas-krenn.com/en/wiki/Mdadm_recover_degraded_Array_procedure

 

https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation

 

8 hours ago, peterzil said:

 

I want to try to add the disks to the RAID via the Linux configuration files.

 

If you have never done this before, I'd suggest waiting for flyride, or at least reading one or two of his threads where he helped people with recovery.

It's most important to know the reason for the failure so you can prevent it from happening again, especially when you are doing things that can't be reverted and you don't have a backup.

In really important recovery cases you would make an image file of every disk, so you can try more than one thing (and it would be running on approved and tested hardware).

 

I do remember a case where he tried to help someone and the hardware in question did not work reliably. An interesting read, as long as it's not your own data on those disks and you are just on the sidelines watching (and learning).


Sorry, I've had a crazy work schedule for the last couple of days and haven't been able to get back to this. It isn't rocket science, but I will post more detailed instructions when I get back from work in about 8 hours.


I had a few minutes, so here's a plan:

1. Retrieve the current array to filesystem relationship

2. Stop the array

3. Force (re-)create the array

4. Check the array for proper configuration before doing anything else (or report the exact failure response)

 

Assumptions based on the prior posts:

1. Array members are on disks h (8), i (9), j (10) and the array is ordered in that sequence

2. Data corruption has at least damaged the array superblocks (/dev/md3 RAID5), but the extent is unknown

 

Comments and caveats:
Note that this is an irreversible operation. Any metadata on the disks containing the array state will be overwritten.  Files on the disks are not damaged by this operation (so you could, in theory, send for forensic recovery still). It's possible that the create operation will fail without zeroing the array superblocks first.  I don't like doing that unless it's absolutely necessary.
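One extra safety step (my own suggestion, not part of the procedure above): before the irreversible create, save a copy of the region that holds each member's v1.2 md superblock, which sits 4 KiB from the start of the partition. The sketch below uses a throwaway file so it runs end to end; on the real system the input would be /dev/sdh3, /dev/sdi3, /dev/sdj3 and the backup would go to a different disk.

```shell
#!/bin/sh
# Sketch: back up the md v1.2 superblock region of an array member before a
# destructive mdadm --create. Metadata 1.2 lives 4 KiB into the partition.
# All paths are placeholders; real input would be /dev/sdX3.
set -e

PART=/tmp/fake_sdi3.img       # stands in for /dev/sdi3
BACKUP=/tmp/sdi3_md_super.bin

# Stand-in "partition" so the sketch is runnable.
dd if=/dev/urandom of="$PART" bs=4096 count=64 2>/dev/null

# Save 64 KiB starting at the 4 KiB offset, which covers the superblock.
dd if="$PART" of="$BACKUP" bs=4096 skip=1 count=16 2>/dev/null

ls -l "$BACKUP"
```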

 

If corruption is extensive, the array will start but it will not be possible to mount the filesystem (we'll try and check that after the array creates correctly).

 

Feel free to question, research, verify this suggestion prior to executing. At the end of the day it's your data, and your decision to follow the free advice you obtain here.  Again, I hope you have a backup, because we already know there is some amount of data loss.

 

Commands, execute in sequence:

# cat /etc/fstab

# mdadm --stop /dev/md3
# mdadm -v --create --assume-clean -e1.2 -n3 -l5 /dev/md3 /dev/sdh3 /dev/sdi3 /dev/sdj3 -u22a4b5c5:8103a815:1de617b2:3f23ee03

# cat /proc/mdstat

 

Post output, including error messages from each of these.


Thank you. 

This is the log of the commands:

 

root@XP:~# cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/md3 /volume2 btrfs  0 0
/dev/mapper/cachedev_0 /volume1 ext4 usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,synoacl,relatime 0 0
root@XP:~# mdadm --stop /dev/md3
mdadm: stopped /dev/md3
root@XP:~# mdadm -v --create --assume-clean -e1.2 -n3 -l5 /dev/md3 /dev/sdh3 /dev/sdi3 /dev/sdj3 -u22a4b5c5:8103a815:1de617b2:3f23ee03
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 64K
mdadm: /dev/sdi3 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Sat Nov 16 12:10:31 2019
mdadm: size set to 5855700544K
Continue creating array? y
mdadm: array /dev/md3 started.
root@XP:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
md3 : active raid5 sdj3[2] sdi3[1] sdh3[0]
      11711401088 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md4 : active raid1 sda1[0] sdb1[1]
      117216192 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdg3[0]
      3902196544 blocks super 1.2 [1/1] [U]

md1 : active raid1 sdg2[0] sdh2[1] sdi2[2] sdj2[3]
      2097088 blocks [12/4] [UUUU________]

md0 : active raid1 sdg1[0] sdh1[1] sdi1[2] sdj1[3]
      2490176 blocks [12/4] [UUUU________]

unused devices: <none>
 

The storage pool was repaired, but the volume is still "crashed"

 

 

(Attached screenshots: 2020-10-24 22_25_04-XP - Synology DiskStation.png, 2020-10-24 22_25_12-XP - Synology DiskStation.png)


Yep.  There is corruption to the extent that DSM thinks it must be an ext4 volume because it cannot find the initial btrfs superblock.
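For reference, here is a sketch of how one might check whether that initial superblock is intact. The btrfs primary superblock lives at byte 65536 of the device, and its magic string "_BHRfS_M" sits 64 bytes into the structure (after the csum, fsid, bytenr, and flags fields; offsets quoted from the btrfs on-disk format as I recall it). The demo runs against a small fake image rather than /dev/md3.

```python
# Sketch: test for the btrfs magic in the primary superblock. Field offsets
# are per the btrfs on-disk format (superblock at 64 KiB, magic 64 bytes in).
MAGIC = b"_BHRfS_M"
SUPER_OFFSET = 65536
MAGIC_OFFSET_IN_SUPER = 64

def has_btrfs_super(path):
    with open(path, "rb") as f:
        f.seek(SUPER_OFFSET + MAGIC_OFFSET_IN_SUPER)
        return f.read(len(MAGIC)) == MAGIC

# Demonstrate on a small fake image instead of the real device.
with open("/tmp/fake_md3.img", "wb") as f:
    f.write(b"\x00" * (SUPER_OFFSET + MAGIC_OFFSET_IN_SUPER))
    f.write(MAGIC)

print(has_btrfs_super("/tmp/fake_md3.img"))   # True
```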

 

Your fstab says it was previously mounted as a btrfs volume; do you concur with that?  If so, try to recover the btrfs superblock with:

 

# btrfs rescue super-recover -v /dev/md3

 

If it errors out, post the error.  If it suggests that it may have fixed the superblock, try mounting the volume in recovery mode:

 

# mount -vs -t btrfs -o ro,recovery,errors=continue /dev/md3 /volume2


root@XP:~# btrfs ins dump-super -fFa /dev/md3
superblock: bytenr=65536, device=/dev/md3
---------------------------------------------------------
btrfs: ctree.h:2183: btrfs_super_csum_size: Assertion `!(t >= (sizeof(btrfs_csum_sizes) / sizeof((btrfs_csum_sizes)[0])))' failed.
csum                    0xAborted (core dumped)
 


Ok, btrfs is crashing before we have tested all three superblocks.  So let's try and reach the other two directly:

 

# btrfs ins dump-super -Ffs 67108864 /dev/md3

# btrfs ins dump-super -Ffs 274877906944 /dev/md3
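The two byte offsets in those commands are not arbitrary: btrfs keeps its superblock copies at fixed positions of 64 KiB (primary), 64 MiB, and 256 GiB on devices large enough to hold them. A one-liner confirms the arithmetic:

```python
# Sketch: derive the fixed btrfs superblock mirror offsets used above.
KIB, MIB, GIB = 1024, 1024**2, 1024**3
offsets = [64 * KIB, 64 * MIB, 256 * GIB]
print(offsets)   # [65536, 67108864, 274877906944]
```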


This is the log of the commands:

 

root@XP:~# btrfs ins dump-super -Ffs 67108864 /dev/md3
superblock: bytenr=67108864, device=/dev/md3
---------------------------------------------------------
csum                    0x00000000 [DON'T MATCH]
bytenr                  0
flags                   0x0
magic                   ........ [DON'T MATCH]
fsid                    00000000-0000-0000-0000-000000000000
label
generation              0
root                    0
sys_array_size          0
chunk_root_generation   0
root_level              0
chunk_root              0
chunk_root_level        0
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             0
bytes_used              0
sectorsize              0
nodesize                0
leafsize                0
stripesize              0
root_dir                0
num_devices             0
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x0
csum_type               0
csum_size               4
cache_generation        0
uuid_tree_generation    0
dev_item.uuid           00000000-0000-0000-0000-000000000000
dev_item.fsid           00000000-0000-0000-0000-000000000000 [match]
dev_item.type           0
dev_item.total_bytes    0
dev_item.bytes_used     0
dev_item.io_align       0
dev_item.io_width       0
dev_item.sector_size    0
dev_item.devid          0
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
sys_chunk_array[2048]:
backup_roots[4]:
 

 

root@XP:~# btrfs ins dump-super -Ffs 274877906944 /dev/md3
superblock: bytenr=274877906944, device=/dev/md3
---------------------------------------------------------
btrfs: ctree.h:2183: btrfs_super_csum_size: Assertion `!(t >= (sizeof(btrfs_csum_sizes) / sizeof((btrfs_csum_sizes)[0])))' failed.
csum                    0xAborted (core dumped)
 

