Storage pool repair


21 answers to this question

Ugh. btrfs stores its superblocks in three different places and we just tried to look for all of them, but the btrfs binary keeps crashing (on #1 and #3, #2 returned unusable data).  For the sake of completeness, please post a dmesg to see if there is any kernel-related log information about the last crash.

 

Because of the btrfs crashes, we have not positively proven that all three superblocks are inaccessible. Really the only thing left to try now is install a new Linux system, connect the drives to it, and see if a new Linux kernel and latest btrfs utilities would be able to read anything useful without core dumping.  I suppose you could also try and reinstall DSM (maybe using the DS918 platform since it has the newest kernel) and see if that makes a difference, but I don't hold out much hope for that.

 

Barring that result, whatever happened to your drives has caused them to return data that are so corrupted that there is probably no filesystem recovery possible without forensic tools. However, we haven't written over the filesystem areas of the disk, so forensic recovery should still be possible. And, the new metadata we created for the array will help a forensic lab know the correct order of the disks, should you decide to go in that direction.
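If forensic recovery is on the table, the standard first step is to take a sector-level image of each drive and work only from the copies. The sketch below demonstrates the image-and-verify pattern on a throwaway file; on the real system the source would be the raw device (e.g. /dev/sdh, a placeholder here) and GNU ddrescue would be preferable to dd, since it logs and retries bad sectors.

```shell
#!/bin/sh
# Sketch: image a drive before any forensic attempt, then verify the copy
# with a checksum. All paths here are stand-ins so the sketch runs end to
# end; in real use the source is the raw device (e.g. /dev/sdh) and the
# preferred tool is ddrescue: ddrescue /dev/sdh sdh_image.img sdh.map
set -e

SRC=/tmp/fake_disk.img      # stands in for /dev/sdh
DST=/tmp/sdh_image.img

# Create a small stand-in "disk" so the sketch is runnable.
dd if=/dev/urandom of="$SRC" bs=1M count=4 2>/dev/null

# Copy it block by block, continuing past read errors.
dd if="$SRC" of="$DST" bs=1M conv=noerror,sync 2>/dev/null

# Verify the image matches the source before working from it.
md5sum "$SRC" "$DST"
```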

 

If you decide to abandon the array and remake it, test the two failed drives very carefully before putting production data on them again, because this could be the result of controller or drive failure (although two drives failing in this way at the same time seems unlikely).

 

We did everything reasonable to recover this data. I'm sorry that the results were not better.


admin@XP:/$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
md3 : active raid5 sdi3[1]
      11711401088 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/1] [_U_]

md4 : active raid1 sda1[0] sdb1[1]
      117216192 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdg3[0]
      3902196544 blocks super 1.2 [1/1] [U]

md1 : active raid1 sdg2[0] sdh2[1] sdi2[2] sdj2[3]
      2097088 blocks [12/4] [UUUU________]

md0 : active raid1 sdg1[0] sdh1[1] sdi1[2] sdj1[3]
      2490176 blocks [12/4] [UUUU________]

unused devices: <none>
 


Well, this does look a little odd. I would expect to see your /dev/sda and /dev/sdb in md0 and md1.  Is this a bare-metal install?

 

Post the output of the following commands.  You'll need to be root (sudo -i)

 

# mdadm --detail /dev/md3

# mdadm --examine /dev/sd[hij]3 | egrep 'Event|/dev/sd'
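The grep for 'Event' is there because each md member carries an event counter that increments with every array state change; healthy members show (nearly) identical counts, and a member far behind fell out of the array earlier. A small sketch of that comparison, using made-up sample output (not the figures from this thread):

```python
# Sketch: compare mdadm event counters across array members. A member whose
# count lags the others was dropped from the array first. The sample text
# below is illustrative only, not output from this system.
import re

examine_output = """\
/dev/sdh3:
         Events : 370
/dev/sdi3:
         Events : 376
/dev/sdj3:
         Events : 376
"""

events = {}
device = None
for line in examine_output.splitlines():
    if line.startswith("/dev/"):
        device = line.rstrip(":")
    m = re.search(r"Events\s*:\s*(\d+)", line)
    if m and device:
        events[device] = int(m.group(1))

# Members behind the highest counter are the ones that failed out first.
stale = [d for d, e in events.items() if e < max(events.values())]
print(events)   # {'/dev/sdh3': 370, '/dev/sdi3': 376, '/dev/sdj3': 376}
print(stale)    # ['/dev/sdh3']
```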

 


root@XP:~# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Sat Nov 16 12:10:31 2019
     Raid Level : raid5
     Array Size : 11711401088 (11168.86 GiB 11992.47 GB)
  Used Dev Size : 5855700544 (5584.43 GiB 5996.24 GB)
   Raid Devices : 3
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 19 22:13:04 2020
          State : clean, FAILED
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : XPEH:2
           UUID : 22a4b5c5:8103a815:1de617b2:3f23ee03
         Events : 376

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      131        1      active sync   /dev/sdi3
       -       0        0        2      removed
 

root@XP:~# mdadm --examine /dev/sd[hij]3 | egrep 'Event|/dev/sd'
mdadm: No md superblock detected on /dev/sdh3.
mdadm: No md superblock detected on /dev/sdj3.
/dev/sdi3:
         Events : 376
 

 

Thank you

 


So the two missing RAID5 disks exist but the partition looks badly damaged.  Is the partition structure even still present for the RAID5?

 

# fdisk -l /dev/sdh

# fdisk -l /dev/sdj

 


I think so (like on pic 1)

 

root@XP:~# fdisk -l /dev/sdh
Disk /dev/sdh: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 987F18E3-DCA2-431A-9174-AADC0F9C53EC

Device       Start         End     Sectors  Size Type
/dev/sdh1     2048     4982527     4980480  2.4G Linux RAID
/dev/sdh2  4982528     9176831     4194304    2G Linux RAID
/dev/sdh3  9437184 11720840351 11711403168  5.5T Linux RAID
 

root@XP:~# fdisk -l /dev/sdj
Disk /dev/sdj: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D6A7A561-92BA-4C7C-B9AA-7C6DA547F406

Device       Start         End     Sectors  Size Type
/dev/sdj1     2048     4982527     4980480  2.4G Linux RAID
/dev/sdj2  4982528     9176831     4194304    2G Linux RAID
/dev/sdj3  9437184 11720840351 11711403168  5.5T Linux RAID
 


Ok, here's where things are.  The system knows it has a RAID5 array and sees one valid array member. Something has happened to the other two drives that seems to have overwritten the MD superblocks that identify partition #3 as part of the RAID5 array.

 

There is no graceful way of recovering this.  But we can force the system to think that those disks are part of the array.  This is a non-reversible operation and may still not result in the recovery of any data.  For example, if the filesystem base structures were overwritten, it may be impossible to get the volume to mount.  Or, some of your files or directories may be corrupted or missing.  Needless to say, I hope you have this data backed up somewhere.

 

So before you decide to do any of that, let's get some answers to the following:

 

1. What happened before this, or what caused it?  You said "unknown reasons," but is there any information at all about the circumstances?

2. Was anything deliberately done to remove the /dev/md0 and /dev/md1 partition members for the SSDs?

3. Anything else you think is relevant should you decide to try to brute-force recover your array?


What happened before? I think the volume was overflowing. Web access to the storage returned a message like "the system cannot display the page" (Synology's error message, not a standard browser error).

After a reboot the system asked me to install DSM again. After installation I got this situation.

No additional actions were performed.

I agree that manually adding the disks back to the RAID is the only option with some chance of success. I would be grateful if you could tell me how to do this.

Thank you

8 hours ago, peterzil said:

Does someone have a technical manual for DSM?

 

DSM just uses the normal mdadm and LVM stuff from Linux.

 

https://wiki.archlinux.org/index.php/RAID

 

https://www.thomas-krenn.com/en/wiki/Mdadm_recovery_and_resync

https://www.thomas-krenn.com/en/wiki/Mdadm_recover_degraded_Array_procedure

 

https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation

 

8 hours ago, peterzil said:

 

I want to try to add the disks to the RAID via the Linux configuration files.

 

If you have never done this before, I'd suggest waiting for flyride, or at least reading one or two of his threads where he helped people with recovery.

It's most important to know the reason for the failure so you can prevent it from happening again, especially when you are doing things that can't be reverted and you don't have a backup.

In really important recovery cases you would make an image file of every disk, so you can try more than one thing (and it would be running on approved and tested hardware).

 

I do remember a case where he tried to help someone and the hardware in question did not work reliably. An interesting read, as long as it's not your own data on those disks and you are just on the sidelines watching (and learning).


Sorry, I've had a crazy work schedule for the last couple of days and haven't been able to get back to this. It isn't rocket science, but I will post more detailed instructions when I get back from work in about 8 hours.


I had a few minutes, so here's a plan:

1. Retrieve the current array to filesystem relationship

2. Stop the array

3. Force (re-)create the array

4. Check the array for proper configuration before doing anything else (or report the exact failure response)

 

Assumptions based on the prior posts:

1. Array members are on disks h (8), i (9), j (10) and the array is ordered in that sequence

2. Data corruption has at least damaged the array superblocks (/dev/md3 RAID5), but the extent is unknown

 

Comments and caveats:
Note that this is an irreversible operation. Any metadata on the disks containing the array state will be overwritten.  Files on the disks are not damaged by this operation (so you could, in theory, send for forensic recovery still). It's possible that the create operation will fail without zeroing the array superblocks first.  I don't like doing that unless it's absolutely necessary.
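One extra safety step (my own suggestion, not part of the procedure above): before the irreversible create, save a copy of the region that holds each member's v1.2 md superblock, which sits 4 KiB from the start of the partition. The sketch below uses a throwaway file so it runs end to end; on the real system the input would be /dev/sdh3, /dev/sdi3, /dev/sdj3 and the backup would go to a different disk.

```shell
#!/bin/sh
# Sketch: back up the md v1.2 superblock region of an array member before a
# destructive mdadm --create. Metadata 1.2 lives 4 KiB into the partition.
# All paths are placeholders; real input would be /dev/sdX3.
set -e

PART=/tmp/fake_sdi3.img       # stands in for /dev/sdi3
BACKUP=/tmp/sdi3_md_super.bin

# Stand-in "partition" so the sketch is runnable.
dd if=/dev/urandom of="$PART" bs=4096 count=64 2>/dev/null

# Save 64 KiB starting at the 4 KiB offset, which covers the superblock.
dd if="$PART" of="$BACKUP" bs=4096 skip=1 count=16 2>/dev/null

ls -l "$BACKUP"
```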

 

If corruption is extensive, the array will start but it will not be possible to mount the filesystem (we'll try and check that after the array creates correctly).

 

Feel free to question, research, verify this suggestion prior to executing. At the end of the day it's your data, and your decision to follow the free advice you obtain here.  Again, I hope you have a backup, because we already know there is some amount of data loss.

 

Commands, execute in sequence:

# cat /etc/fstab

# mdadm --stop /dev/md3
# mdadm -v --create --assume-clean -e1.2 -n3 -l5 /dev/md3 /dev/sdh3 /dev/sdi3 /dev/sdj3 -u22a4b5c5:8103a815:1de617b2:3f23ee03

# cat /proc/mdstat

 

Post output, including error messages from each of these.


Thank you. 

This is the log of the commands:

 

root@XP:~# cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/md3 /volume2 btrfs  0 0
/dev/mapper/cachedev_0 /volume1 ext4 usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0,synoacl,relatime 0 0
root@XP:~# mdadm --stop /dev/md3
mdadm: stopped /dev/md3
root@XP:~# mdadm -v --create --assume-clean -e1.2 -n3 -l5 /dev/md3 /dev/sdh3 /dev/sdi3 /dev/sdj3 -u22a4b5c5:8103a815:1de617b2:3f23ee03
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 64K
mdadm: /dev/sdi3 appears to be part of a raid array:
       level=raid5 devices=3 ctime=Sat Nov 16 12:10:31 2019
mdadm: size set to 5855700544K
Continue creating array? y
mdadm: array /dev/md3 started.
root@XP:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
md3 : active raid5 sdj3[2] sdi3[1] sdh3[0]
      11711401088 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md4 : active raid1 sda1[0] sdb1[1]
      117216192 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdg3[0]
      3902196544 blocks super 1.2 [1/1] [U]

md1 : active raid1 sdg2[0] sdh2[1] sdi2[2] sdj2[3]
      2097088 blocks [12/4] [UUUU________]

md0 : active raid1 sdg1[0] sdh1[1] sdi1[2] sdj1[3]
      2490176 blocks [12/4] [UUUU________]

unused devices: <none>
 

The storage pool was repaired, but the volume is still "crashed"

 

 

(Attached screenshots: 2020-10-24 22_25_04-XP - Synology DiskStation.png, 2020-10-24 22_25_12-XP - Synology DiskStation.png)


Yep.  There is corruption to the extent that DSM thinks it must be an ext4 volume because it cannot find the initial btrfs superblock.
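For reference, here is a sketch of how one might check whether that initial superblock is intact. The btrfs primary superblock lives at byte 65536 of the device, and its magic string "_BHRfS_M" sits 64 bytes into the structure (after the csum, fsid, bytenr, and flags fields; offsets quoted from the btrfs on-disk format as I recall it). The demo runs against a small fake image rather than /dev/md3.

```python
# Sketch: test for the btrfs magic in the primary superblock. Field offsets
# are per the btrfs on-disk format (superblock at 64 KiB, magic 64 bytes in).
MAGIC = b"_BHRfS_M"
SUPER_OFFSET = 65536
MAGIC_OFFSET_IN_SUPER = 64

def has_btrfs_super(path):
    with open(path, "rb") as f:
        f.seek(SUPER_OFFSET + MAGIC_OFFSET_IN_SUPER)
        return f.read(len(MAGIC)) == MAGIC

# Demonstrate on a small fake image instead of the real device.
with open("/tmp/fake_md3.img", "wb") as f:
    f.write(b"\x00" * (SUPER_OFFSET + MAGIC_OFFSET_IN_SUPER))
    f.write(MAGIC)

print(has_btrfs_super("/tmp/fake_md3.img"))   # True
```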

 

Your fstab says it was previously mounted as a btrfs volume; do you concur with that?  If so, try to recover the btrfs superblock with:

 

# btrfs rescue super-recover -v /dev/md3

 

If it errors out, post the error.  If it suggests that it may have fixed the superblock, try mounting the volume in recovery mode:

 

# mount -vs -t btrfs -o ro,recovery,errors=continue /dev/md3 /volume2


root@XP:~# btrfs ins dump-super -fFa /dev/md3
superblock: bytenr=65536, device=/dev/md3
---------------------------------------------------------
btrfs: ctree.h:2183: btrfs_super_csum_size: Assertion `!(t >= (sizeof(btrfs_csum_sizes) / sizeof((btrfs_csum_sizes)[0])))' failed.
csum                    0xAborted (core dumped)
 


Ok, btrfs is crashing before we have tested all three superblocks.  So let's try and reach the other two directly:

 

# btrfs ins dump-super -Ffs 67108864 /dev/md3

# btrfs ins dump-super -Ffs 274877906944 /dev/md3
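The two byte offsets in those commands are not arbitrary: btrfs keeps its superblock copies at fixed positions of 64 KiB (primary), 64 MiB, and 256 GiB on devices large enough to hold them. A one-liner confirms the arithmetic:

```python
# Sketch: derive the fixed btrfs superblock mirror offsets used above.
KIB, MIB, GIB = 1024, 1024**2, 1024**3
offsets = [64 * KIB, 64 * MIB, 256 * GIB]
print(offsets)   # [65536, 67108864, 274877906944]
```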


This is the log of the commands:

 

root@XP:~# btrfs ins dump-super -Ffs 67108864 /dev/md3
superblock: bytenr=67108864, device=/dev/md3
---------------------------------------------------------
csum                    0x00000000 [DON'T MATCH]
bytenr                  0
flags                   0x0
magic                   ........ [DON'T MATCH]
fsid                    00000000-0000-0000-0000-000000000000
label
generation              0
root                    0
sys_array_size          0
chunk_root_generation   0
root_level              0
chunk_root              0
chunk_root_level        0
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             0
bytes_used              0
sectorsize              0
nodesize                0
leafsize                0
stripesize              0
root_dir                0
num_devices             0
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x0
csum_type               0
csum_size               4
cache_generation        0
uuid_tree_generation    0
dev_item.uuid           00000000-0000-0000-0000-000000000000
dev_item.fsid           00000000-0000-0000-0000-000000000000 [match]
dev_item.type           0
dev_item.total_bytes    0
dev_item.bytes_used     0
dev_item.io_align       0
dev_item.io_width       0
dev_item.sector_size    0
dev_item.devid          0
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
sys_chunk_array[2048]:
backup_roots[4]:
 

 

root@XP:~# btrfs ins dump-super -Ffs 274877906944 /dev/md3
superblock: bytenr=274877906944, device=/dev/md3
---------------------------------------------------------
btrfs: ctree.h:2183: btrfs_super_csum_size: Assertion `!(t >= (sizeof(btrfs_csum_sizes) / sizeof((btrfs_csum_sizes)[0])))' failed.
csum                    0xAborted (core dumped)
 

