sublimnl Posted September 6, 2023 (edited)

I've just recently migrated to xpenology with DSM 7.2, and things went seriously wrong while expanding a storage pool. Here is the run-down of what I have done so far...

I was moving off of an unraid array with 4 disks to xpenology. I freed up one of the disks on my physical unraid machine and moved it to my new xpenology VM on ESXi 8. I attached that disk to the xpenology VM as an RDM (via a spare 4-port USB3 enclosure I had lying around), created a new SHR volume, and copied all my data over. Then I freed another disk from unraid, moved it to the xpenology VM, added it to the storage pool, and waited for it to sync up - great, now I have a RAID1 array on xpenology with all my data.

Feeling confident that my data was now protected in xpenology, I moved the remaining 2 disks from my unraid array into xpenology. Now all 4 disks are in my USB3 enclosure, each mapped as an RDM to my xpenology VM. I went into DSM and started pool expansion using the 2 newly added disks. The expansion ran for 12 hours and was maybe 20% complete. At that point I started looking into why it was taking so long and found out that I had accidentally attached the enclosure to a USB 2.0 port. Whoops.

I did some reading and found that I could safely shut down the xpenology VM via the shutdown option in DSM and it should resume the expansion when powered back on. I shut it down, moved the enclosure to a USB3 port, and remapped the RDMs, being careful to make sure they were attached to the VM on the exact same SATA addresses. Booted back up. As advertised, the expansion picked up where it left off and was chugging along much faster - now estimating another 15 hours to finish, which I felt much better about.

After about 2 hours I got an email saying disk 1 (the one I originally created the pool on) had crashed. This was in the middle of my work day, so I didn't have time to investigate right then and there.
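For anyone following along, the RDM mapping step above is done on the ESXi host with `vmkfstools`. This is just a hedged sketch - the device ID and datastore path below are made-up placeholders (real device IDs are listed under /vmfs/devices/disks/), and the command is echoed rather than executed; drop the `echo` to actually create the mapping file:

```shell
# Dry run: print the vmkfstools command that creates a physical-mode
# RDM pointer file for one disk (-z = physical compatibility mode).
DISK=naa.5000c500deadbeef                    # placeholder device ID
RDMDIR=/vmfs/volumes/datastore1/xpenology    # placeholder datastore path
echo vmkfstools -z "/vmfs/devices/disks/$DISK" "$RDMDIR/$DISK-rdm.vmdk"
```

The resulting .vmdk is then attached to the VM like any other disk, which is also where the SATA address you pick matters if you ever re-map it.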
I did see that the expansion was still going and I could still access my data, so at this point I crossed my fingers that it would complete and I would at least have an SHR pool with 3 disks at the end. A couple hours later I got another email saying the ENTIRE POOL had crashed. WTF. After some investigating I found that ESX had completely dropped the USB connection to the enclosure, and I couldn't even see the devices anymore from ESX's perspective.

I have now gotten the USB connection stable again, but I cannot even boot into DSM. I noticed that I can ping the IP address of the VM while it is booting, but once it starts loading the kernel the pings drop and never come back (presumably because DSM never finishes initializing). I had noticed this ping behavior previously while the system was healthy: the system does an initial boot which brings networking online, then pings drop again while DSM loads until everything is up and running. So I guess that part is normal behavior even on a healthy system.

It seems like something is going wrong while DSM is loading, but without networking I have no way to see what's going on inside the VM. I can see that CPU is steady at about 30% when this happens, like it is stuck in a loop of some sort, and there is no disk activity at that time. Any ideas on what steps I should take from here?
sublimnl Posted September 6, 2023 (Author, edited)

OK, I managed to SSH into the xpenology VM. Here is some output from there, which can hopefully shed some light on what needs to be done next. It seems like it is still trying to reshape the array (46.5% (903431996/1942787584) finish=109615.1min speed=158K/sec), but I don't think it's actually doing anything, as it only reports 158K/sec and I don't see any activity on the disks themselves.

If I am interpreting the output correctly, it looks like it thinks disk 0 is missing and only 1 and 2 are present, even though disk 0 should be /dev/sdb5, which IS present and online. /dev/sdb is the disk that I originally copied all my data to (and then created a RAID1 with /dev/sdc5). /dev/sdb is also the disk that went offline during the expansion, causing things to fall apart. I also see the reshape positions are different between the three disks, which doesn't seem like a great thing. Hoping someone out there can help 🙏

# fdisk -l
Disk /dev/sda: 4 GiB, 4294967296 bytes, 8388608 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xf110ee87

Device     Boot  Start     End Sectors Size Id Type
/dev/sda1  *      2048  149503  147456  72M 83 Linux
/dev/sda2       149504  301055  151552  74M 83 Linux
/dev/sda3       301056 8388607 8087552 3.9G 83 Linux

Disk /dev/sdb: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xdf5751bc

Device     Boot    Start        End    Sectors Size Id Type
/dev/sdb1           8192   16785407   16777216   8G fd Linux raid autodetect
/dev/sdb2       16785408   20979711    4194304   2G fd Linux raid autodetect
/dev/sdb3       21241856 3907027967 3885786112 1.8T  f W95 Ext'd (LBA)
/dev/sdb5       21257952 3906835231 3885577280 1.8T fd Linux raid autodetect

Disk /dev/sde: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x950f1fed

Device     Boot    Start        End    Sectors Size Id Type
/dev/sde1           8192   16785407   16777216   8G fd Linux raid autodetect
/dev/sde2       16785408   20979711    4194304   2G fd Linux raid autodetect
/dev/sde3       21241856 3907027967 3885786112 1.8T  f W95 Ext'd (LBA)
/dev/sde5       21257952 3906835231 3885577280 1.8T fd Linux raid autodetect

Disk /dev/sdc: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xfb2b79ed

Device     Boot    Start        End    Sectors Size Id Type
/dev/sdc1           8192   16785407   16777216   8G fd Linux raid autodetect
/dev/sdc2       16785408   20979711    4194304   2G fd Linux raid autodetect
/dev/sdc3       21241856 3907027967 3885786112 1.8T  f W95 Ext'd (LBA)
/dev/sdc5       21257952 3906835231 3885577280 1.8T fd Linux raid autodetect

Disk /dev/sdd: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x6466470a

Device     Boot    Start        End    Sectors Size Id Type
/dev/sdd1           8192   16785407   16777216   8G fd Linux raid autodetect
/dev/sdd2       16785408   20979711    4194304   2G fd Linux raid autodetect
/dev/sdd3       21241856 3907027967 3885786112 1.8T  f W95 Ext'd (LBA)
/dev/sdd5       21257952 3906835231 3885577280 1.8T fd Linux raid autodetect

# mdadm --examine /dev/sd[bcde]5
/dev/sdb5:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : e8adb5a7:6b89d8ea:fa38fa83:5679ef65
           Name : Synology:3
  Creation Time : Sat Sep  2 04:39:01 2023
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 3885575232 sectors (1852.79 GiB 1989.41 GB)
     Array Size : 3885575168 KiB (3.62 TiB 3.98 TB)
  Used Dev Size : 3885575168 sectors (1852.79 GiB 1989.41 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=64 sectors
          State : active
    Device UUID : 25b67ffe:8fd2a125:53e01581:86f9f6c8
  Reshape pos'n : 206338560 (196.78 GiB 211.29 GB)
  Delta Devices : 1 (2->3)
    Update Time : Mon Sep  4 09:10:04 2023
       Checksum : 2e1c8576 - correct
         Events : 7508
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 0
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc5:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : e8adb5a7:6b89d8ea:fa38fa83:5679ef65
           Name : Synology:3
  Creation Time : Sat Sep  2 04:39:01 2023
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 3885575232 sectors (1852.79 GiB 1989.41 GB)
     Array Size : 3885575168 KiB (3.62 TiB 3.98 TB)
  Used Dev Size : 3885575168 sectors (1852.79 GiB 1989.41 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=64 sectors
          State : clean
    Device UUID : 45756199:fb7c404b:1618442d:7b222630
  Reshape pos'n : 1806781824 (1723.08 GiB 1850.14 GB)
  Delta Devices : 1 (2->3)
    Update Time : Tue Sep  5 10:47:35 2023
       Checksum : c0c67e67 - correct
         Events : 15527
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 1
    Array State : .AA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd5:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : e8adb5a7:6b89d8ea:fa38fa83:5679ef65
           Name : Synology:3
  Creation Time : Sat Sep  2 04:39:01 2023
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 3885575232 sectors (1852.79 GiB 1989.41 GB)
     Array Size : 3885575168 KiB (3.62 TiB 3.98 TB)
  Used Dev Size : 3885575168 sectors (1852.79 GiB 1989.41 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=64 sectors
          State : clean
    Device UUID : 69d57215:5811d191:b6ef3eaf:4cc976fb
  Reshape pos'n : 1806781824 (1723.08 GiB 1850.14 GB)
  Delta Devices : 1 (2->3)
    Update Time : Tue Sep  5 10:47:35 2023
       Checksum : d0b3f15b - correct
         Events : 15527
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 2
    Array State : .AA ('A' == active, '.' == missing, 'R' == replacing)
mdadm: No md superblock detected on /dev/sde5.

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdc5[1] sdd5[2]
      1942787584 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [=========>...........]  reshape = 46.5% (903431996/1942787584) finish=109615.1min speed=158K/sec

unused devices: <none>

# mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Sat Sep  2 04:39:01 2023
     Raid Level : raid5
     Array Size : 1942787584 (1852.79 GiB 1989.41 GB)
  Used Dev Size : 1942787584 (1852.79 GiB 1989.41 GB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Sep  6 03:08:54 2023
          State : clean, degraded, reshaping
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

Consistency Policy : resync
 Reshape Status : 46% complete
  Delta Devices : 1, (2->3)

           Name : Synology:3
           UUID : e8adb5a7:6b89d8ea:fa38fa83:5679ef65
         Events : 15528

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       37        1      active sync   /dev/sdc5
       2       8       53        2      active sync   /dev/sdd5
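The telling detail in the `--examine` output above is the Events counter: sdb5 stopped at 7508 while sdc5 and sdd5 reached 15527, which is why the kernel kicked sdb5 out as stale, and its reshape position lags the others for the same reason. As a hedged convenience for eyeballing this (it only reads superblock text, it is not a repair step, and the function name is made up), a small filter can pull those fields out per member:

```shell
# Summarize the per-member superblock fields that matter here: event count
# and reshape position. Reads `mdadm --examine` output on stdin.
summarize_members() {
  awk '/^\/dev\//    { dev = $1; sub(/:$/, "", dev) }
       /Events :/    { ev[dev] = $NF }
       /Reshape pos/ { rp[dev] = $4 }
       END { for (d in ev) printf "%s events=%s reshape_pos=%s\n", d, ev[d], rp[d] }'
}
# Usage: mdadm --examine /dev/sd[bcd]5 | summarize_members
```

A member whose event count and reshape position trail the rest is the one the array considers out of date.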
vbz14216 Posted September 7, 2023

That looks like a hot mess. I have a feeling your USB3 enclosure was acting up under the heavy load generated by the RAID reshaping. IMO I would stop tampering with the array, image the individual drives to raw images, and see if the data can be mounted by an ordinary Linux distro. Too bad I'm not an mdadm specialist - hopefully one of the specialists here can help you. Good luck.
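The imaging step suggested above would look something like the following with GNU ddrescue. This is a hedged sketch, not a prescription: the destination path is a made-up placeholder (you need roughly 8 TB free for four 2 TB disks), and the commands are printed rather than run so nothing is touched by accident:

```shell
# Dry run: print the ddrescue commands that would image each member disk.
# /mnt/recovery is a placeholder destination; remove the echo to actually copy.
DEST=/mnt/recovery
for d in sdb sdc sdd sde; do
  # -n skips the slow scrape pass on the first run; the .map file lets
  # ddrescue resume later and retry any bad areas.
  echo ddrescue -n "/dev/$d" "$DEST/$d.img" "$DEST/$d.map"
done
```

Working from the images (or loop devices backed by them) means any further mdadm experiments can't make the original disks worse.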
sublimnl Posted September 27, 2023 (Author)

In case someone else runs across this topic in the future: I WAS ABLE TO GET EVERYTHING BACK. I had to use Recovery Explorer Pro to get the data back after the failed restriping. At first it seemed like it could not recover the data (though it could at least list file names), so I reached out to their support team and they offered to have a look over TeamViewer. The guy took about 10 minutes sussing out the drives and changing some recovery settings that were over my head, despite my having worked at all levels of IT infrastructure for 25 years now; it was cool to watch him piece it back together. In the end I left it scanning overnight and was able to fully recover everything. Maybe 10 files out of over 300,000 were corrupted. Highly recommend these guys if someone else ends up in the same boat.