sublimnl Posted September 6, 2023 (edited)

I've just recently migrated to xpenology with DSM 7.2, and things went seriously wrong while expanding a storage pool. Here is the run-down of what I have done so far...

I was moving off of an unraid array with 4 disks to xpenology. I freed up one of the disks on my physical unraid machine and moved it to my new xpenology VM on ESXi 8. I attached that disk to the xpenology VM as an RDM (via a spare 4-port USB3 enclosure I had lying around), created a new SHR volume, and copied all my data over. Then I freed another disk from unraid, moved it to the xpenology VM, added it to the storage pool, and waited for it to sync up - great, now I have a RAID1 array on xpenology with all my data.

Feeling confident that my data was now protected in xpenology, I moved the remaining 2 disks from my unraid array into xpenology. Now all 4 disks are in my USB3 enclosure, each mapped as an RDM to my xpenology VM. I went into DSM and started pool expansion using the 2 newly added disks. The expansion ran for 12 hours and was maybe 20% complete. At that point I started looking into why it was taking so long and found out that I had accidentally attached the enclosure to a USB 2.0 port. Whoops.

I did some reading and found that I could safely shut down the xpenology VM via the shutdown option in DSM and it should resume the expansion when powered back on. I shut it down, moved the enclosure to a USB3 port, and remapped the RDMs, being careful to make sure they were attached to the VM on the exact same SATA addresses. Booted back up. As advertised, the expansion picked up where it left off and was chugging along much faster - now estimating another 15 hours to finish, which I felt much better about.

After about 2 hours I got an email saying disk 1 (the one I originally created the pool on) had crashed. This was in the middle of my work day, so I didn't have time to investigate right then and there.
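For anyone following along, the RDM mapping step above is done on the ESXi host with `vmkfstools`. This is just a hedged sketch - the device ID and datastore path below are made-up placeholders (real device IDs are listed under /vmfs/devices/disks/), and the command is echoed rather than executed; drop the `echo` to actually create the mapping file:

```shell
# Dry run: print the vmkfstools command that creates a physical-mode
# RDM pointer file for one disk (-z = physical compatibility mode).
DISK=naa.5000c500deadbeef                    # placeholder device ID
RDMDIR=/vmfs/volumes/datastore1/xpenology    # placeholder datastore path
echo vmkfstools -z "/vmfs/devices/disks/$DISK" "$RDMDIR/$DISK-rdm.vmdk"
```

The resulting .vmdk is then attached to the VM like any other disk, which is also where the SATA address you pick matters if you ever re-map it.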
I did see that the expansion was still going and I could still access my data, so at this point I crossed my fingers that it would complete and I would at least have an SHR pool with 3 disks at the end. A couple hours later I got another email saying the ENTIRE POOL had crashed. WTF. After some investigating I found that ESX had completely dropped the USB connection to the enclosure, and I couldn't even see the devices anymore from ESX's perspective.

I have now gotten the USB connection stable again, but I cannot even boot into DSM. I noticed that I can ping the IP address of the VM while it is booting, but once it starts loading the kernel the pings drop and never come back (presumably because DSM never finishes initializing). I had noticed this ping behavior previously while the system was healthy: the system does an initial boot which brings networking online, then pings drop again while DSM loads until everything is up and running. So I guess that part is normal behavior even on a healthy system.

It seems like something is going wrong while DSM is loading, but without networking I have no way to see what's going on inside the VM. I can see that CPU is steady at about 30% when this happens, like it is stuck in a loop of some sort, and there is no disk activity at that time. Any ideas on what steps I should take from here?
sublimnl Posted September 6, 2023 (Author, edited)

OK, I managed to SSH into the xpenology VM. Here is some output from there, which can hopefully shed some light on what needs to be done next. It seems like it is still trying to reshape the array (46.5% (903431996/1942787584) finish=109615.1min speed=158K/sec), but I don't think it's actually doing anything, as it only reports 158K/sec and I don't see any activity on the disks themselves.

If I am interpreting the output correctly, it looks like it thinks disk 0 is missing and only 1 and 2 are present, even though disk 0 should be /dev/sdb5, which IS present and online. /dev/sdb is the disk that I originally copied all my data to (and then created a RAID1 with /dev/sdc5). /dev/sdb is also the disk that went offline during the expansion, causing things to fall apart. I also see the reshape positions are different between the three disks, which doesn't seem like a great thing. Hoping someone out there can help 🙏

# fdisk -l
Disk /dev/sda: 4 GiB, 4294967296 bytes, 8388608 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xf110ee87

Device     Boot  Start     End Sectors Size Id Type
/dev/sda1  *      2048  149503  147456  72M 83 Linux
/dev/sda2       149504  301055  151552  74M 83 Linux
/dev/sda3       301056 8388607 8087552 3.9G 83 Linux

Disk /dev/sdb: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xdf5751bc

Device     Boot    Start        End    Sectors Size Id Type
/dev/sdb1           8192   16785407   16777216   8G fd Linux raid autodetect
/dev/sdb2       16785408   20979711    4194304   2G fd Linux raid autodetect
/dev/sdb3       21241856 3907027967 3885786112 1.8T  f W95 Ext'd (LBA)
/dev/sdb5       21257952 3906835231 3885577280 1.8T fd Linux raid autodetect

Disk /dev/sde: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x950f1fed

Device     Boot    Start        End    Sectors Size Id Type
/dev/sde1           8192   16785407   16777216   8G fd Linux raid autodetect
/dev/sde2       16785408   20979711    4194304   2G fd Linux raid autodetect
/dev/sde3       21241856 3907027967 3885786112 1.8T  f W95 Ext'd (LBA)
/dev/sde5       21257952 3906835231 3885577280 1.8T fd Linux raid autodetect

Disk /dev/sdc: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xfb2b79ed

Device     Boot    Start        End    Sectors Size Id Type
/dev/sdc1           8192   16785407   16777216   8G fd Linux raid autodetect
/dev/sdc2       16785408   20979711    4194304   2G fd Linux raid autodetect
/dev/sdc3       21241856 3907027967 3885786112 1.8T  f W95 Ext'd (LBA)
/dev/sdc5       21257952 3906835231 3885577280 1.8T fd Linux raid autodetect

Disk /dev/sdd: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x6466470a

Device     Boot    Start        End    Sectors Size Id Type
/dev/sdd1           8192   16785407   16777216   8G fd Linux raid autodetect
/dev/sdd2       16785408   20979711    4194304   2G fd Linux raid autodetect
/dev/sdd3       21241856 3907027967 3885786112 1.8T  f W95 Ext'd (LBA)
/dev/sdd5       21257952 3906835231 3885577280 1.8T fd Linux raid autodetect

# mdadm --examine /dev/sd[bcde]5
/dev/sdb5:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : e8adb5a7:6b89d8ea:fa38fa83:5679ef65
           Name : Synology:3
  Creation Time : Sat Sep  2 04:39:01 2023
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 3885575232 sectors (1852.79 GiB 1989.41 GB)
     Array Size : 3885575168 KiB (3.62 TiB 3.98 TB)
  Used Dev Size : 3885575168 sectors (1852.79 GiB 1989.41 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=64 sectors
          State : active
    Device UUID : 25b67ffe:8fd2a125:53e01581:86f9f6c8
  Reshape pos'n : 206338560 (196.78 GiB 211.29 GB)
  Delta Devices : 1 (2->3)
    Update Time : Mon Sep  4 09:10:04 2023
       Checksum : 2e1c8576 - correct
         Events : 7508
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 0
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc5:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : e8adb5a7:6b89d8ea:fa38fa83:5679ef65
           Name : Synology:3
  Creation Time : Sat Sep  2 04:39:01 2023
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 3885575232 sectors (1852.79 GiB 1989.41 GB)
     Array Size : 3885575168 KiB (3.62 TiB 3.98 TB)
  Used Dev Size : 3885575168 sectors (1852.79 GiB 1989.41 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=64 sectors
          State : clean
    Device UUID : 45756199:fb7c404b:1618442d:7b222630
  Reshape pos'n : 1806781824 (1723.08 GiB 1850.14 GB)
  Delta Devices : 1 (2->3)
    Update Time : Tue Sep  5 10:47:35 2023
       Checksum : c0c67e67 - correct
         Events : 15527
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 1
    Array State : .AA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd5:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : e8adb5a7:6b89d8ea:fa38fa83:5679ef65
           Name : Synology:3
  Creation Time : Sat Sep  2 04:39:01 2023
     Raid Level : raid5
   Raid Devices : 3
 Avail Dev Size : 3885575232 sectors (1852.79 GiB 1989.41 GB)
     Array Size : 3885575168 KiB (3.62 TiB 3.98 TB)
  Used Dev Size : 3885575168 sectors (1852.79 GiB 1989.41 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=64 sectors
          State : clean
    Device UUID : 69d57215:5811d191:b6ef3eaf:4cc976fb
  Reshape pos'n : 1806781824 (1723.08 GiB 1850.14 GB)
  Delta Devices : 1 (2->3)
    Update Time : Tue Sep  5 10:47:35 2023
       Checksum : d0b3f15b - correct
         Events : 15527
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 2
    Array State : .AA ('A' == active, '.' == missing, 'R' == replacing)
mdadm: No md superblock detected on /dev/sde5.

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdc5[1] sdd5[2]
      1942787584 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [_UU]
      [=========>...........]  reshape = 46.5% (903431996/1942787584) finish=109615.1min speed=158K/sec

unused devices: <none>

# mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Sat Sep  2 04:39:01 2023
     Raid Level : raid5
     Array Size : 1942787584 (1852.79 GiB 1989.41 GB)
  Used Dev Size : 1942787584 (1852.79 GiB 1989.41 GB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Sep  6 03:08:54 2023
          State : clean, degraded, reshaping
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

Consistency Policy : resync
 Reshape Status : 46% complete
  Delta Devices : 1, (2->3)

           Name : Synology:3
           UUID : e8adb5a7:6b89d8ea:fa38fa83:5679ef65
         Events : 15528

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       37        1      active sync   /dev/sdc5
       2       8       53        2      active sync   /dev/sdd5
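The telling detail in the `--examine` output above is the Events counter: sdb5 stopped at 7508 while sdc5 and sdd5 reached 15527, which is why the kernel kicked sdb5 out as stale, and its reshape position lags the others for the same reason. As a hedged convenience for eyeballing this (it only reads superblock text, it is not a repair step, and the function name is made up), a small filter can pull those fields out per member:

```shell
# Summarize the per-member superblock fields that matter here: event count
# and reshape position. Reads `mdadm --examine` output on stdin.
summarize_members() {
  awk '/^\/dev\//    { dev = $1; sub(/:$/, "", dev) }
       /Events :/    { ev[dev] = $NF }
       /Reshape pos/ { rp[dev] = $4 }
       END { for (d in ev) printf "%s events=%s reshape_pos=%s\n", d, ev[d], rp[d] }'
}
# Usage: mdadm --examine /dev/sd[bcd]5 | summarize_members
```

A member whose event count and reshape position trail the rest is the one the array considers out of date.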
vbz14216 Posted September 7, 2023

That looks like a hot mess. I have a feeling your USB3 enclosure was acting up under the heavy load generated by the RAID reshaping. IMO I would stop tampering with the array, image the individual drives to raw images, and see if the data can be mounted by an ordinary Linux distro. Too bad I'm not an mdadm specialist - hopefully one of the specialists here can help you. Good luck.
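The imaging step suggested above would look something like the following with GNU ddrescue. This is a hedged sketch, not a prescription: the destination path is a made-up placeholder (you need roughly 8 TB free for four 2 TB disks), and the commands are printed rather than run so nothing is touched by accident:

```shell
# Dry run: print the ddrescue commands that would image each member disk.
# /mnt/recovery is a placeholder destination; remove the echo to actually copy.
DEST=/mnt/recovery
for d in sdb sdc sdd sde; do
  # -n skips the slow scrape pass on the first run; the .map file lets
  # ddrescue resume later and retry any bad areas.
  echo ddrescue -n "/dev/$d" "$DEST/$d.img" "$DEST/$d.map"
done
```

Working from the images (or loop devices backed by them) means any further mdadm experiments can't make the original disks worse.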
sublimnl Posted September 27, 2023 (Author)

In case someone else runs across this topic in the future: I WAS ABLE TO GET EVERYTHING BACK. I had to use Recovery Explorer Pro to get the data back after the failed restriping. At first it seemed like it could not recover the data (though it could at least list file names), so I reached out to their support team and they offered to have a look over TeamViewer. The guy took about 10 minutes sussing out the drives and changing some recovery settings that were over my head, despite my having worked at all levels of IT infrastructure for 25 years now; it was cool to watch him piece it back together. In the end I left it scanning overnight and was able to fully recover everything. Maybe 10 files out of over 300,000 were corrupted. Highly recommend these guys if someone else ends up in the same boat.