XPEnology Community
  • 0

Volume Crashed Looking to Recover Data


drcrow_

Question

I have been trying to troubleshoot my volume crash myself, but I am at my wits' end. I am hoping someone can shed some light on what my issue is and how to fix it.

 

A couple of weeks ago I started to receive email alerts stating, “Checksum mismatch on NAS. Please check Log Center for more details.” I hopped on my NAS WebUI and did not really see much in the logs. After checking that my systems were still functioning properly and I could access my files, I figured something was wrong but that it was not a major issue... how wrong I was.

That brings us to today, when I noticed my NAS was in read-only mode, which I thought was really odd. I tried logging into the WebUI, but after I entered my username and password, I did not get the NAS's dashboard.

I figured I would reboot the NAS, thinking it would fix the issue. I have had problems with the WebUI being buggy in the past, and a reboot always seemed to take care of it.

But after the reboot I received the dreaded email, “Volume 1 (SHR, btrfs) on NAS has crashed”. I am unable to access the WebUI. Luckily, I have SSH enabled, so I logged on to the server, and that's where we are now.

 

Some info about my system:

12 x 10TB Drives
DSM 6.1.x as a DS3617xs

1 SSD Cache

24 GB of RAM

1 x Xeon CPU

 

Here is the output of some of the commands I have already tried (I had to edit some of the outputs due to spam detection):

 

Looks like the RAID comes up as md2. It seems to have all 12 drives active, but I am not 100% sure.

Quote

ash-4.3# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
md2 : active raid6 sdb5[0] sdm5[11] sdl5[10] sdk5[9] sdj5[8] sdi5[7] sdh5[6] sdg5[5] sdf5[4] sde5[3] sdd5[2] sdc5[1]
      97615989760 blocks super 1.2 level 6, 64k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]

md3 : active raid0 sda1[0]
      250049664 blocks super 1.2 64k chunks [1/1]

md1 : active raid1 sdb2[0] sdc2[1] sdd2[2] sde2[3] sdf2[4] sdg2[5] sdh2[6] sdi2[7] sdj2[8] sdk2[9] sdl2[10] sdm2[11]
      2097088 blocks [13/12] [UUUUUUUUUUUU_]

md0 : active raid1 sdb1[0] sdc1[2] sdd1[11] sde1[10] sdf1[8] sdg1[7] sdh1[6] sdi1[5] sdj1[4] sdk1[9] sdl1[3] sdm1[1]
      2490176 blocks [12/12] [UUUUUUUUUUUU]

unused devices: <none>

 

I received an error when running this command: GPT PMBR size mismatch (102399 != 60062499) will be corrected by w(rite). I think this might have something to do with the checksum errors I was getting before.

Quote

ash-4.3# fdisk -l
Disk /dev/sda: 238.5 GiB, 256060514304 bytes, 500118192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xa9b8704b

Device     Boot Start       End   Sectors   Size Id Type
/dev/sda1        2048 500103449 500101402 238.5G fd Linux raid autodetect


GPT PMBR size mismatch (102399 != 60062499) will be corrected by w(rite).

 

 

Quote

ash-4.3# vgdisplay
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               90.91 TiB
  PE Size               4.00 MiB
  Total PE              23832028
  Alloc PE / Size       23832028 / 90.91 TiB
  Free  PE / Size       0 / 0
  VG UUID               rc3DXE-ddO3-qaOp-7gLC-6wll-hesC-yC5YFE

 

 

Quote

ash-4.3# lvdisplay -v
    Using logical volume(s) on command line.
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                NUab2g-gp1H-bmCu-Vie0-1qmK-ougT-uNop9i
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                90.91 TiB
  Current LE             23832028
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  currently set to     2560
  Block device           253:0

 

When I try to interact with the LV, it says it couldn't open the file system.

 

Quote

ash-4.3# btrfs check  /dev/vg1000/lv
Couldn't open file system  

 

I tried to unmount the LV and/or remount it, but it gives me errors saying it's not mounted, already mounted, or busy.

 

Quote

ash-4.3# umount /dev/vg1000/lv
umount: /dev/vg1000/lv: not mounted
ash-4.3# mount -o recovery /dev/vg1000/lv /volume1
mount: /dev/vg1000/lv is already mounted or /volume1 busy

 

Can anyone comment on whether it is possible to recover the data? Am I going in the right direction?

 

Any help would be greatly appreciated!


11 answers to this question

Recommended Posts

  • 0

You might want to review this thread: https://xpenology.com/forum/topic/14337-volume-crash-after-4-months-of-stability

 

And in particular, recovering files per post #14: https://xpenology.com/forum/topic/14337-volume-crash-after-4-months-of-stability/?do=findComment&comment=108021

 

Mostly btrfs tends to self-heal, but there are not a lot of easy options on Synology to fix a btrfs volume once it has corrupted.  At least, none that are documented and functional.


  • 0

Thanks for responding @flyride. I was actually using that post to help troubleshoot my issue, but I ran into problems with your recovering-files comment. When I try to run:

 

Quote

sudo btrfs check --init-extent-tree /dev/vg1000/lv

sudo btrfs check --init-csum-tree /dev/vg1000/lv

sudo btrfs check --repair /dev/vg1000/lv

 

I get the following errors:
 

Quote

ash-4.3# btrfs check --init-extent-tree /dev/vg1000/lv
Couldn't open file system
ash-4.3# btrfs check --init-csum-tree /dev/vg1000/lv
Creating a new CRC tree
Couldn't open file system
ash-4.3#  btrfs check --repair /dev/vg1000/lv
enabling repair mode
Couldn't open file system

 

I can't seem to interact with the LV. Got any other steps/commands I should try?

I was really hoping you would respond, since you seemed to help out the other guy in the thread you linked. 

 

The checksum error emails, plus the GPT PMBR size mismatch (102399 != 60062499) will be corrected by w(rite) message I get when I do an fdisk -l, make me think the issue stems from there.

 

But I am willing to give anything you suggest a try!


  • 0

Honestly, I don't think any of the repair options are likely to help.  Your LV seems to be ok, but the FS is toast.  The post I linked specifically discussed the recovery option.  The FS does not have to mount in order to use that option to recover your files.

 

Your biggest challenge will be to find enough storage to perform the recovery.  I would probably build up another NAS and NFS mount it to the problem server.


  • 0

I see what you are talking about in the other post. You mean this, right?

 

Quote

Btrfs has a special option to dump the whole filesystem to completely separate location, even if the source cannot be mounted.  So if you have a free SATA port, install an adequately sized drive, create a second storage pool and set up a second volume to use as a recovery target.  Alternatively, you could build it on another NAS and NFS mount it.  Whatever you come up with has to be directly accessible on the problem system.

 

For example's sake, let's say that you have installed and configured /volume2.  This command should extract all the files from your broken btrfs filesystem and drop them on /volume2.  Note that /volume2 can be set up as btrfs or ext4 - the filesystem type does not matter.

sudo btrfs restore /dev/vg1000/lv /volume2

 

1 hour ago, flyride said:

Your biggest challenge will be to find enough storage to perform the recovery.  I would probably build up another NAS and NFS mount it to the problem server.

 

That is my biggest problem. I have another NAS with roughly 20TB of storage and a friend's NAS with 16TB. Is there a way to restore just the data, not the entire array? Meaning, does the target for btrfs restore /dev/vg1000/lv /volume2 need to be as big as the entire volume, ~90TB, or just as big as the data I stored on it, ~35TB?

Additionally, is that link all the info there is on btrfs restore: https://btrfs.wiki.kernel.org/index.php/Restore? I was hoping for some more information.

 

Ideally, I could use part of my 20TB NAS and part of my friend's NAS.

 

BTW, thanks for your help so far. Seems kind of grim. 


  • 0

I might have misspoken when I was looking at the size of my files; I am not sure. But when I do a mdadm -D /dev/md2, I get:

 

Quote

ash-4.3# mdadm -D /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Mon Dec 17 21:03:41 2018
     Raid Level : raid6
     Array Size : 97615989760 (93093.86 GiB 99958.77 GB)
  Used Dev Size : 9761598976 (9309.39 GiB 9995.88 GB)
   Raid Devices : 12
  Total Devices : 12
    Persistence : Superblock is persistent

    Update Time : Wed Jul  3 10:50:25 2019
          State : clean
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : NAS:2  (local to host NAS)
           UUID : ddd4bb0a:74d2802b:7bc45a50:a8e617eb
         Events : 571

    Number   Major   Minor   RaidDevice State
       0       8       21        0      active sync   /dev/sdb5
       1       8       37        1      active sync   /dev/sdc5
       2       8       53        2      active sync   /dev/sdd5
       3       8       69        3      active sync   /dev/sde5
       4       8       85        4      active sync   /dev/sdf5
       5       8      101        5      active sync   /dev/sdg5
       6       8      117        6      active sync   /dev/sdh5
       7       8      133        7      active sync   /dev/sdi5
       8       8      149        8      active sync   /dev/sdj5
       9       8      165        9      active sync   /dev/sdk5
      10       8      181       10      active sync   /dev/sdl5
      11       8      197       11      active sync   /dev/sdm5

 

Do you know if Used Dev Size means the amount of space used? That would mean I have only used roughly 9.99TB, which would work in terms of using my NAS with 20TB.

 

I just want to confirm your thoughts before I go ahead and try to run the command.


  • 0

The restore only needs enough room to save the files stored on the volume, not the entire volume size, so potentially good news there.

 

"Used Dev Size" in the mdadm output on the previous post refers to the capacity the array uses on each member disk... it has nothing to do with how full the volume is.
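Not from the thread itself, but a quick arithmetic sketch (using the numbers from the mdadm -D /dev/md2 output above) shows how the per-device figure relates to the array size: RAID 6 spends two disks' worth of capacity on parity, so usable size is (12 − 2) × Used Dev Size.

```shell
# RAID6 usable capacity = (members - 2 parity) * per-device used size.
# Figures taken from the mdadm -D /dev/md2 output above (KiB blocks).
used_dev_size=9761598976
raid_devices=12
echo $(( (raid_devices - 2) * used_dev_size ))   # 97615989760, the Array Size
```

So the 9761598976 figure is per-disk capacity, not the amount of data actually stored on the volume.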


  • 0

Maybe, but I don't know how to do that without mounting it.  It will take a long time and it will write regular files to the destination, so you could probably move stuff off well in time to make room if you don't have quite enough space.


  • 0

@flyride Thanks for all the help. I think I was able to get the rough size of the data on my volume, ~21.6TB. You can see it in the output below: bytes_used 21619950809088. Hopefully that is the right amount.

 

Quote

ash-4.3# btrfs inspect-internal dump-super -f /dev/vg1000/lv
superblock: bytenr=65536, device=/dev/vg1000/lv
---------------------------------------------------------
csum                    0xd294ff59 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    dcb0f8f7-432e-4039-8f60-e43cdb417df1
label                   2018.12.18-02:03:43 v15217
generation              263749
root                    1401334054912
sys_array_size          129
chunk_root_generation   255129
root_level              1
chunk_root              21004288
chunk_root_level        1
log_root                1401336610816
log_root_transid        0
log_root_level          0
total_bytes             99958770368512
bytes_used              21619950809088
sectorsize              4096
nodesize                16384
leafsize                16384
stripesize              4096
root_dir                6
num_devices             1
compat_flags            0x8000000000000000
compat_ro_flags         0x0
incompat_flags          0x16b
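As a sanity check on that figure, the bytes_used value from the dump-super output above converts as follows (a minimal sketch; awk is used only for the floating-point division):

```shell
# bytes_used from the dump-super output above, converted for readability.
bytes_used=21619950809088
awk -v b="$bytes_used" 'BEGIN {
    printf "%.2f TB (decimal)\n", b / 1e12    # 21.62 TB
    printf "%.2f TiB (binary)\n", b / 2^40    # 19.66 TiB
}'
```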

 

I have mounted my other NAS and I am running btrfs restore /dev/vg1000/lv /root/hope/, where /root/hope/ is my remote NFS mount to my other NAS. One question I did have: on the btrfs restore page they mention you can use the following flag:

 

Quote

--path-regex: Regex for files to restore. In order to restore only a single folder somewhere in the btrfs tree, it is unfortunately necessary to construct a slightly nontrivial regex, e.g.: '^/(|home(|/username(|/Desktop(|/.*))))$'

 

Now, I am not a regex wizard, but my file paths on my NAS were something like /Volume1/Media/TV, /Volume1/Media/Movies, and /Volume1/Media/Home Videos. Let's say I just wanted to restore my Movies folder; I think the regex should be ^/(|Media(|/Movies(|/.*)))$. But when I tried to do a dry run with that, btrfs restore -D --path-regex '^/(|Media(|/Movies(|/.*)))$' /dev/vg1000/lv /root/hope/, it did not seem to work.

Do you know if there is something wrong with my syntax? 
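For what it's worth, the pattern can be tested locally before another dry run. The sketch below assumes the paths inside the filesystem really do start at /Media (the /Volume1 prefix is the mount point, not part of the btrfs tree), and uses an equivalent form of the pattern written with ? instead of empty alternations, since some regex engines reject an empty branch:

```shell
# '^/(Media(/Movies(/.*)?)?)?$' is equivalent to the empty-alternation form
# '^/(|Media(|/Movies(|/.*)))$': each trailing ? makes one path level
# optional, so every parent directory of the target also matches.
pattern='^/(Media(/Movies(/.*)?)?)?$'

for p in / /Media /Media/Movies "/Media/Movies/Some Movie.mkv" /Media/TV; do
    if printf '%s\n' "$p" | grep -Eq "$pattern"; then
        echo "match:    $p"
    else
        echo "no match: $p"
    fi
done
```

Run on a regular Linux box, everything except /Media/TV should match. If the dry run still returns nothing even though the pattern tests clean, the top-level directory name inside the volume may differ from Media (matching is case-sensitive), or the btrfs-progs build shipped with DSM 6.1 may be too old to support --path-regex.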


  • 0

So I ran btrfs restore /dev/vg1000/lv /root/hope, but it got about 250GB in and hit this error. I tried googling it and nothing really came up. Do you have any idea what's going on?

Maybe the regex file path would help by skipping this section. It looks like it's just the recycle bin anyway, not something I care about.

Any chance you know how to fix the regex?

Capture.PNG


  • 0

I had a similar error after my last crash. Everything was green except the volume, so I mounted it read-only directly to the volume path and rebooted after that. Maybe it will help you too.

 

mount -t btrfs -o recovery,ro /dev/vg1000/lv /volumeX

(where X is the volume number it should be)

