Volume Crash after 4 months of stability


Question

Hello,

 

I am trying to recover about 5 TB of data from a 20 TB volume on my Xpenology system. The volume crashed while the system was under normal operating conditions. I have the system on battery backup, so power failure is not a likely cause. I have done a fair bit of research on how to accomplish the recovery, but I have hit a roadblock with what I expect is my potential solution. If I can get some pointers in the right direction I would really appreciate it. So I'll just drop some details here:

 

fdisk -l

 cat /proc/mdstat

mdadm --assemble --scan -v

 

*These results contain numbers that get caught by the forum filter, so I removed them. If there is a way to post the results without them being flagged, I will.*

 

I have one volume with the default name of "volume1" located at /dev/mapper/vg1000-lv or /dev/vg1000/lv. I am not super familiar with how all of this works, but my research tells me that I can use some repair options on the volume to correct the errors that caused the system to crash in the first place. These repair options are where I run into trouble. Here is what I am talking about:

 

fsck.ext4 -v /dev/vg1000/lv

$ sudo fsck.ext4 -v /dev/vg1000/lv
e2fsck 1.42.6 (21-Sep-2012)
ext2fs_open2: Bad magic number in super-block
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/vg1000/lv

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

 

I think this is telling me that I have a bad super-block. When it goes to repair with a backup super-block it has some trouble. This could mean that I am dead in the water, but I think the method of finding some alternative blocks for repair is as follows:

 

mke2fs -n  /dev/vg1000/lv

 

sudo mke2fs -n  /dev/vg1000/lv
mke2fs 1.42.6 (21-Sep-2012)
mke2fs: Size of device (0x13ff8e000 blocks) /dev/vg1000/lv too big to be expressed
        in 32 bits using a blocksize of 4096.

On a typical system this lists out some blocks that are suitable replacements, but I have a 20 TB volume! When I created the volume, I did not know that there was a pseudo 16 TB limit on ext4. Now I do. This is my predicament. I think there is a way to run this if I do some things I don't quite understand. For instance, this post (redacted because of the spam filter)

 

Can anyone comment on whether this is a possibility? Am I going in the right direction?

 

Any help would be greatly appreciated!

 

Dom

Edited by Donkey545
More info added

19 answers to this question

  • 0

You want to troubleshoot your crash from the lowest (atomic) level progressing to the highest level.  You are starting at the highest level (the filesystem) which is the wrong end of the problem.

 

Generally you don't want to force things as they can render your system unrecoverable.  I have no idea if your data can be recovered at this point but you can troubleshoot until you run into something that can't be resolved.

 

Start by posting screen captures of Storage Manager, both the HDD/SSD screen and the RAID Group screen.  Are all your drives present and functional?

Then run a cat /proc/mdstat and post the output.

  • 0

Hi, thanks for the response. I appreciate the pointer on the direction to go with this. I am quite inexperienced with this type of troubleshooting. In my original post I had included the cat /proc/mdstat results, but there were numbers in there that the content filter flagged as phone numbers or something. I will try it again.

 

All of my disks are healthy. Unfortunately, I went to back up the data while the volume was still showing its original mounted size, and after I rebooted with my spare drive attached the system populated it with an empty volume. I have since disconnected my spare.

 

[Four screenshots attached]

 

$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
md2 : active raid5 sdc5[0] sdf5[3] sde5[2] sdd5[1]
      17567074368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md3 : active raid1 sdd6[0] sde6[1]
      3905898432 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdc2[0] sdd2[1] sde2[2] sdf2[3]
      2097088 blocks [12/4] [UUUU________]

md0 : active raid1 sdc1[0] sdd1[1] sde1[2] sdf1[3]
      2490176 blocks [12/4] [UUUU________]

unused devices: <none>

 

  • 0

Ok, your arrays seem healthy.  It's strange that you have a filesystem (volume) crash without any array corruption.
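For anyone reading the mdstat output above for the first time, here is a small sketch that reduces it to one line per array. The helper name mdstat_summary is made up for this example, and it assumes the usual layout in which the status line after each array name carries a [slots/active] pair such as [4/4].

```shell
#!/bin/sh
# Print "<array> slots=<n> active=<m>" for each md array in mdstat-formatted text.
# mdstat_summary is a hypothetical helper name, not part of mdadm.
mdstat_summary() {
  awk '
    /^md[0-9]+ :/ { name = $1 }                       # remember the array name
    name != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {   # status line holds [slots/active]
      pair = substr($0, RSTART + 1, RLENGTH - 2)      # e.g. "12/4"
      split(pair, n, "/")
      printf "%s slots=%s active=%s\n", name, n[1], n[2]
      name = ""
    }
  '
}

# Usage on a live system (read-only):
#   mdstat_summary < /proc/mdstat
```

On the output posted above this would report md2 with 4 of 4 members active and md3 with 2 of 2, which matches the [UUUU] and [UU] markers. md0 and md1 list 12 slots with 4 active.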

 

What exactly was happening when the volume crashed?  When you say it has "populated with an empty volume", do you mean that it shows 0 bytes?  I suspect it just isn't mounting it.

 

Run a vgdisplay and post the results please.

 

  • 0

Here are the results of the vgdisplay:

 

/$ sudo vgdisplay
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               20.00 TiB
  PE Size               4.00 MiB
  Total PE              5242424
  Alloc PE / Size       5242424 / 20.00 TiB
  Free  PE / Size       0 / 0
  VG UUID               uhqTDU-bBfp-Qkj0-BF2r-Hxru-XojL-ZwKdkW

 

On my system, I had a few Docker images and Plex running, and I had recently copied a large file (25+ GB) to a folder on the volume. I have since removed Docker and most additional packages as part of the debugging process. The only thing that I saw out of the ordinary was a message about a bad checksum on a file about three days before the crash. I had planned to delete the file and replace it.

 

Here is an excerpt from the log, with personal details redacted:

 

Information	System	2019/03/05 20:28:43	SYSTEM	Local UPS was plugged in.
Information	System	2019/03/05 20:23:49	Donkey545	System started counting down to reboot.
Error	System	2019/03/05 20:05:04	SYSTEM	Volume [1] was crashed.
Information	System	2019/03/05 00:35:10	SYSTEM	Rotate Mail Log.
Information	System	2019/03/04 20:56:17	SYSTEM	System successfully registered [redacted] to [redacted.org] in DDNS server [USER_dynu.com].
Information	System	2019/03/04 20:56:17	SYSTEM	System successfully registered [redacted] to [redacted] in DDNS server [USER_duckdns].
Information	System	2019/03/04 00:35:09	SYSTEM	Rotate Mail Log.
Information	System	2019/03/03 20:56:17	SYSTEM	System successfully registered [redacted] to [redacted.org] in DDNS server [USER_dynu.com].
Information	System	2019/03/03 20:56:17	SYSTEM	System successfully registered [redacted] to [redacted] in DDNS server [USER_duckdns].
Warning	System	2019/03/03 03:25:20	SYSTEM	Checksum mismatch on file [/volume1/Media/Media_Archive/1080p/redacted].
Information	System	2019/03/03 00:35:09	SYSTEM	Rotate Mail Log.

 

Yes, the volume is unmounted. That is what I meant when I said it shows up as 0 bytes.

Edited by Donkey545
clarification
  • 0

So FWIW this proves that your volume is not subject to a 16TB limitation (only applicable to ext4 created with 32-bit OS).
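As a side note on where the 16 TB figure comes from, here is a quick arithmetic sketch using the numbers in the mke2fs error message earlier in the thread. It assumes the limit is simply the number of blocks a 32-bit block number can address at a 4096-byte block size, which is how the error message phrases it.

```shell
#!/bin/sh
# Values taken from the mke2fs error message earlier in the thread.
blocks=$((0x13ff8e000))   # device size in blocks, as reported by mke2fs
block_size=4096           # bytes per block
limit=$((1 << 32))        # block count addressable with a 32-bit block number

echo "device blocks : $blocks"
echo "32-bit limit  : $limit"
echo "limit in TiB  : $(( limit * block_size / (1 << 40) ))"
if [ "$blocks" -gt "$limit" ]; then
  echo "device exceeds the 32-bit block range"
fi
```

The device has 5368242176 blocks, which is above the 4294967296 that fit in 32 bits, so mke2fs refused to describe it with the settings it was using.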

Do you know if your filesystem was btrfs or ext4?

/$ sudo vgdisplay
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1  <----------- this looks like a problem to me, I think there should be 2 LVs

Post the output of vgdisplay --verbose and lvdisplay --verbose

 

 

Edited by flyride
  • 0

That is interesting, and it explains why I could create the volume in the first place. 

 

I think that it was BTRFS, but I am not sure honestly. For some reason in my research I neglected to make sure I was looking for that detail. This suggests that btrfs is correct:

 

cat /etc/fstab

/$ cat /etc/fstab
none /proc proc defaults 0 0
/dev/root / ext4 defaults 1 1
/dev/vg1000/lv /volume1 btrfs auto_reclaim_space,synoacl,relatime 0 0

 

 

lvdisplay -v

$ sudo lvdisplay -v
    Using logical volume(s) on command line.
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                D0NEYj-yD1r-CGw0-AKW2-nSFa-Mu72-WLZKYj
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 0
  LV Size                20.00 TiB
  Current LE             5242424
  Segments               3
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           253:0

 

vgdisplay -v

$ sudo vgdisplay -v
    Using volume group(s) on command line.
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               20.00 TiB
  PE Size               4.00 MiB
  Total PE              5242424
  Alloc PE / Size       5242424 / 20.00 TiB
  Free  PE / Size       0 / 0
  VG UUID               uhqTDU-bBfp-Qkj0-BF2r-Hxru-XojL-ZwKdkW

  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                D0NEYj-yD1r-CGw0-AKW2-nSFa-Mu72-WLZKYj
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 0
  LV Size                20.00 TiB
  Current LE             5242424
  Segments               3
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           253:0

  --- Physical volumes ---
  PV Name               /dev/md2
  PV UUID               Fosmy0-AkIR-naPA-Ez51-3yId-2AXZ-YxGJdy
  PV Status             allocatable
  Total PE / Free PE    4288836 / 0

  PV Name               /dev/md3
  PV UUID               jxYbss-ZVGK-RhTc-z4b5-QAc7-qCie-bH250G
  PV Status             allocatable
  Total PE / Free PE    953588 / 0

 

Thanks for helping me through this, I really appreciate it. Is there a reason that you think the missing LV is the issue?

Edited by Donkey545
  • 0

Here is some check info on the volume:

 

sudo btrfs check  /dev/vg1000/lv
Password:
Syno caseless feature on.
Checking filesystem on /dev/vg1000/lv
UUID: b27120c9-a2af-45d6-8e3b-05e2f31ba568
checking extents
checking free space tree
free space info recorded 619 extents, counted 620
Wanted bytes 999424, found 303104 for off 5706027515904
Wanted bytes 402767872, found 303104 for off 5706027515904
cache appears valid but isnt 5705893412864
found 5651223941120 bytes used err is -22
total csum bytes: 5313860548
total tree bytes: 6601883648
total fs tree bytes: 703512576
total extent tree bytes: 105283584
btree space waste bytes: 532066391
file data blocks allocated: 97865400832000
 referenced 5644258041856

 

  • 0
3 hours ago, Donkey545 said:

Is there a reason that you think the missing LV is the issue?

 

Syno has several different historical methods of building and naming LVs.  I think this one must have been built on 6.1.x because it doesn't use the current design.  The current one has a "syno_vg_reserved_area" 2nd LV when a volume is created.  But yours may be complete with only one LV.

 

It should not be necessary, but make sure you have activated all your LVs by executing

sudo vgchange -ay

 

At this point I think we have validated all the way up to the filesystem. 
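One cheap sanity check on the LVM layer, sketched here with the numbers copied from the vgdisplay -v output earlier in the thread: the extents on the two physical volumes should add up to the volume group total, and the total times the extent size gives the size in bytes. The variable names are just for illustration.

```shell
#!/bin/sh
# Numbers copied from the vgdisplay -v output earlier in the thread.
pe_md2=4288836        # Total PE on /dev/md2
pe_md3=953588         # Total PE on /dev/md3
vg_total_pe=5242424   # Total PE reported for vg1000
pe_size_mib=4         # PE Size is 4.00 MiB

sum=$(( pe_md2 + pe_md3 ))
size_mib=$(( vg_total_pe * pe_size_mib ))
size_bytes=$(( size_mib * 1024 * 1024 ))

echo "PV extents summed : $sum (VG reports $vg_total_pe)"
echo "VG size           : $size_mib MiB ($size_bytes bytes)"
```

The sum matches the VG total, so no extents are unaccounted for. The byte figure can be compared with the total_bytes field of a btrfs superblock dump to see whether the filesystem spans the whole LV.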

 

3 hours ago, Donkey545 said:

I think that it was BTRFS, but I am not sure honestly. For some reason in my research I neglected to make sure I was looking for that detail. This suggests that btrfs is correct

 

Yep, you are using btrfs.  Therefore the tools (fsck, mke2fs)  you were trying to use at the beginning of the post are not correct.  Those are appropriate for the ext4 filesystem only.  NOTE: Btrfs has a lot of redundancy built into it so running a filesystem check using btrfs check --repair is generally not recommended.

 

Before you do anything else, just be absolutely certain you cannot manually mount the filesystem via

sudo mount /dev/vg1000/lv /volume1

 

If that doesn't work, try mounting with the free-space cache cleared

sudo mount -o clear_cache /dev/vg1000/lv /volume1

 

If that doesn't work, try a recovery mount

sudo mount -o recovery /dev/vg1000/lv /volume1

 

If any of these mount options work, copy all the data off that you can, delete your volume and rebuild it.

If not, report the failure information.
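The three mount attempts above can be sketched as a loop that stops at the first success. This is only an illustration: try_mounts and the MOUNT_BIN override are made-up names (the override exists so the loop can be dry-run with a stub), and it needs to run as root on the real system. On newer kernels the recovery option was replaced by usebackuproot; the option names here are the ones used in this thread.

```shell
#!/bin/sh
# Try each mount variant in order; stop at the first that succeeds.
# try_mounts and MOUNT_BIN are hypothetical names for this sketch.
try_mounts() {
  dev=$1
  mnt=$2
  for opts in "" clear_cache recovery; do
    if [ -n "$opts" ]; then
      "${MOUNT_BIN:-mount}" -o "$opts" "$dev" "$mnt" && rc=0 || rc=$?
    else
      "${MOUNT_BIN:-mount}" "$dev" "$mnt" && rc=0 || rc=$?
    fi
    if [ "$rc" -eq 0 ]; then
      echo "mounted with options: ${opts:-none}"
      return 0
    fi
  done
  echo "all mount attempts failed; see dmesg | tail" >&2
  return 1
}

# Usage (as root):
#   try_mounts /dev/vg1000/lv /volume1
```

Stopping at the first success matters because the later options change more state than a plain mount does.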

Edited by flyride
  • 0

It looks like I still have some issues. The mount fails with each of these options.

 

First I do the vgchange -ay:

 

sudo vgchange -ay
  1 logical volume(s) in volume group "vg1000" now active

 

Then a normal mount:

sudo mount /dev/vg1000/lv /volume1
mount: wrong fs type, bad option, bad superblock on /dev/vg1000/lv,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

 

Then a clear cache:

 

sudo mount -o clear_cache /dev/vg1000/lv /volume1
mount: wrong fs type, bad option, bad superblock on /dev/vg1000/lv,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

 

Then a recovery:

 

sudo mount -o recovery /dev/vg1000/lv /volume1
mount: wrong fs type, bad option, bad superblock on /dev/vg1000/lv,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

 

 

I have included the dmesg text as an attachment. I am seeing some errors like:

BTRFS error (device dm-0): incorrect extent count for 5705893412864; counted 619, expected 618


 

 

Xpenology dmesg.txt

  • 0

A few more things to try.  But before that, do you have another disk on which you could recover the 4TB of files?  If so, that might be worth preparing for.

 

First, attempt one more mount:

sudo mount -o recovery,ro /dev/vg1000/lv /volume1

 

If it doesn't work, do all of these:

sudo btrfs rescue super /dev/vg1000/lv

sudo btrfs-find-root /dev/vg1000/lv

sudo btrfs insp dump-s -f /dev/vg1000/lv

 

Post results of all.  We are getting to a point where the only options are a check repair (which doesn't really work on Syno because of kernel/utility mismatches) or a recovery to another disk as mentioned above.

Edited by flyride
  • 0

I will be testing out these options later today. I have four more 6 TB drives that I had purchased to do a backup of this NAS, so I have the space available. Thanks for the help!

  • 0

First, attempt one more mount:

sudo mount -o recovery,ro /dev/vg1000/lv /volume1

 

/$ sudo mount -o recovery,ro /dev/vg1000/lv /volume1
mount: wrong fs type, bad option, bad superblock on /dev/vg1000/lv,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Excerpt of dmesg:

dmesg

[249007.103992] BTRFS: error (device dm-0) in convert_free_space_to_extents:456: errno=-5 IO failure
[249007.112998] BTRFS: error (device dm-0) in add_to_free_space_tree:1049: errno=-5 IO failure
[249007.121572] BTRFS: error (device dm-0) in __btrfs_free_extent:6829: errno=-5 IO failure
[249007.129829] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2970: errno=-5 IO failure
[249007.138445] BTRFS: error (device dm-0) in open_ctree:3576: errno=-5 IO failure (Failed to recover log tree)
[249007.148704] BTRFS error (device dm-0): cleaner transaction attach returned -30
[249007.181327] BTRFS: open_ctree failed

 

With the superblock rescue, we actually get some good news! Well maybe haha

 

sudo btrfs rescue super /dev/vg1000/lv

 

sudo btrfs rescue super /dev/vg1000/lv
All supers are valid, no need to recover

I am attaching the result of the find-root as a text file due to the length of the output. The first few lines are included below:

 

sudo btrfs-find-root /dev/vg1000/lv

 

/$ sudo btrfs-find-root /dev/vg1000/lv
Superblock thinks the generation is 781165
Superblock thinks the level is 1
Found tree root at 5739883905024 gen 781165 level 1
Well block 5739820105728(gen: 781164 level: 1) seems good, but generation/level doesn't match, want gen: 781165 level: 1
Well block 5739770773504(gen: 781163 level: 1) seems good, but generation/level doesn't match, want gen: 781165 level: 1
Well block 5739769020416(gen: 781155 level: 0) seems good, but generation/level doesn't match, want gen: 781165 level: 1
Well block 5739768217600(gen: 781155 level: 0) seems good, but generation/level doesn't match, want gen: 781165 level: 1
Well block 5739782979584(gen: 781108 level: 0) seems good, but generation/level doesn't match, want gen: 781165 level: 1
Well block 5739817582592(gen: 781091 level: 1) seems good, but generation/level doesn't match, want gen: 781165 level: 1
Well block 5739819106304(gen: 781071 level: 1) seems good, but generation/level doesn't match, want gen: 781165 level: 1
Well block 5739787206656(gen: 781069 level: 1) seems good, but generation/level doesn't match, want gen: 781165 level: 1

 

When I tried to run your dump command, I got an error opening the file:

sudo btrfs insp dump-s -f /dev/vg1000/lv
ERROR: cannot open /dev/vg1000/lv: No such file or directory

So I ran what I think is the same command just written differently:

 

sudo btrfs inspect-internal dump-super -f /dev/vg1000/lv
 

sudo btrfs inspect-internal dump-super -f /dev/vg1000/lv
superblock: bytenr=65536, device=/dev/vg1000/lv
---------------------------------------------------------
csum                    0xb6222977 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    b27120c9-a2af-45d6-8e3b-05e2f31ba568
label                   2018.11.24-06:48:51 v23739
generation              781165
root                    5739883905024
sys_array_size          129
chunk_root_generation   779376
root_level              1
chunk_root              20987904
chunk_root_level        1
log_root                5739948392448
log_root_transid        0
log_root_level          0
total_bytes             21988319952896
bytes_used              5651223941120
sectorsize              4096
nodesize                16384
leafsize                16384
stripesize              4096
root_dir                6
num_devices             1
compat_flags            0x8000000000000000
compat_ro_flags         0x3
                        ( FREE_SPACE_TREE |
                          FREE_SPACE_TREE_VALID )
incompat_flags          0x16b
                        ( MIXED_BACKREF |
                          DEFAULT_SUBVOL |
                          COMPRESS_LZO |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          SKINNY_METADATA )
csum_type               0
csum_size               4
cache_generation        18446744073709551615
uuid_tree_generation    781165
dev_item.uuid           c0444440-b78e-4432-aa99-15d7d6d43e5b
dev_item.fsid           b27120c9-a2af-45d6-8e3b-05e2f31ba568 [match]
dev_item.type           0
dev_item.total_bytes    21988319952896
dev_item.bytes_used     5857278427136
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          1
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
sys_chunk_array[2048]:
        item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520)
                chunk length 8388608 owner 2 stripe_len 65536
                type SYSTEM|DUP num_stripes 2
                        stripe 0 devid 1 offset 20971520
                        dev uuid: c0444440-b78e-4432-aa99-15d7d6d43e5b
                        stripe 1 devid 1 offset 29360128
                        dev uuid: c0444440-b78e-4432-aa99-15d7d6d43e5b
backup_roots[4]:
        backup 0:
                backup_tree_root:       5739770773504   gen: 781163     level: 1
                backup_chunk_root:      20987904        gen: 779376     level: 1
                backup_extent_root:     5739782307840   gen: 781164     level: 2
                backup_fs_root:         29638656        gen: 8  level: 0
                backup_dev_root:        5728600883200   gen: 779376     level: 1
                backup_csum_root:       5739776557056   gen: 781164     level: 3
                backup_total_bytes:     21988319952896
                backup_bytes_used:      5651224502272
                backup_num_devices:     1

        backup 1:
                backup_tree_root:       5739820105728   gen: 781164     level: 1
                backup_chunk_root:      20987904        gen: 779376     level: 1
                backup_extent_root:     5739782307840   gen: 781164     level: 2
                backup_fs_root:         29638656        gen: 8  level: 0
                backup_dev_root:        5728600883200   gen: 779376     level: 1
                backup_csum_root:       5739820466176   gen: 781165     level: 3
                backup_total_bytes:     21988319952896
                backup_bytes_used:      5651223916544
                backup_num_devices:     1

        backup 2:
                backup_tree_root:       5739883905024   gen: 781165     level: 1
                backup_chunk_root:      20987904        gen: 779376     level: 1
                backup_extent_root:     5739912200192   gen: 781166     level: 2
                backup_fs_root:         29638656        gen: 8  level: 0
                backup_dev_root:        5728600883200   gen: 779376     level: 1
                backup_csum_root:       5739903025152   gen: 781166     level: 3
                backup_total_bytes:     21988319952896
                backup_bytes_used:      5651222638592
                backup_num_devices:     1

        backup 3:
                backup_tree_root:       5739986468864   gen: 781162     level: 1
                backup_chunk_root:      20987904        gen: 779376     level: 1
                backup_extent_root:     5739981111296   gen: 781162     level: 2
                backup_fs_root:         29638656        gen: 8  level: 0
                backup_dev_root:        5728600883200   gen: 779376     level: 1
                backup_csum_root:       5739997544448   gen: 781163     level: 3
                backup_total_bytes:     21988319952896
                backup_bytes_used:      5651223896064
                backup_num_devices:     1

 

 

Now this may be a dumb question, but what is my best method for backing up the data? Should I connect directly to SATA, use a USB-to-SATA adapter, or configure my other NAS and use my network?

 

 

Xpenology btrfs-find-root.txt

  • 0
5 hours ago, Donkey545 said:

dmesg


[249007.103992] BTRFS: error (device dm-0) in convert_free_space_to_extents:456: errno=-5 IO failure
[249007.112998] BTRFS: error (device dm-0) in add_to_free_space_tree:1049: errno=-5 IO failure
<snip>
[249007.181327] BTRFS: open_ctree failed

sudo btrfs inspect-internal dump-super -f /dev/vg1000/lv


sudo btrfs inspect-internal dump-super -f /dev/vg1000/lv
superblock: bytenr=65536, device=/dev/vg1000/lv
<snip>
compat_flags            0x8000000000000000
compat_ro_flags         0x3
                        ( FREE_SPACE_TREE |
                          FREE_SPACE_TREE_VALID )
incompat_flags          0x16b
                        ( MIXED_BACKREF |
                          DEFAULT_SUBVOL |
                          COMPRESS_LZO |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          SKINNY_METADATA )

 

 

Ok, there is good information here.  The kernel output (dmesg) tells us that you probably have a hardware failure that is not a bad sector.  Possibly a disk problem, possibly a bad SATA cable, or a disk controller failure which has caused some erroneous information to be written to the disk.  

 

It's extremely likely that a btrfs check would fix your problem. In particular, I would try these repair commands in this order, with an attempt to mount after each has completed:

sudo btrfs check --init-extent-tree /dev/vg1000/lv

sudo btrfs check --init-csum-tree /dev/vg1000/lv

sudo btrfs check --repair /dev/vg1000/lv

 

HOWEVER: Because the btrfs filesystem has incompatibility flags reported via the inspect-internal dump, I believe you won't be able to run any of the commands, and they will error out with "couldn't open RDWR because of unsupported option features." The flags tell us that, given the way the btrfs filesystem has been built, the btrfs code in the kernel and the btrfs tools are not totally compatible with each other.  I think Synology may use its own custom binaries to generate the btrfs filesystem and doesn't intend for the standard btrfs tools to be used to maintain it.  They may have their own maintenance tools that we don't know about, or they may only load the tools when providing remote support.

 

It might be possible to compile the latest versions of the btrfs tools against a Synology kernel source and get them to work.  If it were me in your situation I would try that. It's actually been on my list of things to do when I get some time, and if it works I will post them for download. The other option is to build up a Linux system, install the latest btrfs, connect your drives to it and run btrfs tools from there.  Obviously both of these choices are fairly complex to execute.

 

In summary, I think the filesystem is largely intact and the check options above would fix it.  But in lieu of a working btrfs check option, consider this alternative, which definitely does work on Synology btrfs:

 

Execute a btrfs recovery.

 

5 hours ago, Donkey545 said:

Now this may be a dumb question, but what is my best method for backing up the data? should I connect directly to SATA, USB to Sata, or Configure my other NAS and use my network?

 

Btrfs has a special option to dump the whole filesystem to a completely separate location, even if the source cannot be mounted.  So if you have a free SATA port, install an adequately sized drive, create a second storage pool and set up a second volume to use as a recovery target.  Alternatively, you could build it on another NAS and NFS mount it.  Whatever you come up with has to be directly accessible on the problem system.

 

For example's sake, let's say that you have installed and configured /volume2.  This command should extract all the files from your broken btrfs filesystem and drop them on /volume2.  Note that /volume2 can be set up as btrfs or ext4 - the filesystem type does not matter.

sudo btrfs restore /dev/vg1000/lv /volume2

 

FMI: https://btrfs.wiki.kernel.org/index.php/Restore

Doing this restore is probably a good safety option even if you are able to successfully execute some sort of repair on the broken filesystem with btrfs check.
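A cautious way to approach the restore is to list what would be recovered before writing anything. The sketch below only prints the command lines so they can be reviewed first; restore_cmd is a made-up helper name. The flags are standard btrfs restore options (-D dry run, -v verbose, -i ignore errors), but whether the btrfs tools shipped on a given Synology build accept all of them is an assumption worth checking with btrfs restore --help.

```shell
#!/bin/sh
# Print a btrfs restore command line for review; nothing is executed.
# restore_cmd is a hypothetical helper name for this sketch.
restore_cmd() {
  src=$1
  dest=$2
  mode=${3:-dry-run}
  if [ "$mode" = "dry-run" ]; then
    # -D lists what would be restored without writing; -v names each file
    echo "btrfs restore -D -v $src $dest"
  else
    # -i keeps going past errors; files already in dest are skipped unless -o is added
    echo "btrfs restore -i -v $src $dest"
  fi
}

restore_cmd /dev/vg1000/lv /volume2          # dry-run form
restore_cmd /dev/vg1000/lv /volume2 real     # form that actually copies files
```

Because existing files are not overwritten by default, re-running the second form after an interruption should pick up where the previous run stopped.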

 

I'm just about out of advice at this point. You do a very good job of pulling together relevant logs, however.  If you encounter something interesting, post it.  Otherwise, good luck!

Edited by flyride
  • 0

Hey flyride,

 

I really appreciate the help. I have not gone through any of these last options just yet, but I will be attempting a restore to my other NAS soon.

 

On 3/9/2019 at 1:07 AM, flyride said:

 

It might be possible to compile the latest versions of the btrfs tools against a Synology kernel source and get them to work.  If it were me in your situation I would try that. It's actually been on my list of things to do when I get some time, and if it works I will post them for download. The other option is to build up a Linux system, install the latest btrfs, connect your drives to it and run btrfs tools from there.  Obviously both of these choices are fairly complex to execute.

 

 

I think at this point making a new OS drive wouldn't be too much more work if it yields better results than the Synology system. I will investigate this option and follow up if I use it.

 

On 3/9/2019 at 1:07 AM, flyride said:

I'm just about out of advice at this point. You do a very good job of pulling together relevant logs, however.  If you encounter something interesting, post it.  Otherwise, good luck!

 

Again, I really appreciate your help. Your method of investigating the issue has taught me a great deal about how issues like these can be addressed, and even more about how I can be more diligent in troubleshooting my own issues in the future. You rock!

  • 0

I was able to recover my data with a restore. As you expected, none of the repair options worked, and each yielded the incompatibility flag you spoke of. For the recovery, I configured my other NAS with a volume large enough for all of the data and mounted it over NFS.

For any future googlers: the destination volume for a btrfs restore does not need to be the size of the source volume, only the size of the data to be recovered. Furthermore, you do not need to worry too much about completing the recovery in a single shot. I had several network changes during my recovery that caused the NFS share to unmount from the Xpenology system. Luckily, the restore process can check the destination for existing files and will only copy the files that have not been transferred yet.

My total recovery time was about 3 days including disconnections, because I did it over a gigabit network. Most of the time the transfer would saturate the network, but for a good portion of the time I was only seeing 45MB/s. If you need to perform this operation over a gigabit network, I'd say a reasonable estimate for average speed is 65MB/s, keeping in mind that I had very large files for the most part.
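For planning purposes, the speed figures above can be turned into a rough transfer-time estimate. This is only a sketch: estimate_transfer_time is a made-up helper, the 5 TB figure is the approximate data size mentioned at the start of the thread, and the result covers raw copy time only, not disconnections or retries.

```shell
#!/bin/sh
# Hypothetical helper: estimate raw transfer time.
# $1 = data size in GB (decimal, 1 GB = 1000 MB)
# $2 = average transfer rate in MB/s
estimate_transfer_time() {
    seconds=$(( $1 * 1000 / $2 ))
    printf '%dh %dm\n' $(( seconds / 3600 )) $(( seconds % 3600 / 60 ))
}

estimate_transfer_time 5000 65   # about 5 TB at 65 MB/s, prints "21h 22m"
estimate_transfer_time 5000 45   # same data at 45 MB/s, prints "30h 51m"
```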

  • 0

I've just had trouble with a crashed btrfs volume, too.

I also can't mount the volume, and Storage Manager shows the crashed volume as "0 bytes / 0 bytes".

 

I found a page about btrfs recovery, linked below, and followed the steps it describes.

https://lists.opensuse.org/opensuse/2017-02/msg00930.html

 

In my case too, "btrfs restore" (Step 8) seemed to solve the problem.

I tried to restore a 2.5TB volume (over 2TB used), but all I could restore was about 500GB.

I followed the steps further. At Step 11, I could recover almost all chunks (only 1 chunk failed), but the chunk tree itself couldn't be recovered.

I tried to restore again at Step 12, but this step didn't recover any files beyond those from Step 8.

(btrfs restore doesn't overwrite already recovered files, so you can choose the same destination to skip them.)
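Since btrfs restore skips files that already exist at the destination, an interrupted run can simply be repeated. Here is a minimal sketch of a retry loop. The retry function and the RETRY_DELAY variable are made up for illustration, and the device and destination in the usage line are the ones from earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or attempts run out.
# $1 = maximum number of attempts, remaining arguments = the command to run.
# RETRY_DELAY = seconds to wait between attempts (assumed knob, default 60).
retry() {
    max=$1
    shift
    n=1
    while ! "$@"; do
        [ "$n" -ge "$max" ] && return 1
        echo "attempt $n failed, retrying in ${RETRY_DELAY:-60}s" >&2
        sleep "${RETRY_DELAY:-60}"
        n=$(( n + 1 ))
    done
}
```

Usage would look like: retry 5 sudo btrfs restore -v /dev/vg1000/lv /volume2
If the destination is an NFS mount, it is worth confirming that it is mounted again before each retry. Otherwise the files land on the local disk under the empty mount directory.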

 

------------<excursion>-----------

 

By the way, my ESXi setup is a bit complex. (Please read the linked page.)

A PSOD crashed the SSD cache, and this left the "User DSM" VM's btrfs volume unmountable.

This was caused by a hardware problem.

(I recently moved my server to a new location. After that, my system became unstable...)

 

There were 3 PSODs. Two didn't damage the volume at all, but one caused a severe crash of that btrfs volume and of a Win10 VM's NTFS drive. (User files were not lost, but it severely damaged the Win10 system drive!)

I had to restore the Win10 VM from a backup. :(

 

In my setup the disk is virtualized, so I could use snapshots and try any btrfs recovery without risk.

(I also copied the 2.5TB ext4 vmdk to another volume. That's not required, though.)

Umm, but there are several trade-offs, and I might not choose the same setup if I built another...

 

------------</excursion>-----------

 

I had almost given up, but then I found this answer on Ask Ubuntu, which says:

Quote

What did fix it for me was upgrading the kernel from 3.10 to 3.12. After a reboot the btrfs partition could be mounted again.

 

I confirmed with 'uname -a' that the kernel of DSM 6.1 is 3.10.

So I tried an Ubuntu live CD (I used 18.04 LTS), and it magically mounted my crashed btrfs volume!
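To check whether a system is running a kernel older than the 3.12 mentioned in the quoted answer, something like the sketch below could be used. version_lt is a made-up helper that relies on "sort -V" from GNU coreutils. The 3.12 threshold comes only from that answer and is no guarantee that a given volume will mount.

```shell
#!/bin/sh
# Hypothetical helper: true if version $1 sorts strictly before version $2.
# Relies on `sort -V` (GNU coreutils), which is an assumption about the system.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Strip the distribution suffix, e.g. "4.15.0-29-generic" -> "4.15.0".
kernel=$(uname -r | cut -d- -f1)
if version_lt "$kernel" 3.12; then
    echo "kernel $kernel is older than 3.12; consider a newer live system"
else
    echo "kernel $kernel is 3.12 or newer"
fi
```

On the live system, if the volume sits on md RAID and LVM as in the original question, the array and volume group may need to be brought up before mounting. That means installing mdadm and lvm2, then running "sudo mdadm --assemble --scan" and "sudo vgchange -ay", and finally "sudo mount -o ro /dev/vg1000/lv /mnt/volume1". The device path is the one from earlier in the thread and may differ on your system.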

 

I could restore almost all files from the crashed volume without losing metadata.

(With 'btrfs restore', all the metadata such as permissions and timestamps was lost.)

 

I confirmed that 'btrfs rescue chunk-recover' plus mounting with a newer btrfs system was the key to recovering from my crash.

(I rolled back to the initial state and tried to recover from scratch using only the Ubuntu live CD, but I couldn't mount the volume without first running 'btrfs rescue chunk-recover' (Step 11). I also tried Step 14 without Step 11; that failed too.)

 

I recovered using rsync, from the crashed btrfs volume to an NFS-exported "Host DSM" volume (ext4).

You need to exclude folders whose names start with "@" (see the sample below).
 

Quote

sudo rsync -avh --exclude='/@*/' /mnt/volume1/ /mnt/recover-volume1/ 2>&1 | tee -a /mnt/work/rsync-recover.log

 

(The most important one to exclude is "@sharesnap", because it holds the snapshot backup files. It contains so many duplicated files that the restore can't complete if you took many snapshots.)
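Before starting a long rsync, it can help to preview which folders the exclude pattern will skip. In the pattern '/@*/', the leading slash anchors the match to the top of the source and the trailing slash limits it to directories, so "@" folders nested inside shares are still copied. The sketch below uses a made-up helper, list_at_dirs, to list the top-level matches.

```shell
#!/bin/sh
# Hypothetical helper: print the top-level directories that the
# rsync pattern --exclude='/@*/' would skip under the source root $1.
list_at_dirs() {
    for d in "$1"/@*/; do
        # If nothing matches, the glob stays literal and this test fails.
        [ -d "$d" ] && basename "$d"
    done
    return 0
}
```

For the command above, that would be: list_at_dirs /mnt/volume1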

 

Install GNU screen or tmux to make the session detachable and to have several work windows.

(Without that, you can't shut down the PC you are connected from while the restore runs.)

You can monitor read errors with "tail -f rsync-recover.log | grep '^rsync:'" (rsync) and "sudo tail -f /var/log/syslog | grep -i btrfs" (btrfs system).

(I used script(1) to save the error log, because tee didn't work; something seemed to block the output. The error log is important for finding which files are unreliable.)
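Once the copy has finished, the saved log can be reduced to a list of the files that hit errors. This is only a sketch: failed_paths is a made-up helper, and it assumes rsync error lines begin with "rsync:" and carry the file path in double quotes, which is how read errors usually appear but may vary by rsync version.

```shell
#!/bin/sh
# Hypothetical helper: list the unique quoted paths from rsync error lines.
# $1 = log file. Assumes error lines look like:
#   rsync: <message> "<path>": <reason> (<code>)
failed_paths() {
    sed -n 's/^rsync:[^"]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}
```

With the log from the rsync command above, that would be: failed_paths /mnt/work/rsync-recover.log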

 

 

I felt relieved when I found I could recover almost all the files.

From now on I want to follow the "3-2-1 backup" rule, though. :-)

Edited by benok
Guest
This topic is now closed to further replies.