NooL

Volume crashed, 4 disks now showing as Initialized, and no data access


Hi guys 

 

So this morning i woke up to a "Volume has crashed" email, this is the first time this has happened. 

 

My setup: 8 SATA HDDs attached to a SAS controller. 

 

At first i had access to the data, but i was foolish enough to reboot the box - after this i no longer had access to data. 

 

I now have 4 out of 8 disks showing as "Initialized" after clicking the "Repair system partition" link it presented me with. But still no access to data. 

 

What are my next steps? How do i go about getting the volume up and running again successfully? 

 

I really hope for help here, since i had about 13 TB of data on that volume. 

 

I have attached pictures showing the current state. 

 

2019-10-04 10_28_11-DiskStation - Synology DiskStation – Google Chrome.png

2019-10-04 10_28_30-DiskStation - Synology DiskStation – Google Chrome.png

2019-10-04 10_27_26-DiskStation - Synology DiskStation – Google Chrome.png


How many physical drives / SATA ports do you have in your system, and what bootloader/DSM combo do you use?

I see "Disk 14", and it makes me think of the 12-port limit, which, as mentioned here, can cause such a problem.

5 hours ago, NooL said:

At first i had access to the data, but i was foolish enough to reboot the box - after this i no longer had access to data. 

yeah you got that - and the fact you don't have a backup of your data?

also, as you even have the exact time when it happened, did you check the logs (ssh login)?

whatever you do, you first need to know the cause, and you need to rule it out to prevent it from recurring

what if you do a repair/rebuild and it strikes again in the middle of it?

 

5 hours ago, NooL said:

I now have 4 out of 8 disks showing as "Initialized" after clicking the "Repair system partition" link it presented me with. But still no access to data. 

3rd or 4th mistake: if you are doing data recovery, never "repair" anything (at least not unless you know exactly what you are doing and can revert it)

btw. when doing such screenshots please change the system language to English, for the same reason you wrote the post here in English and not in your local tongue

 

i can tell you at least something about how SHR is working (afaik) and in what direction a recovery would have to go

SHR handles different size disks like this: it creates a partition on every disk in the size of the smallest disk and forms a raid (usually a raid5) over all these partitions. It then goes on the same way with the leftover space on the bigger disks, which can result in different raid types, like raid1 instead of raid5 when there are only 2 partitions of the same size

you end up with 2 or 3 (or more) raid sets and these raid sets (mdadm) are then "glued" together with LVM

 

so the first step would be to save all logs from the system; if you are lucky you will find information about the creation of the whole volume there, helping you to put the pieces together

2nd would be looking for LVM information in /etc, maybe they are still there and if you piece together your raid sets (with mdadm) you might use that

after that you need to map the partition layout of all disks using fdisk

 

with this information you can start checking (first in your mind or on paper) whether you can stitch all partitions together into mdadm raid sets so that everything fits. if that works out, you would repair/reattach the raid sets manually with mdadm and then use lvm to put the raid sets together in the right way, so that the result is your original volume
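The information-gathering steps above could be sketched roughly like this. It is only a sketch under assumptions: `collect_state` is a made-up helper name, the output directory should sit on storage that is not part of the broken volume, and `/etc/lvm` as the place for LVM metadata is a guess that may not hold on every DSM version. Every command in it only reads from the disks.

```shell
# Read-only snapshot of md / partition / LVM state into a directory.
# collect_state is a hypothetical helper, not a Synology or mdadm tool.
collect_state() {
    out=$1; shift
    mkdir -p "$out" || return 1

    # current kernel view of all md arrays
    cat /proc/mdstat > "$out/mdstat.txt" 2>&1 || true

    # LVM metadata, if it lives in the usual place (assumption)
    [ -d /etc/lvm ] && cp -a /etc/lvm "$out/etc-lvm" 2>/dev/null || true

    for disk in "$@"; do
        name=$(basename "$disk")
        # partition layout of the whole disk
        fdisk -l "$disk" > "$out/fdisk-$name.txt" 2>&1 || true
        # md superblock of every partition on that disk
        for part in "$disk"[0-9]*; do
            [ -b "$part" ] || continue
            mdadm --examine "$part" > "$out/examine-$(basename "$part").txt" 2>&1 || true
        done
    done
}

# Usage (as root; the target path is a placeholder):
#   collect_state /mnt/usb/notes /dev/sd[g-n]
```

Having these files saved somewhere safe means the layout can still be compared on paper later, even if a following step changes what the box reports.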

 

just messing around with it will make things worse, to a point where it gets nearly impossible to recover anything. if you really need your data, look for a recovery service that can do this

i hope that gives you an overview of what your situation is

i cant help you much further here because i have done mdadm repairs just for practice and messing around and lvm repair just once (i only use plain mdadm raid5/6 to keep it simple in cases like this)

 

so from my theoretical point of view the layout of the partitions would be like this

all disks contain as 1st and 2nd partitions a raid1 set for system (dsm installation) and swap

so you are looking for the 3rd partitions and further

the disks, listed by size in TB

2 2 2 8 4 4 4 8

so the 3rd on all 8 disks should be a 2TB partition

on the disks with the sizes 8 4 4 4 8 there should be a 4th partition with 2TB

and on the two 8TB disks there should be one more with 4TB

so the raid sets you had might have been

8 x 2TB raid5

5 x 2TB raid5

2 x 4TB raid1

 

bringing it up to a 26TB SHR1 volume - assuming that you had one volume
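As a rough sanity check of that 26TB figure, the sketch below adds up the usable space of the three assumed raid sets. It assumes whole-TB sizes and the usual rules (raid5 gives up one member's worth of space to parity, raid1 keeps one copy's worth); `raid_usable_tb` is a made-up helper name.

```shell
# Usable capacity of one md array in TB (whole numbers only).
# raid_usable_tb is a hypothetical helper for this calculation.
raid_usable_tb() {
    level=$1; members=$2; size_tb=$3
    case "$level" in
        raid1) echo "$size_tb" ;;                        # mirror: one copy's worth
        raid5) echo $(( (members - 1) * size_tb )) ;;    # one member's worth is parity
        *)     echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

md_a=$(raid_usable_tb raid5 8 2)   # 8 x 2TB raid5
md_b=$(raid_usable_tb raid5 5 2)   # 5 x 2TB raid5
md_c=$(raid_usable_tb raid1 2 4)   # 2 x 4TB raid1
echo "total: $(( md_a + md_b + md_c )) TB"
```

With the sizes from the list this comes out as 14 + 8 + 4 = 26 TB.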

 

 

Edited by IG-88

8 minutes ago, bearcat said:

How many physical drives / SATA ports do you have in your system, and what bootloader/DSM combo do you use?

I see "Disk 14", and it makes me think of the 12-port limit, which, as mentioned here, can cause such a problem.

imho if that were the case then disk 14 could never be shown as "normal" and as part of volume 1

you would not even see it under disks, you would just see 12 disks/slots

Edited by IG-88

@IG-88 Thanks for your reply - i apologize for the Danish in the screenshots.

And yes, you are absolutely correct in regards to the 26TB SHR1 volume. From what i can see online and in a few youtube videos, mdadm is the way to go, and there is a strong chance that the raid can be rebuilt (if nothing else, long enough for me to take a backup of the data). I am in no way familiar with mdadm though, so i am hoping somebody in here is.

 

In the log i am seeing this:

 

Internal disks woke up from hibernation.
Write error at internal disk [8] sector 10094696.
Storage Pool [1] was crashed.

 

 

13 minutes ago, NooL said:

Internal disks woke up from hibernation.
Write error at internal disk [8] sector 10094696.
Storage Pool [1] was crashed.

 

that should not destroy the raid or lvm information. shr1 is made to survive 1 disk failing; every raid set created in the shr1 process has 1 disk of redundancy

also an additional warning about putting together the raid sets

- my assumptions above were that the whole volume was created in one step; if there were more steps, by adding disks and extending the volume, things will look different

- the 2TB partition size of the 8x2TB and 5x2TB sets might be the same, so combining/forcing the wrong 2TB partitions together might end up in a complete mess

look for a size difference between the 2TB partitions on a disk with 4 or 8 TB, and i guess every raid set will have a kind of unique identifier (have not looked into stuff like this for >2 years, can't remember), so look for something like this before forcing mdadm to do something

- you can do tests/training by using a vm on a computer with virtual disks (thin disks that take no space), like how to repair with mdadm. there are howtos and videos about this and you can practice with a vm (like in VirtualBox)
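On the "unique identifier" point: with version 1.2 metadata, `mdadm --examine` prints an `Array UUID` line for each member partition, so partitions can be sorted into raid sets before anything is forced. A minimal sketch, where `array_uuid` and `same_array` are made-up helper names and the expected UUID is a placeholder you would have to fill in from your own logs:

```shell
# Print the Array UUID from `mdadm --examine` output read on stdin.
array_uuid() {
    awk -F' : ' '/Array UUID/ { print $2 }'
}

# Succeed only if partition $1 carries array UUID $2 (read-only check).
same_array() {
    [ "$(mdadm --examine "$1" 2>/dev/null | array_uuid)" = "$2" ]
}

# Usage (as root); "<expected uuid>" is a placeholder:
#   for p in /dev/sd[g-n]5; do
#       same_array "$p" "<expected uuid>" && echo "$p ok" || echo "$p DIFFERENT"
#   done
```

A partition that reports a different UUID than its neighbours of the same size most likely belongs to the other raid set and should not be forced into this one.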

 


I believe one other disk is in a degraded state as it has 1 bad sector, so that's probably why this triggered it.

 


I have a friend who works professionally with unix/linux who will take a look at it monday, hopefully he can work some magic.

 

If there are any suggestions or help, please keep it coming :)

 

Thanks a lot for the help so far - will keep you guys updated if i don't hear more in here :)


one more thing, you should check the cables, in case the 4 disks in question belong to the same 4x cable of one SFF-8087 connector


Yeah the raid was expanded as i used one of the 8tb as a temp storage while i built the NAS.

 

They do not belong to same cable, as the 8TB disks are on different connectors.

 

So its like the screenshot. 2,2,2,8 and 4,4,4,8 on the SAS controller.

 

My theory, but im not sure.. do correct me if its not plausible.

 

I already had one disk (Disk 9) which had 2 bad sectors (will this flag the disk as degraded?). This morning i then got a disk write failure (the one from the log) on disk 8 - which caused it to drop the raid, as it's a SHR1? I'm not sure if this is it though, as the raid was marked as healthy in the UI - even with the 2 bad sectors on one of the disks.


Unfortunately my friend has not had time to look into it yet, but after some intense googling i found that the space_history files contain the disk/partition order, i believe.

 

From before the crash it looked like this:

 

Quote

 <space path="/dev/vg1000/lv" reference="/volume1" uuid="G4q5ff-c3BR-6ihp-Qmuk-pq5v-m7I2-zzM32O" device_type="1" drive_type="0" container_type="2" limited_raidgroup_num="24" >
        <device>
            <lvm path="/dev/vg1000" uuid="yx35J7-gCHW-nzOy-v0k6-ZLpw-d2hL-VUz9is" designed_pv_counts="3" status="normal" total_size="25970438307840" free_size="0" pe_size="4194304" expansible="0" max_size="25361762752">
                <raids>
                    <raid path="/dev/md4" uuid="5d73d851:ab74cc15:06e86f73:12be8032" level="raid1" version="1.2">
                        <disks>
                            <disk status="normal" dev_path="/dev/sdj7" model="ST8000DM004-2CX188      " serial="x" partition_version="8" partition_start="7813846336" partition_size="7813999904" slot="1">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdn7" model="ST8000DM004-2CX188      " serial="x" partition_version="8" partition_start="7813846336" partition_size="7813999904" slot="0">
                            </disk>
                        </disks>
                    </raid>
                    <raid path="/dev/md2" uuid="0054f457:e476d61a:5e5b91d9:89d0e1eb" level="raid5" version="1.2">
                        <disks>
                            <disk status="normal" dev_path="/dev/sdg5" model="WD20EARX-00PASB0        " serial="WD-x" partition_version="8" partition_start="9453280" partition_size="3897368960" slot="0">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdh5" model="WD20EACS-11BHUB0        " serial="WD-x" partition_version="8" partition_start="9453280" partition_size="3897368960" slot="1">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdi5" model="HDS722020ALA330         " serial="x" partition_version="8" partition_start="9453280" partition_size="3897368960" slot="2">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdj5" model="ST8000DM004-2CX188      " serial="x" partition_version="8" partition_start="9453280" partition_size="3897368960" slot="6">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdk5" model="ST4000DM000-1F2168      " serial="x" partition_version="8" partition_start="9453280" partition_size="3897368960" slot="3">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdl5" model="ST4000DM000-1F2168      " serial="x" partition_version="8" partition_start="9453280" partition_size="3897368960" slot="4">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdm5" model="WD40EFRX-68WT0N0        " serial="x-x" partition_version="8" partition_start="9453280" partition_size="3897368960" slot="5">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdn5" model="ST8000DM004-2CX188      " serial="x" partition_version="8" partition_start="9453280" partition_size="3897368960" slot="7">
                            </disk>
                        </disks>
                    </raid>
                    <raid path="/dev/md3" uuid="01aab698:390b4ffb:c955e83b:5b358a42" level="raid5" version="1.2">
                        <disks>
                            <disk status="normal" dev_path="/dev/sdj6" model="ST8000DM004-2CX188      " serial="x" partition_version="8" partition_start="3906838336" partition_size="3906991904" slot="3">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdk6" model="ST4000DM000-1F2168      " serial="x" partition_version="8" partition_start="3906838336" partition_size="3906991904" slot="0">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdl6" model="ST4000DM000-1F2168      " serial="x" partition_version="8" partition_start="3906838336" partition_size="3906991904" slot="1">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdm6" model="WD40EFRX-68WT0N0        " serial="x" partition_version="8" partition_start="3906838336" partition_size="3906991904" slot="2">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdn6" model="ST8000DM004-2CX188      " serial="x" partition_version="8" partition_start="3906838336" partition_size="3906991904" slot="4">
                            </disk>
                        </disks>

 

 

is this any help?


that looks useful

one logical volume

3 raid sets

sdg to sdn (8 disks)

/dev/md2 over all 8 disks, 2TB from every disk

/dev/md3 over 5 disks, 2TB from each, slightly different in size from md2

/dev/md4 over the 2 x 8TB disks, 4TB partitions

exactly as expected

you have the mdadm infos for every partition, even the start sectors and sizes of the partitions

also the uuid for every raid set

what i can't see is how the 3 raid sets were arranged to be one volume

i guess with this information you should be able to use a rescue linux and read all data

if you find out enough about how synology handles the stuff internally you could also try to repair it in synology - but afaik there is no documentation, so the safe way would be a rescue linux and then offload the data

how about this to continue?

 


Yeah i've been reading up on that, and thats the one that gave me hope. 

From what i read, it should be possible to mdadm assemble the raid in the correct order and get access to the data again. 
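A rough way to pull the member order out of a saved copy of the space_history file, as a sketch only: `raid_members_by_slot` is a made-up helper name, and it assumes one XML element per line, as in the quote above. As far as I know `mdadm --assemble` reads each member's role from its own superblock, so the order on the command line is not critical; the list mainly helps to check that no partition is missing or mixed in from another raid set.

```shell
# List the member partitions of one raid from a space_history XML file,
# ordered by their slot attribute. Hypothetical helper, line-based parsing.
#   $1 = md device path as written in the file (e.g. /dev/md3)
#   $2 = path to a saved copy of the space_history XML
raid_members_by_slot() {
    awk -v target="$1" '
        # entering a <raid ...> element: remember whether it is the wanted one
        /<raid / { in_raid = (index($0, "path=\"" target "\"") > 0) }
        in_raid && /<disk / {
            dev = $0;  sub(/.*dev_path="/, "", dev); sub(/".*/, "", dev)
            slot = $0; sub(/.* slot="/, "", slot);   sub(/".*/, "", slot)
            print slot, dev
        }
    ' "$2" | sort -n | awk '{ printf "%s%s", sep, $2; sep = " " } END { print "" }'
}

# Usage (file name is a placeholder):
#   raid_members_by_slot /dev/md3 space_history.xml
```

For the md3 entries quoted above, the slot values 0 to 4 put the partitions in the order sdk6, sdl6, sdm6, sdj6, sdn6.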


SUCCESS!!!

 

I now have access to my data!

 

mdadm --assemble --force --run /dev/md2 /dev/sdg5 /dev/sdh5 /dev/sdi5 /dev/sdk5 /dev/sdl5 /dev/sdm5 /dev/sdj5 /dev/sdn5 -v

 vgchange -a y vg1000  

 lvm lvchange -ay /dev/mapper/vg1000-lv

mount /volume1

 

were the magic commands.

 

Any ideas on how to re-create the Meta-data of Shared Folders? I can see in the log that it was removed after reboot following the crash.

 

Edited by NooL
5 hours ago, NooL said:

mdadm --assemble --force --run /dev/md2 /dev/sdg5 /dev/sdh5 /dev/sdi5 /dev/sdk5 /dev/sdl5 /dev/sdm5 /dev/sdj5 /dev/sdn5 -v

 

/dev/md2 is just 8 x 2TB raid5, usable space 14TB

and what about /dev/md3 and /dev/md4?

your volume1 was more than this 14TB

 

 


Same deal basically with md3 - md4 was fine already.

 

mdadm --assemble --force --run /dev/md3 /dev/sdk6 /dev/sdl6 /dev/sdm6 /dev/sdj6 /dev/sdn6 -v 

 

i tried to --re-add at first with md3 since it was just missing 1 drive, but the event counts were too far apart, so i had to re-assemble it with the command above :)
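For anyone wanting to see that event count gap before deciding between --re-add and a forced assemble, here is a small sketch. `event_count` and `stale_members` are made-up helper names; the only mdadm call involved is the read-only `mdadm --examine`, which prints an `Events` line per member.

```shell
# Pull the event counter out of `mdadm --examine` output read on stdin.
event_count() {
    awk -F' : ' '/^ *Events/ { print $2 }'
}

# stdin: "<device> <events>" lines. Prints every member that is behind the
# highest counter, followed by how many events it is behind.
stale_members() {
    awk '{ dev[NR] = $1; ev[NR] = $2 + 0; if (ev[NR] > max) max = ev[NR] }
         END { for (i = 1; i <= NR; i++) if (ev[i] < max) print dev[i], max - ev[i] }'
}

# Usage (as root), e.g. for the md3 members:
#   for p in /dev/sdk6 /dev/sdl6 /dev/sdm6 /dev/sdj6 /dev/sdn6; do
#       printf '%s %s\n' "$p" "$(mdadm --examine "$p" | event_count)"
#   done | stale_members
```

A member that lags behind was dropped from the array earlier than the others, which is the situation where mdadm will not take it back without --force.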


And the raid parity check went with no problems.

 

BUT - i am still not seeing the "Shared Folders" in the GUI - they are there and i can copy data via SSH - but i am not seeing them in GUI - which is making everything a bit harder as I cannot use the built in tools to transfer (Windows share, ftp, etc etc)

 

Does anybody have an idea how i can restore the Shared Folder metadata that was lost in the crash/reboot? :)


SUCCESS AGAIN! 

 

Synocheckshare was the magic command to get my shares back and running. 

 

Everything is now as if nothing had happened. 


and what do we learn from that (besides the things about mdadm)? having a backup of the (important) data is nice ;-)

(at least that's what i took from that kind of situation years ago - my raid broke on a real synology because of a crash caused by a usb connected UPS that was on the synology supported list)
