satdream

Help: I lost access to ~45 TB of data ...


Hi everybody,

 

I am in a big mess and need your expertise/support (I am French, sorry for my English).

 

Configuration

Hardware

HP Gen8 MicroServer, i3 processor upgrade, 12 GB RAM + a PCIe dual eSATA board (HP 120) with:

  • the 4 main HDDs from the Gen8's integrated bays connected to the HP 120 (via the HP 4-disk eSATA cable)
  • an external extension box with 4 additional HDDs (a mix of SAS/SATA disks) connected via an eSATA cable
  • 2 cache SSDs on the Gen8's onboard eSATA via an eSATA-to-4×SATA cable (2 connectors remain free)

Total: 10 disks, SHR/Btrfs, ~45 TB available, with tolerance for 1 HDD crash.

 

Software

SD card boot with Jun's loader 1.02b for DS3617xs + DSM 6.1.2, unchanged for a long time (I only need NAS access with shared folders etc., plus DS Music and 1 VM).

 

Context

My home PCs' main folders (Documents, Pictures, etc.) live on the NAS via symbolic links from Windows 10 = no data on the individual PCs, everything on the NAS.

 

Issue

Nothing happened for ~3 years until yesterday: during a basic check on DSM, a widget suddenly crashed and the web UI froze.

I rebooted via the SSH connection as root, and then... nothing: impossible to connect to the web UI, and SSH no longer responds either.

 

Tried to solve

  • reflashed the loader 1.02b/DS3617xs as it was = no detection
  • reflashed the loader 1.02b/DS3617xs forcing a DSM 6.1.2 reinstall = impossible to reconnect to the web UI after the initial Synology Assistant configuration
  • reflashed the loader 1.03b/DS3617xs with a DSM 6.2 install = detects the 8 HDDs but not the 2 SSDs = parity check OK but no volume
  • reflashed the loader 1.03b/DS3615xs with a DSM 6.2 install = detects the 8 HDDs but not the 2 SSDs = parity check OK but no volume

=> DSM reports the volume 1 pool as failed and says the read-write SSD cache of the volume is missing; it asks to shut down, reinsert the SSDs, reboot, etc.
=> the SSDs are fine and correctly detected by the HP Gen8 during boot

=> no SSH access, even after reconfiguring it

 

It is catastrophic, as my whole "life" is on this NAS: family pictures/movies (my wedding, memories of deceased relatives, etc.), working documents, music, and so on.

 

As all the HDDs are OK, I am pretty sure the data are still intact; it is only a BIG issue with DSM.

 

Do you have any idea/suggestion to help?

 

Thanks for your support (sorry for the alarming wording, but this is a critical issue for me... and obviously I have no backup of ~45 TB of data!)

 

Fred.

PS: I was ready to buy a "real" physical Synology DS3617xs hoping it would solve something, but it does not support the SAS HDDs I have... :-(


 


My mistake: the CPU is an i5-3470T, and the PCIe board is a Dell H310 6Gbps SAS HBA (LSI 9211-8i, P20 IT-mode firmware), not an eSATA card... but it works in any case, as the connected HDDs are recognized.


Hi Satdream,

 

if you are sure that the SSDs are OK, one option to get access to all your data and make a copy is to follow the tutorial on mounting the volume from an Ubuntu live CD.

 

 


Thanks Jarugut,

 

The issue is that to back up / copy 45 TB I would need a set of 5 additional 12 TB disks (but thanks; given the criticality of the data that is acceptable, even though I have no other machine that can hold 5 disks).

 

But I continued to investigate: as I have a free small SSD and 2 other small unused disks, I ran a few tests on the hardware and loader configurations.

 

I connected:

- the SSD to the HP Gen8's internal SATA (instead of the cache SSDs)

- HDD1 in an HP Gen8 bay connected to the Dell H310

- HDD2 in the extension rack connected to the Dell H310

 

and I discovered that:

 

1.02b + 6.1.2 = SSD OK, disk in Gen8 OK, disk in extension OK, DSM install OK

1.03b + 6.2 = SSD not OK, disk in Gen8 OK, disk in extension OK, DSM install OK

1.03b + 6.2.2 (thanks to IG-88's huge work on the driver extension) = SSD not OK, disk in Gen8 OK, disk in extension OK, DSM install OK

 

Then I tried with only the SSD and no HDD... same results: working with 1.02b and not working with 1.03b!

 

First conclusions:

- except with 1.02b, I do not see the SSD, i.e. the Gen8's integrated SATA controller... very strange, as others here have working Gen8 environments set up like this

- I probably bricked my install with my tests with the other loader and the higher DSM release :-( (OK, I was completely in crisis mode and did not think to evaluate the new config with test drives first)

 

Solutions?

- Perform a downgrade? Not easy, as all the slots are occupied; swap out the SATA1 HDD for a new empty one and apply the downgrade tutorial?

- Investigate why the basic internal SATA controller is not handled correctly, given that the add-on H310 works fine?

 

Question: does anybody have a working Gen8 with the internal SATA controller + an extension board? Is something different in the grub.cfg configuration between 1.02 and 1.03? Or is there any change to make in the BIOS configuration?

 

Thanks again for your support and suggestions; in this Christmas period, I do hope a few experts here are reading ;)

 

 

6 hours ago, satdream said:

read-write SSD cache of the volume is missing

 

Another one bites the dust.  At some point folks should heed the warnings about this time bomb waiting to happen. 

 

What type of SSD drive were you using for cache?  Were you monitoring for SSD health?  How much SSD lifespan do you believe was remaining?

8 hours ago, flyride said:

 

Another one bites the dust.  At some point folks should heed the warnings about this time bomb waiting to happen. 

 

What type of SSD drive were you using for cache?  Were you monitoring for SSD health?  How much SSD lifespan do you believe was remaining?

 

Thanks; they are pretty new: 2× Samsung 850 EVO 120 GB, installed ~6 months ago after the death of a SanDisk SSD (I changed both at once). As they are Samsung SSDs, last night I was able to mount them (without damage to the data) on a Windows PC and check them with the Samsung Magician tool, with good health results.

 

I now suspect an issue in the BIOS configuration that needs customizing when going from the 1.02b to the 1.03b loader, or in the loader's configuration files... but no idea of the exact issue.

 

Thanks.


I made progress: I discovered the Gen8 BIOS had reset (?!?), and the only parameter to move back from the defaults was AHCI, which I had to reactivate... after that the SSDs are seen.

 

But now the issue is that with 1.03b the disk order is wrong, and before trying to recover with the 10 HDDs I need to have them in the correct order.

 

I tested with 1 SSD on the internal SATA, 1 HDD in an HP Gen8 internal bay (connected to the first Dell H310 port), and 1 HDD in the external rack (connected to the second Dell H310 port).

 

I tried several combinations of the SasIdxMap, DiskIdxMap and SataPortMap options, but the DSM installer sees the first 4 SATA disks correctly as Disk 1-4, while the SAS ones start at #7 instead of #5.

Also, the SAS card's disk order follows the order of the inserted HDDs, which is fine, but the starting disk ID is wrong; even with SataPortMap=FFFFFFFE it does not give ID #5 but #7.

 

Thanks.


 

18 hours ago, satdream said:

- I probably bricked my install with my tests with the other loader and the higher DSM release :-( (OK, I was completely in crisis mode and did not think to evaluate the new config with test drives first)

 

I tried to write up a plan to assist you but am not going to post it as I realize it won't help right now. You are still in crisis mode (your words) and need to get control of the situation. Whenever an event like this happens, the objective should be to minimize the change in the system.  Unless you are very experienced, you won't know what changes may make recovery impossible. Given the frantic and unstructured effort thus far, I have little hope for recovery, but if there is any chance for online help: be patient, slow down the change rate and post more comprehensive information about the system state. 

  1. What was the event that caused the original outage?  You stated "basic check on DSM" what is that specifically? Was it a DSM upgrade (planned or unplanned)?  Other system upgrade?  Hardware failure?
  2. What is the exact configuration of the system currently?  What type/size of drives connected to what controllers?  List everything out, don't summarize. What loader, what DSM version?
  3. Does DSM boot?  What is the state of the system?  Can all the drives be seen?   Does the volume show crashed?  Is there a disk pool state?  Post screenshots.

If DSM boots, SSH should be accessible.  Can you validate the SSH/TELNET configuration in the Control Panel?  Are you running a custom port?  Getting to a shell will be important to improve your chances for recovery.

 

If it isn't clear from the recommendations above: if you want help here, STOP CHANGING THINGS and gather more information.

16 hours ago, flyride said:
  1. What was the event that caused the original outage?  You stated "basic check on DSM" what is that specifically? Was it a DSM upgrade (planned or unplanned)?  Other system upgrade?  Hardware failure?
  2. What is the exact configuration of the system currently?  What type/size of drives connected to what controllers?  List everything out, don't summarize. What loader, what DSM version?
  3. Does DSM boot?  What is the state of the system?  Can all the drives be seen?   Does the volume show crashed?  Is there a disk pool state?  Post screenshots.

If DSM boots, SSH should be accessible.  Can you validate the SSH/TELNET configuration in the Control Panel?  Are you running a custom port?  Getting to a shell will be important to improve your chances for recovery.

 

Many thanks for your consideration of my issues and state of mind; I feel calm enough now to give a better status.

 

1. I have no idea what happened, but it looks like either an SD card issue or a BIOS one... as I eventually had problems reflashing the bootloader on the original SD card, and the BIOS configuration had reset.

 

2. The exact configuration and status are shown here (including the bad recovery attempts :-(; I am now on 1.03b + drivers / DSM 6.2.2, accessible):

[screenshot]

 

 

3. Not all drives can be seen: that is now the problem. In "Legacy" mode I see all the HDDs but not the SSD cache, and in "AHCI" mode I see the SSD cache but part of the disks is missing.

 

I ran positioning tests with test disks (no problem if they get wiped), and the results are here:

[screenshot]

 

Current conclusion: I need some help to better understand the configuration, in order to see all drives in the AHCI configuration.

 

Thanks for your advice and expert support!


Well, this is a start.  It seems that you have two choices:

  1. Revert the DSM back to your original 6.1.x version (where all your drives could be accessible).  There are no "unknowns" here as it clearly was working before; there is no reason it cannot work again.  I still think we are missing some information as to why this happened in the first place (what was done with "basic check on DSM" in first post).
  2. Troubleshoot 6.2.2 controller configuration to try and get all drives accessible.  There are many "unknowns" here and you are essentially doing an initial install and hoping your data will come over cleanly.  I think this path is a mistake if the safety of your data is important, but this is the path you have been following.

My advice is to go back to #1 and your original configuration.  Do you have a backup copy of your original loader?  If not, can you configure it exactly as you did before?

 

You have not mentioned your DSM code base, but I assume DS3615xs, please confirm.

 

Rather than trying to install DSM back onto your old drives, disconnect all data drives (including cache) except a new 1TB test disk installed to an unused motherboard SATA port.  Then install 1.02b loader and basic 6.1.7 DSM clean install.  Don't build a Storage Pool or volume, just get it to boot up and Control Panel/Disk Manager accessible.  Set up SSH access and report back.

2 hours ago, flyride said:

Well, this is a start.  It seems that you have two choices:

  1. Revert the DSM back to your original 6.1.x version (where all your drives could be accessible).  There are no "unknowns" here as it clearly was working before; there is no reason it cannot work again.  I still think we are missing some information as to why this happened in the first place (what was done with "basic check on DSM" in first post).
  2. Troubleshoot 6.2.2 controller configuration to try and get all drives accessible.  There are many "unknowns" here and you are essentially doing an initial install and hoping your data will come over cleanly.  I think this path is a mistake if the safety of your data is important, but this is the path you have been following.

My advice is to go back to #1 and your original configuration.  Do you have a backup copy of your original loader?  If not, can you configure it exactly as you did before?

 

You have not mentioned your DSM code base, but I assume DS3615xs, please confirm.

 

Rather than trying to install DSM back onto your old drives, disconnect all data drives (including cache) except a new 1TB test disk installed to an unused motherboard SATA port.  Then install 1.02b loader and basic 6.1.7 DSM clean install.  Don't build a Storage Pool or volume, just get it to boot up and Control Panel/Disk Manager accessible.  Set up SSH access and report back.

 

Yes, thanks. I also consider the first alternative the most secure, and I will perform the test you suggest.

 

But FYI, I continued to investigate.

 

The HP Gen8 MicroServer hardware includes an internal eSATA/4×SATA connector + 2 internal SATA ports (for CD-ROM purposes, the "ODD" connector).

 

- In Legacy mode (not compatible with loader 1.03b) the Gen8 presents the 4+2 SATA ports as distinct interfaces, and with "SataPortMap=44" in grub.cfg (the second 4 has no impact) they are restricted to indexes #1 to #4

=> the HDDs connected to the SAS card are then mapped in order (SasIdxMap=0) starting at index #5

- In AHCI mode the Gen8 presents a single SATA controller with 4+2 connectors = 6 ports, and "SataPortMap=4" has no effect in restricting the port count; I get indexes #1 to #6

=> consequently the HDDs connected to the SAS card are mapped starting at index #7

 

Also, I extended the number of supported disks in synoinfo.conf from 12 to 14, but the missing disks in the enclosure are still not displayed by DSM, even though they are recognized by the system, as this dmesg extract shows:

 

[    1.726330] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) = 1st connected SATA disk detected
...
[    1.727050] sd 0:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/111 GiB) = SSD1 = sda => mount as Disk 1
...
[    2.190444] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300) = 2nd connected SATA disk detected
...
[    2.205887] sd 1:0:0:0: [sdb] 234441648 512-byte logical blocks: (120 GB/111 GiB) = SSD2 = sdb => mount as Disk 2
...
[    2.511531] ata3: SATA link down (SStatus 0 SControl 300) = sdc = nothing more connected on Gen8 SATA
[    2.816612] ata4: SATA link down (SStatus 0 SControl 300) = sdd
[    3.121693] ata5: SATA link down (SStatus 0 SControl 300) = sde
[    3.426775] ata6: SATA link down (SStatus 0 SControl 300) = sdf
...
[    3.821242] mpt2sas0: LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.39.02.00) = PCIe board detected
[    3.821243] mpt2sas0: Dell 6Gbps SAS HBA: Vendor(0x1000), Device(0x0072), SSVID(0x1028), SSDID(0x1F1C)
...
[    6.155956] sd 6:0:0:0: [sdg] 9767541168 512-byte logical blocks: (5.00 TB/4.54 TiB) = HDD1 = sdg => mount as Disk 7
...
[    6.655951] sd 6:0:1:0: [sdh] 9767541168 512-byte logical blocks: (5.00 TB/4.54 TiB)  = HDD2 = sdh => mount as Disk 8
...
[    7.156168] sd 6:0:2:0: [sdi] 9767541168 512-byte logical blocks: (5.00 TB/4.54 TiB)  = HDD3 = sdi => mount as Disk 9
...
[    7.656316] sd 6:0:3:0: [sdj] 9767541168 512-byte logical blocks: (5.00 TB/4.54 TiB)  = HDD4 = sdj => mount as Disk 10
...
[   22.320896] sd 6:0:4:0: [sdk] 19532873728 512-byte logical blocks: (10.0 TB/9.09 TiB)  = HDD6 = sdk => mount as Disk 11 = SAS disk !!!
...
[   27.755589] sd 6:0:5:0: [sdl] 19532873728 512-byte logical blocks: (10.0 TB/9.09 TiB) = HDD8 = sdl => mount as Disk 12 = SAS disk !!!
...
[   28.160476] sd 6:0:6:0: [sdm] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB) = HDD5 = sdm = Not Mounted = SATA disk !!
...
[   28.691687] sd 6:0:7:0: [sdn] 19532873728 512-byte logical blocks: (10.0 TB/9.09 TiB) = HDD7 = sdn = Not Mounted = SATA disk !!

 

But what is very strange is that during my tests with the test disks, I installed SATA disks in the enclosure and they were mounted. So there are two possibilities: either a mix of SAS and SATA now prevents mounting (why?!?), or I missed something in the grub.cfg for loader 1.03b; the current parameter is:

 

set sata_args='sata_uid=1 sata_pcislot=5 synoboot_satadom=1 DiskIdxMap=0E SataPortMap=68 SasIdxMap=0'
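For readers puzzling over the same line, here is a field-by-field reading of these options as they are commonly described in loader threads (my interpretation, not an authoritative reference):

```shell
# grub.cfg excerpt (loader 1.03b), with a hedged reading of each option:
#   sata_uid / sata_pcislot / synoboot_satadom : identify the boot device (SATA DOM emulation)
#   DiskIdxMap=0E  : the first controller's disks start at hex index 0x0E,
#                    which pushes the boot device out of the visible slot range
#   SataPortMap=68 : the first SATA controller contributes 6 ports, the second 8
#   SasIdxMap=0    : offset added to the indices of SAS (mpt2sas) disks
set sata_args='sata_uid=1 sata_pcislot=5 synoboot_satadom=1 DiskIdxMap=0E SataPortMap=68 SasIdxMap=0'
```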

 

My assumption is a setup with 2 SATA controllers (note: if I declare only one controller with 6 ports, the disk indexes get mixed up but 2 disks are still missing).

 

I will perform the test with the 1.02b loader and test HDDs, as I still have the original loader configuration/grub.cfg.

 

Many thanks !


SOLVED

 

Hi everybody, a big thanks to all of you who supported me, and thanks to this forum and others... as I finally solved my issue.

 

@flyride: many thanks for your support and suggestions. I tried one last change in the DSM configuration last night (before doing the test with 1.02b + 6.1.7 + the 1 TB HDD)... "but"... the system found all the disks and initiated a recovery, so I could not perform the tests, and the rescue target ended up being DS3615xs / 1.03b / DSM 6.2.2, as that is where the recovery started.

 

Details:

 

- my description was a bit wrong: the Gen8 does not have an internal eSATA connector but an SFF-8087 miniSAS one, the same as the two SFF-8087 miniSAS ports on the Dell H310; the external enclosure is SFF-8088 SAS, connected to the H310 with a classic SFF-8088 to SFF-8087 cable

= no eSATA at all

 

- investigating the dmesg output, I discovered that the reported number of HDDs/SSDs is correct but the disk mapping is wrong, with 2 disks missing

= I remembered that DSM does not display eSATA drives among the disks available for the pool

 

- thanks to the disk pool structure, the order and mapping have no impact on DSM (I strongly recommend creating a disk pool before the volume, or converting the volume to a pool, in order to be independent of the disk mapping IDs)

 

- finally, following this guide (a bit old, as the conf files have changed and it does not address externally mapped SAS SCSI drives, etc.): https://hedichaibi.com/fix-xpenology-problems-viewing-internal-hard-drives-as-esata-hard-drives/ I updated the conf files to:

  1. Disable eSATA by setting esataportcfg to 0x00000 (instead of the default 0xff000) = 0 eSATA ports
  2. Match the exact USB ports by setting usbportcfg to 0x7C000 (instead of the default 0x300000) = 5 USB ports
  3. Match the exact disk ports by setting internalportcfg to 0x3FFF (instead of the default 0xfff) = 14 disk ports
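These three masks are just bitfields over the slot indices: internal disk slots occupy the lowest bits, then USB ports, then eSATA above those. A quick sketch to derive the values (my reading of the scheme; the port counts are the ones from the list above):

```shell
# Rebuild the three synoinfo.conf masks from a port count per class.
# Assumed layout: internal slots in the lowest bits, USB above them, eSATA above USB.
internal=14   # disk slots
usb=5         # USB ports
esata=0       # eSATA ports (disabled entirely)

internalportcfg=$(( (1 << internal) - 1 ))                  # 14 low bits set
usbportcfg=$(( ((1 << usb) - 1) << internal ))              # next 5 bits
esataportcfg=$(( ((1 << esata) - 1) << (internal + usb) ))  # nothing left

printf 'internalportcfg=0x%x\n' "$internalportcfg"   # 0x3fff
printf 'usbportcfg=0x%x\n'      "$usbportcfg"        # 0x7c000
printf 'esataportcfg=0x%x\n'    "$esataportcfg"      # 0x0
```

The printed values match the three edits above, which is a useful sanity check before touching the real synoinfo.conf.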

 

Then, after rebooting (the cited guide is not correct regarding external SAS; a reboot is mandatory), DSM found all the disks, with a lot of errors, some of them "strange":

 

- one SSD out of sync = corrected immediately via a resync

- a wrong system partition on one disk = repaired right away (in SHR the system partition is duplicated across several disks)

- one disk not included in the disk pool = waiting for repair

- one disk faulted = waiting to understand and repair

 

But... all the data are available (from a few spot checks picking files here and there), which is very surprising if one disk is out of the pool and another is faulted, given a tolerance of only 1 disk crash. I suspect a wrong report, but I have to wait to finish checking.

 

The current status is that DSM is performing a global disk consistency check; it will take ~24h to verify, then I will reboot and check whether anything remains wrong.

 

But for the moment I am able to back up, not all the data, but the most critical parts!

 

Thanks.


Well that may be good news; hopefully it all works out for you!

 

When this is all done (and you have your data back) please consider changing your r/w SSD cache to read only.  Performance improvement from r/w SSD cache can be achieved far more safely with more RAM (and most "normal" NAS users will never see benefit from r/w cache anyway).

 

Another way to look at it is this:

If r/w cache crashes, your volume is corrupt (that is why Synology wants RAID 1 so that a simple hardware failure does not destroy a volume).

If r/o cache crashes, you can just delete the cache and your volume is intact.

17 hours ago, flyride said:

Well that may be good news; hopefully it all works out for you!

 

When this is all done (and you have your data back) please consider changing your r/w SSD cache to read only.  Performance improvement from r/w SSD cache can be achieved far more safely with more RAM (and most "normal" NAS users will never see benefit from r/w cache anyway).

 

Another way to look at it is this:

If r/w cache crashes, your volume is corrupt (that is why Synology wants RAID 1 so that a simple hardware failure does not destroy a volume).

If r/o cache crashes, you can just delete the cache and your volume is intact.

 

Thanks. I was using the SSD cache in r/o, never in r/w, just as a convenience to speed up video operations (e.g. fast seeking back from cache).

 

But I tried deleting it as suggested; the volume still shows a failed HDD, namely a 10 TB Seagate IronWolf Pro, a NAS-specific drive and pretty new! = I will replace it ASAP, then check it and possibly make a warranty claim.

 

It remains surprising that one disk is faulted (the 10 TB) and another (a 5 TB) is not in "Normal" status but "Initialized"... I do not understand why it is not "Normal", given that everything works and all the data are available... and the faulty 10 TB was excluded from the disk pool/volume!

 

Thanks for your advice.

On 12/21/2019 at 10:35 AM, satdream said:

=> DSM reports the volume 1 pool as failed and says the read-write SSD cache of the volume is missing; it asks to shut down, reinsert the SSDs, reboot, etc.

 

12 hours ago, satdream said:

Thanks. I was using the SSD cache in r/o, never in r/w, just as a convenience to speed up video operations (e.g. fast seeking back from cache).

 

The above from your initial post, then your most recent report.

 

FWIW a drive with status "Initialized" means it is not in use in an array.

 

11 hours ago, flyride said:

 

 

The above from your initial post, then your most recent report.

 

FWIW a drive with status "Initialized" means it is not in use in an array.

 

Yes, I quoted the exact message shown by DSM; it is a generic message about the installed SSD structure that mentions a "read-write SSD cache"... but it was definitely configured as read-only :)

 

I fully agree that the "Initialized" status means not currently in use in the disk pool, but then either the status report is wrong, or access to the data should have been impossible, since the faulty HDD is also out of the pool... and 2 disks out should not allow access to the data (or I am missing something in the pool mechanisms).

 

Current status is:

 

- after 24 hours of parity (consistency) check = OK

- 2× SSDs excluded (cache switched off)

- 3× 5 TB + 1× WD 8 TB + 2× Seagate Exos = Normal

- 5 TB Toshiba = Initialized

- 10 TB Seagate IronWolf Pro = still faulted

=> full SMART test on the IronWolf Pro = ongoing (~20h planned)

=> spare 10 TB disk ordered for exchange ASAP

 

Thanks !

 

PS: for Gen8 users with mixed SATA/SAS disks: do not set "SasIdxMap=0"; it must not be present in grub.cfg at all if you want a mix of disks to be detected. If it is set, the SATA disks are not mapped by DSM.


Hi again,

 

latest news: after 20 hours of testing, the IronWolf Pro 10 TB is assessed as "Normal" without errors... 😰

 

But DSM still shows it as faulty... I do not understand (again, sorry), as the configuration is:

 

Toshiba 5TB: sdg, sdj, sdk, sdl

Seagate Exos 10TB: sdh, sdi

Western Digital 8TB: sdn

Seagate IronWolf Pro 10TB: sdm

 

And I understand that the "Initialized" status is probably correct for the Toshiba 5 TB, while the "faulty" IronWolf Pro 10 TB is fully operational...

 

@flyride: could you please confirm my analysis below?

 

dmesg extract:

...
[   32.965011] md/raid:md3: device sdk6 operational as raid disk 1 = Toshiba 5 TB
[   32.965013] md/raid:md3: device sdm6 operational as raid disk 7 = IronWolf Pro 10 TB
[   32.965013] md/raid:md3: device sdh6 operational as raid disk 6 = Exos 10 TB
[   32.965014] md/raid:md3: device sdi6 operational as raid disk 5 = Exos 10 TB
[   32.965015] md/raid:md3: device sdn6 operational as raid disk 4 = WD 8 TB
[   32.965016] md/raid:md3: device sdg6 operational as raid disk 3 = Toshiba 5 TB
[   32.965016] md/raid:md3: device sdj6 operational as raid disk 2 = Toshiba 5 TB
[   32.965507] md/raid:md3: raid level 5 active with 7 out of 8 devices, algorithm 2
[   32.965681] RAID conf printout:
[   32.965682]  --- level:5 rd:8 wd:7
[   32.965683]  disk 1, o:1, dev:sdk6 = Toshiba 5 TB
[   32.965684]  disk 2, o:1, dev:sdj6 = Toshiba 5 TB
[   32.965685]  disk 3, o:1, dev:sdg6 = Toshiba 5 TB
[   32.965685]  disk 4, o:1, dev:sdn6 = WD 8 TB
[   32.965686]  disk 5, o:1, dev:sdi6 = Exos 10 TB
[   32.965687]  disk 6, o:1, dev:sdh6 = Exos 10 TB
[   32.965688]  disk 7, o:1, dev:sdm6 = IronWolf Pro 10 TB

mdstat returns:

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
md3 : active raid5 sdk6[1] sdm6[7] sdh6[6] sdi6[5] sdn6[4] sdg6[3] sdj6[2]
      6837200384 blocks super 1.2 level 5, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]
md2 : active raid5 sdk5[1] sdh5[7] sdi5[6] sdm5[8] sdn5[4] sdg5[3] sdj5[2]
      27315312192 blocks super 1.2 level 5, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]
md5 : active raid5 sdi8[0] sdh8[1]
      3904788864 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [UU_]
md4 : active raid5 sdn7[0] sdm7[3] sdh7[2] sdi7[1]
      11720987648 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
md1 : active raid1 sdg2[0] sdh2[5] sdi2[4] sdj2[1] sdk2[2] sdl2[3] sdm2[7] sdn2[6]
      2097088 blocks [14/8] [UUUUUUUU______]
md0 : active raid1 sdg1[0] sdh1[5] sdi1[4] sdj1[1] sdk1[2] sdl1[3] sdm1[6] sdn1[7]
      2490176 blocks [12/8] [UUUUUUUU____]

=> In all cases the Toshiba 5 TB (identified as "sdl") is never used, except in the md0/md1 system partitions (= status "Initialized")... while the "faulty" IronWolf Pro (identified as "sdm") is a working member of every array!
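The `[n/m] [_U...]` fields in the mdstat output above can be read mechanically: n slots, m active members, and each "_" marks a missing member. A small awk sketch over a saved copy of the output (sample lines reproduced from above; on the real box the same awk can read /proc/mdstat directly):

```shell
# Count missing members per md array from /proc/mdstat-style lines.
cat > mdstat.sample <<'EOF'
md3 : active raid5 sdk6[1] sdm6[7] sdh6[6] sdi6[5] sdn6[4] sdg6[3] sdj6[2]
      6837200384 blocks super 1.2 level 5, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]
md5 : active raid5 sdi8[0] sdh8[1]
      3904788864 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [UU_]
EOF

awk '
/^md/ { name = $1 }                          # remember the array name
/blocks/ && match($0, /\[[_U]+\]/) {         # the membership bitmap line
  map = substr($0, RSTART + 1, RLENGTH - 2)
  missing = gsub(/_/, "_", map)              # gsub returns the number of "_"
  printf "%s: %s -> %d member(s) missing\n", name, map, missing
}' mdstat.sample
```

For md3 this prints one missing member in slot 0, which matches the "removed" slot in the mdadm --detail output below.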

...

mdadm --detail /dev/md2 gives this result:


/dev/md2:
        Version : 1.2
  Creation Time : Thu Aug  3 09:46:31 2017
     Raid Level : raid5
     Array Size : 27315312192 (26049.91 GiB 27970.88 GB)
  Used Dev Size : 3902187456 (3721.42 GiB 3995.84 GB)
   Raid Devices : 8
  Total Devices : 7
    Persistence : Superblock is persistent

    Update Time : Thu Dec 26 18:41:14 2019
          State : clean, degraded
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : Diskstation:2  (local to host Diskstation)
           UUID : 42b3969c:b7f55548:6fb5d6d4:f70e8e8b
         Events : 63550

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      165        1      active sync   /dev/sdk5
       2       8      149        2      active sync   /dev/sdj5
       3       8      101        3      active sync   /dev/sdg5
       4       8      213        4      active sync   /dev/sdn5
       8       8      197        5      active sync   /dev/sdm5
       6       8      133        6      active sync   /dev/sdi5
       7       8      117        7      active sync   /dev/sdh5
 

So the RAID mechanism is working, as the data are accessible... but DSM does not change the status of the IronWolf Pro 😖

 

I was close to a big mistake: I must not replace the IronWolf Pro, because it is part of the working RAID; the 20-hour test (SMART Extended) on it, showing a correct status without corrupted sectors etc., confirms this.

 

How do I force DSM to change the status of the IronWolf Pro 10 TB from faulty to normal?!?

 

Because the Toshiba 5 TB is not integrated (is it perhaps the real faulty drive?), and DSM does not want to integrate it until the 10 TB is "replaced"... which I certainly must not do if I want to keep all the data.

 

Thanks !

 

 


You keep referring to /dev/sdm as faulty.  What makes you think it is faulty?  Post a screenshot of what is making you come to this opinion.  If it is a Smart status, please post the Smart detail page for that drive.

 

I suspect that you have some sectors flagged for replacement on /dev/sdm, and DSM is reporting the disk as "Failing" or "Failed" which doesn't actually mean anything other than the number of sectors exceeds the threshold you have set in DSM for bad sectors.  I also believe all your arrays are in critical state, so once we validate /dev/sdm you will probably want to initiate an array repair with /dev/sdl.  Only then should you consider action on how to correct /dev/sdm status.

 

Share this post


Link to post
Share on other sites

Here is the screenshot, but sorry, it is in French... the IronWolf is displayed as "En panne" (= "Failed" or "Broken") but with 0 bad sectors (but the SMART Extended test returned a few errors).

 

[screenshot]

 

 

 

The pool status shows "Degraded", with the "failed" drive shown with "Failed allocation status" and the rest of the disks normal:

[screenshot]

 

The list of disks (the unused Toshiba is displayed as "Initialized"):

[screenshot]

 

And the SMART status of the IronWolf

 

[screenshots]

 

Many Thanks !

 

 


Ok, try this:

  1. Note the serial number of /dev/sdm (the drive identified as Failing)
  2. ssh into your system and edit /var/log/disk_overview.xml
  3. Delete the xml tag for that specific drive (identified by the serial number)
  4. Reboot

Hopefully that will get rid of the drive failing error, which we believe to be spurious.

If everything then looks clean, initiate a repair using the remaining Toshiba "Initialized" drive.
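Step 3 can be scripted. Here is a sketch of the edit on a throwaway file with the same element structure (serials and models here are made up; on the real system the file is /var/log/disk_overview.xml, and keeping a backup first is prudent):

```shell
# Build a miniature disk_overview.xml and delete one drive's element by serial.
cat > disk_overview.xml <<'EOF'
<disks>
  <SN_AAAA1111 model="ST10000VN0004-1ZD101">
    <path>/dev/sdm</path>
  </SN_AAAA1111>
  <SN_BBBB2222 model="TOSHIBA-5TB">
    <path>/dev/sdl</path>
  </SN_BBBB2222>
</disks>
EOF

SERIAL=AAAA1111                                  # serial of the "Failing" drive
cp disk_overview.xml disk_overview.xml.bak       # backup before editing
# Remove everything from the opening <SN_$SERIAL ...> tag to its closing tag.
sed -i "/<SN_${SERIAL}[ >]/,/<\/SN_${SERIAL}>/d" disk_overview.xml
```

After the edit only the other drive's element remains; on DSM you would then reboot as in step 4.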

43 minutes ago, flyride said:

Ok, try this:

  1. Note the serial number of /dev/sdm (the drive identified as Failing)
  2. ssh into your system and edit /var/log/disk_overview.xml
  3. Delete the xml tag for that specific drive (identified by the serial number)
  4. Reboot

Hopefully that will get rid of the drive failing error, which we believe to be spurious.

If everything then looks clean, initiate a repair using the remaining Toshiba "Initialized" drive.

 

Done; I removed the XML tag for the IronWolf, but after reboot the status is still the same ... and I see that the former SSDs I removed from the install (disconnected) are still listed in disk_overview.xml.

 

What about performing a reinstall? Perhaps changing the serial ID of the 3615xs in order to initiate a migration? Would that reset the status? Since I would not need to reinstall applications/config etc., a migration would not be a big issue ...

 

Thanks a lot

 


Don't reinstall/migrate. That creates risk for no reason. Let's try renaming disk_overview.xml to disk_overview.xml.bak and rebooting again.

17 minutes ago, flyride said:

Don't reinstall/migrate.  That creates risk for no reason.  Let's try and rename the disk_overview.xml to disk_overview.xml.bak and reboot again.

 

No change. The regenerated disk_overview.xml no longer contains the disconnected SSDs, and each disk has exactly the same tag structure as the IronWolf:

 

  <SN_xxxxxxxx model="ST10000VN0004-1ZD101">
    <path>/dev/sdm</path>
    <unc>0</unc>
    <icrc>0</icrc>
    <idnf>0</idnf>
    <retry>0</retry>
  </SN_xxxxxxxx>
 

Where does DSM store the disk status? I remember reading that Synology has a customized version of the md driver and mdadm toolset that adds a 'DriveError' flag to the rdev->flags structure in the kernel ... but I don't know how to change it ...

 

Thx


"IronWolf is displayed as "En panne" (="Failed" or "Broken") but with 0 bad sectors (but the SMART Extended test returned few errors) "

 

Sorry, does this mean that the SMART Extended test reported errors, or that no errors were reported?  If errors were reported, what were they?  Based on that report, do you feel confident that the drive is okay?

 

If you think the drive is error-free, delete the Smart Test log history by removing /var/log/smart_test_log.xml and reboot again.

An hour ago, flyride said:

"IronWolf is displayed as "En panne" (="Failed" or "Broken") but with 0 bad sectors (but the SMART Extended test returned few errors) "

 

Sorry, does this mean that the SMART Extended test reported errors, or that no errors were reported?  If errors were reported, what were they?  Based on that report, do you feel confident that the drive is okay?

 

If you think the drive is error-free, delete the Smart Test log history by removing /var/log/smart_test_log.xml and reboot again.

 

Sorry for the confusing wording; I mixed up the SMART test and the SMART status ... ;)

 

- The SMART Extended test reported no errors, everything "Normal" = the final disk health status is "Normal":

 

image.thumb.png.b50f247ad4eb8d704d14fddc481fadaf.png

 

- But strangely (see the screenshot of the SMART details), the SMART status shows a few errors (granted, SMART details are sometimes a bit "complex" to analyze):

  • Raw_read error_rate
  • Seek_error_rate
  • Hardware_ECC_recovered

Checking a few forums, this seems not to be a real issue and is common with IronWolf drives ... given that the reallocated sector count is at 0 ...
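There is a commonly cited (unofficial) interpretation of Seagate's raw values for Raw_Read_Error_Rate and Seek_Error_Rate: the low 32 bits count total operations and the high bits count actual errors, so a large raw number alone is usually harmless. A sketch under that assumption (the sample value is hypothetical, not taken from the screenshots):

```shell
# Unofficial Seagate raw-value decoding: high bits = errors, low 32 bits = ops.
raw=120731022                    # hypothetical Seek_Error_Rate raw value
errors=$(( raw >> 32 ))          # high bits: real error count
ops=$(( raw & 0xFFFFFFFF ))      # low 32 bits: operations performed
echo "errors=$errors ops=$ops"
```

With this sample value the error count decodes to 0, which matches the pattern people report on healthy IronWolf drives.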

 

I feel pretty confident about the SMART Extended test, as it is a non-destructive physical test (a surface read scan of every disk sector) ...

 

In addition, IronWolf drives have a specific health test integrated into DSM, which I ran and which reported no errors (000. Normal):

 

image.thumb.png.e16d4b414272f15348893cb48725e263.png

 

I found no smart_test_log.xml, but:

 

/var/log/healthtest/dhm_<IronWolfSerialNo>.xz

/var/log/smart_result/2019-12-25_<longnum>.txz

 

I am keeping them pending your recommendation!

 

What about rebuilding the array?

 

The well-known sequence to rebuild the array is the following:

1. umount /opt
2. umount /volume1
3. syno_poweroff_task -d
4. mdadm --stop /dev/mdX
5. mdadm -Cf /dev/mdxxxx -e1.2 -n1 -l1 /dev/sdxxxx -u<id number>
6. e2fsck -pvf -C0 /dev/mdxxxx
7. cat /proc/mdstat
8. reboot

 

but my array looks to be correct:

 

cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [raidF1]
md2 : active raid5 sdk5[1] sdh5[7] sdi5[6] sdm5[8] sdn5[4] sdg5[3] sdj5[2]
      27315312192 blocks super 1.2 level 5, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]

md3 : active raid5 sdk6[1] sdm6[7] sdh6[6] sdi6[5] sdn6[4] sdg6[3] sdj6[2]
      6837200384 blocks super 1.2 level 5, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]

md5 : active raid5 sdi8[0] sdh8[1]
      3904788864 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [UU_]

md4 : active raid5 sdn7[0] sdm7[3] sdh7[2] sdi7[1]
      11720987648 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]

md1 : active raid1 sdg2[0] sdh2[5] sdi2[4] sdj2[1] sdk2[2] sdl2[3] sdm2[7] sdn2[6]
      2097088 blocks [14/8] [UUUUUUUU______]

md0 : active raid1 sdg1[0] sdh1[5] sdi1[4] sdj1[1] sdk1[2] sdl1[3] sdm1[6] sdn1[7]
      2490176 blocks [12/8] [UUUUUUUU____]

unused devices: <none>

 

Only the IronWolf disk is considered faulty ... I am not sure rebuilding the array will reset the disk error :-(

 

It is completely crazy: the disks are all normal, the array is fully accessible, but DSM considers one disk faulty and blocks any action (including adding drives, etc.)
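The mdstat output above can be read mechanically: `[8/7]` means 8 members expected, 7 active, and each `_` in the bracket string marks a missing member (in the md0/md1 system arrays, trailing underscores are just empty drive bays, not failures). A sketch that flags degraded data arrays, run here against a captured excerpt of the output above rather than the live `/proc/mdstat`:

```shell
# Captured excerpt (two of the arrays from the post); on the NAS you would
# feed /proc/mdstat to the same awk filter.
mdstat='md2 : active raid5 sdk5[1] sdh5[7] sdi5[6]
      27315312192 blocks super 1.2 level 5, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]
md4 : active raid5 sdn7[0] sdm7[3] sdh7[2] sdi7[1]
      11720987648 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]'
# Remember the array name, then print it when the status brackets contain "_".
echo "$mdstat" | awk '/^md/ {name=$1} $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ {print name, $NF}'
```

Every data array in the paste shows exactly one missing member, which is consistent with flyride's observation that they are all in a critical (single-disk-tolerance exhausted) state.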

 

Thx

 

 

 

 

