XPEnology Community

new sata/ahci cards with more than 4 ports (and no sata port multiplier)


IG-88

Recommended Posts

On 9/10/2023 at 11:30 AM, Peter Suh said:

I already have two ASM1166 M.2 NVMe-to-SATA adapters, so I will test them when I get the chance. It is expected that 10 disks can be installed.

At least 10 - with two 6-port ASM1166 cards it should be 12.

As long as the board/BIOS supports bifurcation, it should work.

But at some point the housing needs to be bigger, and then the board is no longer limited to mini-ITX with its single x16 slot; a bigger ATX/E-ATX board may have enough real PCIe slots, so there is no need to mess around with M.2 adapters.

The other direction is bigger disks - that's what I did a while ago by replacing 12x4TB with 4x16TB (now my 6 onboard SATA ports are more than enough).

Also, power consumption and noise can be factors when thinking about adding or replacing disks.

 

On 9/10/2023 at 11:30 AM, Peter Suh said:

I purchased the 15cm product and am using it on both the 1st main and 2nd backup NAS in my profile without any interference.

 

I bought two different M.2-to-PCIe adapters; the longer one (~25 cm) did not work reliably enough for storage - there were some strange log messages indicating problems. So keeping the adapter as short as possible, and testing it before using it in a production system, is important; there can be quality differences even when ordering the same item.

Edited by IG-88

On 9/10/2023 at 1:48 PM, RedwinX said:

I used the card provided with the Amazon link, and I have 4 HDDs plugged in and detected in DSM.

Would you like me to provide the result of a command (lspci or anything)?

Yes, lspci would clear things up, as the PCI ID should be different for the ASM1162, ASM1164 and ASM1166:

1B21:1166, 1B21:1164, 1B21:1162
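For reference, a quick way to check (a sketch, assuming lspci is available in your shell; the -nn switch prints the numeric IDs):

lspci -nn | grep -i 1b21
# 1b21 is ASMedia's PCI vendor ID; an ASM1166 would show up with [1b21:1166] at the end of the line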

 

Edited by IG-88

  • 3 months later...
On 5/21/2023 at 7:06 PM, IG-88 said:

Some news for M.2 NVMe:

Silverstone has a new 5-port M.2 card, the "ECS07"; as it has 5 ports it will most likely be a JMB585 (costs around $65).

As seen in the picture, it addresses the mechanical problems with a steel casing (it looks like sheet metal taken from an NVMe cooling solution - that might also be a good DIY way to handle it, and easier-to-place cooling can be seen as a bonus).

https://www.silverstonetek.com/en/product/info/expansion-cards/ECS07/

 


 

 

 

I recently got an Amazon discount and purchased this product for slightly less than $65.


I don't know if it's an issue with the JMB585 chipset, but the ECS07 has instability issues when booting.
I bought it trusting the Silverstone brand, but I was a little disappointed.


If I use Quick Boot in the BIOS, the PC will not find any disks attached to this card after a reboot.
So as a workaround I disabled Quick Boot and added a slight boot delay. After that, the problem no longer occurs.
However, it is still hard to fully trust.


14 hours ago, Peter Suh said:

If I use Quick Boot in the BIOS, the PC will not find any disks attached to this card after a reboot.
So as a workaround I disabled Quick Boot and added a slight boot delay. After that, the problem no longer occurs.

Interesting - I would have expected it to be enough for the Linux kernel (and its AHCI driver) to be loaded to find the disks (we don't need them to boot, so there is no need to have them ready that early, since we boot from USB and the kernel is loaded from USB).


4 hours ago, IG-88 said:

Interesting - I would have expected it to be enough for the Linux kernel (and its AHCI driver) to be loaded to find the disks (we don't need them to boot, so there is no need to have them ready that early, since we boot from USB and the kernel is loaded from USB).

 

You're right - I can just wait a little while, and I don't even have to wait that long. It occurs to me that a similar phenomenon may also exist with the ASM1166.

I will test quick boot with the ASM1166 at home after work.


In the case of an HBA (SAS2008) with legacy IT-mode firmware, there was advice from an expert that it is advantageous to change the storage setting in the BIOS from UEFI to legacy, to give it more time to find the disks.
I'm not sure about the exact mechanism, but I think the point is to give the disk controllers some extra time to detect the disks.


On 2/11/2021 at 8:48 PM, IG-88 said:

 

OK, I did some more tests. Nothing in terms of reconnections (which would point to interface/cable/backplane/connectors) - those were still zero - but I did see something "unusual" in the dmesg log, and only for WD disks (I had two 500GB disks, one 2.5" and one 3.5"). Nothing like that with HGST, Samsung, Seagate or a Crucial MX300 SSD.

 

[   98.256360] md: md2: current auto_remap = 0
[   98.256363] md: requested-resync of RAID array md2
[   98.256366] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
[   98.256366] md: using maximum available idle IO bandwidth (but not more than 600000 KB/sec) for requested-resync.
[   98.256370] md: using 128k window, over a total of 483564544k.
[  184.817938] ata5.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[  184.825608] ata5.00: failed command: READ FPDMA QUEUED
[  184.830757] ata5.00: cmd 60/00:00:00:8a:cf/02:00:00:00:00/40 tag 0 ncq 262144 in
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  184.845546] ata5.00: status: { DRDY }
[  184.849222] ata5.00: failed command: READ FPDMA QUEUED
[  184.854373] ata5.00: cmd 60/00:08:00:8c:cf/02:00:00:00:00/40 tag 1 ncq 262144 in
                        res 40/00:00:e0:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  184.869165] ata5.00: status: { DRDY }
[  184.872839] ata5.00: failed command: READ FPDMA QUEUED
[  184.877994] ata5.00: cmd 60/00:10:00:8e:cf/02:00:00:00:00/40 tag 2 ncq 262144 in
                        res 40/00:00:e0:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  184.892784] ata5.00: status: { DRDY }
...
[  185.559602] ata5: hard resetting link
[  186.018820] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  186.022265] ata5.00: configured for UDMA/100
[  186.022286] ata5.00: device reported invalid CHS sector 0
[  186.022331] ata5: EH complete
[  311.788536] ata5.00: exception Emask 0x0 SAct 0x7ffe0003 SErr 0x0 action 0x6 frozen
[  311.796228] ata5.00: failed command: READ FPDMA QUEUED
[  311.801372] ata5.00: cmd 60/e0:00:88:3a:8e/00:00:01:00:00/40 tag 0 ncq 114688 in
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  311.816151] ata5.00: status: { DRDY }
...
[  312.171072] ata5.00: status: { DRDY }
[  312.174841] ata5: hard resetting link
[  312.634480] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  312.637992] ata5.00: configured for UDMA/100
[  312.638002] ata5.00: device reported invalid CHS sector 0
[  312.638034] ata5: EH complete
[  572.892855] ata5.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[  572.900523] ata5.00: failed command: READ FPDMA QUEUED
[  572.905680] ata5.00: cmd 60/00:00:78:0a:ec/02:00:03:00:00/40 tag 0 ncq 262144 in
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  572.920462] ata5.00: status: { DRDY }
...
[  573.630587] ata5.00: status: { DRDY }
[  573.634262] ata5: hard resetting link
[  574.093716] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  574.096662] ata5.00: configured for UDMA/100
[  574.096688] ata5.00: device reported invalid CHS sector 0
[  574.096732] ata5: EH complete
[  668.887853] ata5.00: NCQ disabled due to excessive errors
[  668.887857] ata5.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[  668.895522] ata5.00: failed command: READ FPDMA QUEUED
[  668.900667] ata5.00: cmd 60/00:00:98:67:53/02:00:04:00:00/40 tag 0 ncq 262144 in
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  668.915449] ata5.00: status: { DRDY }
...
[  669.601057] ata5.00: status: { DRDY }
[  669.604730] ata5.00: failed command: READ FPDMA QUEUED
[  669.609879] ata5.00: cmd 60/00:f0:98:65:53/02:00:04:00:00/40 tag 30 ncq 262144 in
                        res 40/00:00:e0:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  669.624748] ata5.00: status: { DRDY }
[  669.628425] ata5: hard resetting link
[  670.087717] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  670.090796] ata5.00: configured for UDMA/100
[  670.090814] ata5.00: device reported invalid CHS sector 0
[  670.090859] ata5: EH complete
[ 6108.391162] md: md2: requested-resync done.
[ 6108.646861] md: md2: current auto_remap = 0

 

I could shift the problem between ports by changing the port, so it is specific to the WD disks.

It happened with both kernels, 3617 and 918+ (3.10.105 and 4.4.59).

As the error points to NCQ, and I found references to it on the internet, I tried to "fix" it by disabling NCQ in the kernel.

i added "libata.force=noncq" to the kernel parameters in grub.cfg, rebooted and did the same procedure as before (with 918+) and i did not see the errors (there will be entry's about not using ncq for every disks, so its good to see that the kernel parameter is used as intended)

In theory it might be possible to disable NCQ only for the disks that really are WD, but that would need intervention later whenever anything changes on the disks.

 

In general there was no problem with the RAIDs I built, even with the NCQ errors, and btrfs had nothing to complain about.

I'd suggest using this when you have WD disks in the system.

 

I'm only using HGST and Seagate on the system with the JMB585, so this was not visible before on my main NAS.

 

 

Recently I got some used WD Red drives (WD20EFRX) and found the same dreaded FPDMA QUEUED errors (in /var/log/messages). I can confirm the drives are healthy from a Victoria scan and a smartctl extended test.

 

After googling around I still can't find a definitive answer. The focus is mostly on WD drives, then other brands, including SSDs and brand new drives. Some of the physical solutions/speculations are almost snake oil, but some point to libata's "somewhat buggy" (don't take my word for it) implementation, as indicated in this openzfs GitHub issue: https://github.com/openzfs/zfs/issues/10094

 

You might want to check whether some md* array recovers each time you power up (with an FPDMA QUEUED error on the last boot). I found out one of the partitions from a WD Red got kicked out when I investigated /var/log/messages: md1 (swap?) kept recovering, but fortunately the storage volume didn't degrade because of it.

 

Of course the easiest way out is to use a SAS HBA whose LSI mpt* driver bypasses libata, which should avoid this. But I use Intel onboard SATA ports and don't want to deal with the extra heat and power of a SAS card. Disabling NCQ from the kernel command line is certainly the nuke-it-all solution, but I have other drives that don't suffer from this problem.

 

So I found out you can "disable" NCQ on individual drive(s) by setting the queue_depth parameter:

 

echo 1 > /sys/block/sdX/device/queue_depth

 

Where sdX is the drive you want to disable NCQ on; the default is 31 and we lower it to 1, which effectively disables NCQ. Keep in mind this is reset on every reboot, so you should schedule a boot-up task with root privileges.

 

E.g. to disable NCQ on drives sdc and sdd, just use a semicolon to separate the commands:

echo 1 > /sys/block/sdc/device/queue_depth ; echo 1 > /sys/block/sdd/device/queue_depth
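A minimal sketch of such a boot-up task (Control Panel > Task Scheduler, triggered task run as root on boot-up), assuming the affected drives are sdc and sdd - adjust the list to your system:

#!/bin/sh
# re-apply the per-drive queue depth after every reboot (the value resets at boot)
for d in sdc sdd; do
    echo 1 > /sys/block/$d/device/queue_depth
done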

 

I take no responsibility for data corruption; use at your own risk. Perform a full backup before committing to anything.

 

I'm not sure whether FPDMA QUEUED errors can cause severe data corruption or not, but better safe than sorry. I already did a load of defrags, data scrubs and SMART tests to check this.

 

Another way is to use hdparm -Q to set it at the drive level, but I don't have any experience with hdparm and some drives might not support it well (i.e. it may not persist after a power cycle). I will keep monitoring /var/log/messages and see if FPDMA QUEUED errors still pop up.

Edited by vbz14216
fix error

  • 2 weeks later...

Reporting back on disabling NCQ to work around the FPDMA QUEUED errors: so far so good.

 

I found a variant of the libata kernel command line that disables NCQ on individual ports:

libata.force=X.00:noncq,Y.00:noncq

 

Where X, Y and so on (separated by commas) are the ports you're disabling; they can be checked with:

dmesg | grep -i ATA
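If it is not obvious from dmesg which drive sits on which port, the sysfs path also shows it (a sketch; sdc is just an example):

readlink -f /sys/block/sdc/device
# the path contains the ata instance, e.g. .../ata5/host4/target4:0:0/4:0:0:0
# that drive would then be covered by libata.force=5.00:noncq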

 

This disables NCQ before any scheduled tasks can run. I prefer this plus writing to queue_depth via a boot-up task just to be safe, although queue_depth should already be 1 if the kernel command line is in place (the task covers cases where you have messed up your system partition or forgot to put the kernel parameter back after rebuilding your loader).

 

I also tested hdparm, but as expected, disabling NCQ via hdparm doesn't persist between reboots, so it's better to disable it via the kernel command line.

 

As for the md* resync I previously mentioned (md1), that seemed to be a quirk of the built-in performance benchmark (I checked performance after disabling NCQ). It's a false alarm, but it's still a good idea to check whether other ATA errors kicked your drive(s) out for unrelated reasons.


  • 2 months later...
Posted (edited)
On 2/25/2023 at 8:52 PM, IG-88 said:

asm1166...

I did see a problem like this with my card (32 ports recognized)

I'm building a test device with 6 onboard SATA ports + 2x 6-port NVMe ASM1166 cards, as a 3622xs+.

It works. With blank values ("SataPortMap" : "" and "DiskIdxMap" : "") DSM 7.2.1 sees drives 1-12 and 39-44.

I tried sata_remap in user_config and on the SATA command line (39\\>13...44\\>18, 39\>13...44\>18, 39>13:13>39...), but in all these cases DSM can see only drive 39.

What is the correct syntax? Or is it useless in DSM 7? How can I remap them and set the other ports as dummies?

Edited by Kanst

Posted (edited)

Now I have added a 3rd ASM1166 card (PCIe x4) and have disks 71-76 in DSM.

As I understand it, without a SATA map configured it works in DT mode, but I have found no search results on the forum about configuring the device tree manually.

How can I remap drives 39-44 to 13-18 and 71-76 to 19-24, and disable all the useless ports, if needed?

  

QD4_1-4,71-76.PNG

Edited by Kanst

On 3/5/2024 at 1:08 AM, Kanst said:

What is the correct syntax?

sata_remap=9>0:0>9

https://xpenology.com/forum/topic/32867-sata-and-sas-config-commands-in-grubcfg-and-what-they-do/

https://gugucomputing.wordpress.com/2018/11/11/experiment-on-sata_args-in-grub-cfg/

 

Did you see this?

https://xpenology.com/forum/topic/52094-how-to-config-sataportmap-sata_remap-and-diskidxmap/

 

Also, a possible solution might be to use just one ASM1166 and place it as the last card - that way the 32 ports are no problem.

For example: 6 x SATA onboard, 5 x SATA with a JMB585, 6 x SATA with the ASM1166.
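For reference, my reading of the linked guide for a layout like that - one digit per controller for SataPortMap, two hex digits per controller for the starting disk index in DiskIdxMap; the values below are illustrative, so double-check them against the guide before using them:

"SataPortMap" : "656",
"DiskIdxMap" : "00060b"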

If needed, another JMB585 or JMB582 card can be placed in the middle to keep the ASM1166 last. The JMB582 is a PCIe x1 card, but sometimes all the good slots are already in use and even a x1 slot can be useful (afair there are even JMB585 cards with PCIe x1, but using too many of their ports might result in some performance degradation).

 

There is also newer firmware from 11/2022 for the ASM1166 (at least newer than the one from Silverstone), but it does not fix the 32-port problem:

https://winraid.level1techs.com/t/latest-firmware-for-asm1064-1166-sata-controllers/98543/18

 


Posted (edited)
22 hours ago, IG-88 said:

there is also a newer firmware from 11/2022 for asm1166

Yes, I tried it on one controller only (the last 6 ports), because I don't understand yet how to back up the stock firmware.

 

22 hours ago, IG-88 said:

Yes, I use all of this in my "main" NAS, but with my 1166 it doesn't work...

I have 10 HDDs for tests: 4 connected to the motherboard SATA ports, 6 to the "last" 1166 (disks 71-76 in device-tree mode), and I created an SHR2 on those 6 (user_config edited for 24 ports).

I tried reconnecting this RAID to the other two 1166 controllers without any problems.
Then I put SataPortMap=6666 and DiskIdxMap=00060c12 into user_config, and after DSM recovery the first 2 of the 6 HDDs were lost.

But the RAID did not degrade!!!

 

Spoiler

QD4_6666-0006012.thumb.PNG.14c3422820386fe23bfa0350eae24e94.PNG

 

I went back by restoring the loader, and DSM recovered to DT mode with a degraded RAID.

 

QD4_6666-backDT.PNG.8b28d1b79102dc87329ab9d3f7066f05.PNG

And DSM offered to repair the RAID only by using another 2 free disks.

Then I tried autoconfiguration (./rploader.sh satamap):

1. 6 ports / 4 drives connected
2. 32 ports / 0 drives connected (-57 -56 are bad ports)
3. 32 ports / 0 drives connected (-25 -24 are bad ports)
4. 32 ports / 6 drives connected (7 8 are bad ports)...

SataPortMap=6888 and DiskIdxMap=00060e16 were set, with "ff" appended for the 32-port mode.

After DSM recovered again (when it found "that I inserted disks from another Syno"),

I found in DSM only 2 of the 1166-connected disks, with "System Partition Damage".

 

PS: sata_remap does not work in any variant.
A loader with sata_remap shows only the first two boot menu items.

Editing grub (pressing "e" in the boot menu) answers that grub has no sata_remap command.

Edited by Kanst

On 3/8/2024 at 10:06 PM, Kanst said:

I tried reconnecting this RAID to the other two 1166 controllers without any problems.
Then I put SataPortMap=6666 and DiskIdxMap=00060c12 into user_config, and after DSM recovery the first 2 of the 6 HDDs were lost.

But the RAID did not degrade!!!

Check what md has to say about that:

cat /proc/mdstat

At least there should be something visible if disks from a RAID are missing.
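Something along these lines should show up for a mirror with a missing member (illustrative fragment - the (F) flag and the [U_] pattern are what to look for):

md2 : active raid1 sda3[0] sdb3[1](F)
      ... blocks super 1.2 [2/1] [U_]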

 

 

On 3/8/2024 at 10:06 PM, Kanst said:

PS: sata_remap does not work in any variant.

My (untested) assumption was that these settings might not work on DT models.

They might use different mechanisms now, like changes in the device tree?

Maybe change to a non-DT model for your install, or, as suggested earlier, change to JMB585/582 cards to get the port count you are aiming for.

You can try to dive deep into DT, Syno's kernel (the kernel source is available), the mods they have done and the shimming in the loader... The less time-consuming, non-developer way is just to circumvent the problem, and using the ASM1166 only as the last controller in the system does that (or not using it at all).

Or, if you have not already bought the disks, just lower the needed port count with bigger disks (I reduced my system from 12 to 5 disks that way).

 

On 3/8/2024 at 10:06 PM, Kanst said:

Editing grub (pressing "e" in the boot menu) answers that grub has no sata_remap command.

That might have been the way with Jun's loader, but the new loaders (RedPill-based) work differently. You would need to edit a config file of the loader for that; the loader now has its own boot and menu system to do it, and it re-writes the config file when saving the loader config. (If you change the resulting config file manually, your changes might get lost when re-running the loader config later - for example when having DT and needing to renew the device tree after changing hardware.)


The ARC wiki says: "SataPortMap & SataRemap only necessary for nonDT Models".

The Linux documentation says that the device tree is part of the Linux kernel (a compiled text file).

As I understand it, to change the device tree I would need to recompile the kernel (the loader's kernel or DSM's kernel?).

At least that's not something I can do yet.
OK, let the drive numbers be 1-12, 39-44 and 71-76 in DT mode; generally they work well.
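For what it's worth, changing a device tree should not require recompiling the kernel itself - the compiled blob (.dtb) can be decompiled, edited and recompiled with the dtc tool. A rough sketch, assuming you have located the model's .dtb used by the loader (file names and paths depend on the loader):

dtc -I dtb -O dts -o model.dts model.dtb    # decompile the blob to editable source
# edit the relevant disk/port entries in model.dts, then:
dtc -I dts -O dtb -o model.dtb model.dts    # compile it back to a blob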


  • 2 weeks later...

It's a miracle. Today, trying ARC loader 24.3.21, I set SataPortMap : "6666" and DiskIdxMap : "00060c12" again, and now it works!!!
(6x SATA on the motherboard + 2x 6-port NVMe ASM1166 + 1x 6-port PCIe x4 ASM1166, as a 3622xs+)


  • 3 weeks later...
Posted (edited)

Well well well, the WD drives mentioned a few months ago are still OK. But 1 of my 2 HGST 7K1000s without noncq, which ran fine for around a year, suddenly got kicked out of its own RAID1 array (within its own storage volume), degrading the volume. During that time I was testing Hyper Backup to another NAS; the backup finished without issues, so I still have a known good backup in case anything goes wrong.

 

dmesg and /var/log/messages listed some (READ) FPDMA QUEUED errors, and the ATA link was reset twice before md decided to kick the partitions out. I powered down the NAS, used another PC to make sure the drive was all good (no bad sectors) and erased the data, preparing for a clean array repair. Before loading the kernel again I added noncq parameters to libata for all remaining SATA ports.

 

I deactivated the drive, then power-cycled so it would be recognized as an unused drive. The array was successfully repaired after some time, followed by a clean BTRFS scrub.

 

Analysis:

This drive simply went "Critical" without any I/O error or bad sector notice from log center.

 

/proc/mdstat showed the drive (sdfX) got kicked out of md0 (system partition) and md3 (my 2nd storage array; mdX will be whichever your array is).

Interestingly enough, md1 (swap) was still going, indicating the disk was still recognized by the system (rather than being a dead drive).

root@NAS:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdd3[0] sdf3[1](F)
      966038208 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sde3[3] sdc3[2]
      1942790208 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdf2[0] sdd2[3] sde2[2] sdc2[1]
      2097088 blocks [12/4] [UUUU________]

md0 : active raid1 sde1[0] sdc1[3] sdd1[2] sdf1[12](F)
      2490176 blocks [12/3] [U_UU________]

 

The other drive of the identical model in the same RAID1 array has different firmware and fortunately didn't suffer from this, preventing a complete volume crash. From reading /var/log/messages I assume md preferred the dropped drive for reading data from the array, which is what caused the drive to get kicked out in the first place:

Spoiler

[986753.706557] ata12.00: exception Emask 0x0 SAct 0x7e SErr 0x0 action 0x6 frozen
[986753.710727] ata12.00: failed command: READ FPDMA QUEUED
[986753.713952] ata12.00: cmd 60/c0:08:b8:13:dc/02:00:50:00:00/40 tag 1 ncq 360448 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986753.722301] ata12.00: status: { DRDY }
[986753.724547] ata12.00: failed command: READ FPDMA QUEUED
[986753.727696] ata12.00: cmd 60/40:10:78:16:dc/05:00:50:00:00/40 tag 2 ncq 688128 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986753.735403] ata12.00: status: { DRDY }
[986753.736874] ata12.00: failed command: READ FPDMA QUEUED
[986753.738907] ata12.00: cmd 60/c0:18:b8:1b:dc/02:00:50:00:00/40 tag 3 ncq 360448 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986753.743065] ata12.00: status: { DRDY }
[986753.744117] ata12.00: failed command: READ FPDMA QUEUED
[986753.745384] ata12.00: cmd 60/40:20:78:1e:dc/05:00:50:00:00/40 tag 4 ncq 688128 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986753.748680] ata12.00: status: { DRDY }
[986753.749443] ata12.00: failed command: READ FPDMA QUEUED
[986753.750440] ata12.00: cmd 60/c0:28:b8:23:dc/02:00:50:00:00/40 tag 5 ncq 360448 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986753.753155] ata12.00: status: { DRDY }
[986753.753865] ata12.00: failed command: READ FPDMA QUEUED
[986753.754728] ata12.00: cmd 60/20:30:60:5d:0c/00:00:49:00:00/40 tag 6 ncq 16384 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986753.757038] ata12.00: status: { DRDY }
[986753.757665] ata12: hard resetting link
[986754.063209] ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[986754.071721] ata12.00: configured for UDMA/133
[986754.074767] ata12: EH complete
[986834.250399] ata12.00: exception Emask 0x0 SAct 0x400000c SErr 0x0 action 0x6 frozen
[986834.255900] ata12.00: failed command: READ FPDMA QUEUED
[986834.259488] ata12.00: cmd 60/08:10:78:c9:23/00:00:00:00:00/40 tag 2 ncq 4096 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986834.268422] ata12.00: status: { DRDY }
[986834.270857] ata12.00: failed command: READ FPDMA QUEUED
[986834.274088] ata12.00: cmd 60/20:18:88:c9:23/00:00:00:00:00/40 tag 3 ncq 16384 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986834.281356] ata12.00: status: { DRDY }
[986834.283040] ata12.00: failed command: READ FPDMA QUEUED
[986834.284756] ata12.00: cmd 60/00:d0:78:7e:dc/08:00:50:00:00/40 tag 26 ncq 1048576 in
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[986834.288914] ata12.00: status: { DRDY }
[986834.289840] ata12: hard resetting link
[986834.595190] ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[986834.603569] ata12.00: configured for UDMA/133
[986834.606420] sd 11:0:0:0: [sdf] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[986834.611444] sd 11:0:0:0: [sdf] tag#4 CDB: opcode=0x2a 2a 00 00 4c 05 00 00 00 08 00
[986834.615933] blk_update_request: I/O error, dev sdf, sector in range 4980736 + 0-2(12)
[986834.620335] write error, md0, sdf1 index [5], sector 4973824 [raid1_end_write_request]
[986834.624921] md_error: sdf1 is being to be set faulty
[986834.628120] raid1: Disk failure on sdf1, disabling device.
                        Operation continuing on 3 devices
[986834.632234] sd 11:0:0:0: [sdf] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[986834.634988] sd 11:0:0:0: [sdf] tag#1 CDB: opcode=0x2a 2a 00 21 44 8c 80 00 00 08 00
[986834.637359] blk_update_request: I/O error, dev sdf, sector in range 558137344 + 0-2(12)
[986834.639445] write error, md3, sdf3 index [5], sector 536898688 [raid1_end_write_request]
[986834.641378] md_error: sdf3 is being to be set faulty
[986834.642649] raid1: Disk failure on sdf3, disabling device.
                        Operation continuing on 1 devices

 

Continuing with a BTRFS warning, followed by md doing its magic and rescheduling (switching over to the redundant copy):

Spoiler

[986834.644773] sd 11:0:0:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[986834.646397] sd 11:0:0:0: [sdf] tag#0 CDB: opcode=0x2a 2a 00 01 46 8c 80 00 00 08 00
[986834.647959] blk_update_request: I/O error, dev sdf, sector in range 21397504 + 0-2(12)
[986834.649364] write error, md3, sdf3 index [5], sector 158848 [raid1_end_write_request]
[986834.650706] sd 11:0:0:0: [sdf] tag#30 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[986834.652158] sd 11:0:0:0: [sdf] tag#30 CDB: opcode=0x2a 2a 00 01 44 8d 00 00 00 08 00
[986834.653447] blk_update_request: I/O error, dev sdf, sector in range 21266432 + 0-2(12)
[986834.654598] blk_update_request: I/O error, dev sdf, sector 21269760
[986834.655496] write error, md3, sdf3 index [5], sector 27904 [raid1_end_write_request]
[986834.656625] sd 11:0:0:0: [sdf] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[986834.657836] sd 11:0:0:0: [sdf] tag#29 CDB: opcode=0x2a 2a 00 00 22 1f e0 00 00 18 00
[986834.658872] blk_update_request: I/O error, dev sdf, sector in range 2232320 + 0-2(12)
[986834.659881] write error, md0, sdf1 index [5], sector 2228192 [raid1_end_write_request]
[986834.660904] write error, md0, sdf1 index [5], sector 2228200 [raid1_end_write_request]
[986834.661987] write error, md0, sdf1 index [5], sector 2228208 [raid1_end_write_request]
[986834.663109] sd 11:0:0:0: [sdf] tag#28 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[986834.663452] BTRFS warning (device dm-4): commit trans:
                total_time: 108717, meta-read[miss/total]:[137/4487], meta-write[count/size]:[25/464 K]
                prepare phase: time: 0, refs[before/process/after]:[0/0/0]
                wait prev trans completed: time: 0
                pre-run delayed item phase: time: 0, inodes/items:[2/2]
                wait join end trans: time: 0
                run data refs for usrquota: time: 0, refs:[0]
                create snpashot: time: 0, inodes/items:[0/0], refs:[0]
                delayed item phase: time: 0, inodes/items:[0/0]
                delayed refs phase: time: 0, refs:[30]
                commit roots phase: time: 0
                writeback phase: time: 108715
[986834.673347] sd 11:0:0:0: [sdf] tag#28 CDB: opcode=0x2a 2a 00 00 4a a0 e8 00 00 08 00
[986834.674272] blk_update_request: I/O error, dev sdf, sector in range 4890624 + 0-2(12)
[986834.675173] write error, md0, sdf1 index [5], sector 4882664 [raid1_end_write_request]
[986834.676109] sd 11:0:0:0: [sdf] tag#27 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[986834.677076] sd 11:0:0:0: [sdf] tag#27 CDB: opcode=0x2a 2a 00 00 34 9c 88 00 00 08 00
[986834.677970] blk_update_request: I/O error, dev sdf, sector in range 3444736 + 0-2(12)
[986834.678871] write error, md0, sdf1 index [5], sector 3439752 [raid1_end_write_request]
[986834.679800] sd 11:0:0:0: [sdf] tag#25 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x06
[986834.680771] sd 11:0:0:0: [sdf] tag#25 CDB: opcode=0x28 28 00 50 dc 7b b8 00 02 c0 00
[986834.681701] blk_update_request: I/O error, dev sdf, sector in range 1356623872 + 0-2(12)
[986834.682641] md/raid1:md3: sdf3: rescheduling sector 1335382968
[986834.683365] ata12: EH complete

[986834.692871] RAID1 conf printout:
[986834.693267]  --- wd:3 rd:12
[986834.693597]  disk 0, wo:0, o:1, dev:sde1
[986834.694057]  disk 1, wo:1, o:0, dev:sdf1
[986834.694496]  disk 2, wo:0, o:1, dev:sdd1
[986834.694936]  disk 3, wo:0, o:1, dev:sdc1
[986834.700036] RAID1 conf printout:
[986834.700428]  --- wd:3 rd:12
[986834.700755]  disk 0, wo:0, o:1, dev:sde1
[986834.701290]  disk 2, wo:0, o:1, dev:sdd1
[986834.701773]  disk 3, wo:0, o:1, dev:sdc1
[986842.735082] md/raid1:md3: redirecting sector 1335382968 to other mirror: sdd3
[986842.736900] RAID1 conf printout:
[986842.737795]  --- wd:1 rd:2
[986842.738506]  disk 0, wo:0, o:1, dev:sdd3
[986842.739414]  disk 1, wo:1, o:0, dev:sdf3
[986842.746025] RAID1 conf printout:
[986842.746663]  --- wd:1 rd:2
[986842.747216]  disk 0, wo:0, o:1, dev:sdd3

 

There's no concrete evidence on what combination of hardware, software and firmware can cause this so there isn't much point in collecting setup data.

 

Boot parameter for disabling NCQ on selected ports, or simply libata.force=noncq to get rid of NCQ on all ports:

libata.force=X.00:noncq,Y.00:noncq

 

dmesg will then say "NCQ (not used)" instead of "NCQ (depth XX)".
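Two quick ways to confirm the parameter took effect after a reboot (sdc as an example):

dmesg | grep -i ncq                      # forced ports should show "NCQ (not used)"
cat /sys/block/sdc/device/queue_depth    # 1 when NCQ is effectively off, 31 otherwise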

 

This mitigates the libata NCQ quirks at the cost of some multitasking performance. I have only tried RAID 1 with HDDs; I'm not sure whether this affects RAID 5/6/10 or Hybrid RAID performance.

 

NCQ bugs can also happen on SSDs with queued TRIM, which is reported as problematic on some SATA SSDs. A Bugzilla thread on NCQ bugs affecting several Samsung SATA SSDs:

https://bugzilla.kernel.org/show_bug.cgi?id=201693

 

It's known that NCQ has some effect on TRIM (for SATA SSDs). That's why libata also has noncqtrim for SSDs, which doesn't disable NCQ as a whole but only the queued TRIM. Do note that writing 1 to /sys/block/sdX/device/queue_depth may not be the solution for mitigating the queued TRIM bug on SSDs, as someone in that thread stated it didn't work for them until the noncq boot parameter was used. (I suppose noncqtrim should do the trick; that was the libata quirk applied for those drives.)
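The per-port form is analogous to noncq (a sketch; X is the ata port of the SSD, and the option needs a kernel that knows noncqtrim):

libata.force=X.00:noncqtrim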

 

Since this option doesn't seem to cause data integrity issues, maybe it could be added as an advanced debug option for the loader assistant/shell? I don't know which developer to tag. I suspect this is one of the overlooked causes of random array degrades/crashes, besides bad cables/backplanes.

 

For researchers: drivers/ata/libata-core.c is the place to look. There are some (old) drives with mitigation quirks applied there.

 

Still...

I take no responsibility for data corruption; use at your own risk. Perform a full backup before committing to anything.

Edited by vbz14216
fix error
