Posts posted by flyride

  1. 3 hours ago, C-Fu said:

    I just changed from a 750W PSU to a 1600W PSU that's fairly new (only a few days' use max), so I don't believe the PSU is the problem.

    When I get back on monday, I'll see if I can replace the whole system (I have a few motherboards unused) and cables and whatnot and reuse the SAS card if that's not likely the issue, and maybe reinstall Xpenology. Would that be a good idea? 

     

    If all your problems started after that power supply replacement, that further points to power stability as the culprit. You seem reluctant to believe that a new power supply can be a problem (it can). For what it's worth, 13 drives x 5W is only about 65W, so total draw shouldn't be a factor.

     

    In any debugging and recovery operation, the objective should be to manage the change rate and therefore risk. Replacing the whole system would violate that strategy.

    • Do the drive connectivity failures implicate a SAS card problem?  Maybe, but a much more plausible explanation is physical connectivity or power.  If you have an identical SAS card, and it is passive (no intrinsic configuration required), replacing it is a low risk troubleshooting strategy.
    • Do failures implicate the motherboard? Maybe, if you are using on-board SATA ports, but the same plausibility test applies. However, there is more variability and risk (mobo model, BIOS settings, etc).
    • Do failures implicate DSM or loader stability?  Not at all; DSM boots fine and is not crashing.  And if you reinstall DSM, it's very likely your arrays will be destructively reconfigured.  Please don't do this.

    So I'll stand by (and extend) my previous statement - if this were my system, I would replace the power supply and cables first. If that doesn't solve things, then maybe the SAS card, and lastly the motherboard.

  2. Your drives have reordered yet again. I know IG-88 said your controller deliberately presents them contiguously (which is problematic in itself), but if all the drives are up and stable, I cannot see why that behavior would cause a reorder on reboot. I remain very wary of the consistency of your hardware.

     

    Look through dmesg and see if you have any hardware problems since your power cycle boot.

    Run another hotswap query and see if any drives have changed state since your power cycle boot.

    Run another mdstat - is it still slow?
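     
    If it helps, those three checks roughly correspond to the commands below (the disk.log query is the same one used elsewhere in this thread; adjust the dmesg tail length as needed):
     
    # dmesg | tail -n 100
    # fgrep "hotswap" /var/log/disk.log
    # cat /proc/mdstat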

     

  3. To compare, we would have to run

     

    fgrep "hotswap" /var/log/disk.log

     

    not "hotplug"

     

    Looking at your pastebin, it appears that the only drive to hotplug out/in is sda, which doesn't affect us. But why that happened is still concerning.

     

    Please run it again (with hotswap) and make sure there are no array drives changing after 2020-01-21T05:58:20+08:00
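     
    One way to spot-check that, assuming disk.log lines begin with the same ISO 8601 timestamps quoted above (so a plain string comparison works), is:
     
    # fgrep "hotswap" /var/log/disk.log | awk '$1 > "2020-01-21T05:58:20+08:00"'
     
    If nothing prints, no drives have changed state since then.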

  4. I was referring to the array table I made not matching up to one of the mdstats.  This appears to suggest continuing hotplug operations (drives going up and down) at the moment you keyed the first array creation command, which might have made the first array test invalid.

     

    [attached screenshot comparing the array table with the mdstat output]

     

    Continuing our current trajectory: First, repeat fgrep "hotplug" /var/log/disk.log and make sure there are no new entries in the log since the report above. If there are new entries, your system still isn't stable.  If it's the same as before, move forward with:

     

    # mdadm -Cf /dev/md2 -e1.2 -n13 -l5 --verbose --assume-clean /dev/sd[bcdefpqlmn]5 missing /dev/sdo5 /dev/sdk5  -u43699871:217306be:dc16f5e8:dcbe1b0d

    # cat /proc/mdstat
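     
    As an optional, read-only sanity check after the create, the detail output should show 13 raid devices with one slot missing and the UUID given above:
     
    # mdadm --detail /dev/md2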

  5. Sorry, but please run fdisk -l as individual commands, one per drive.

     

    # fdisk -l /dev/sdb

    # fdisk -l /dev/sdc

    # fdisk -l /dev/sdd

    # fdisk -l /dev/sde

    # fdisk -l /dev/sdf

    # fdisk -l /dev/sdk

    # fdisk -l /dev/sdl

    # fdisk -l /dev/sdm

    # fdisk -l /dev/sdn

    # fdisk -l /dev/sdo

    # fdisk -l /dev/sdp

    # fdisk -l /dev/sdq
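     
    (If typing those out one by one is tedious, a simple shell loop over the same drive letters produces the same individual outputs; this is purely an optional shortcut.)
     
    # for d in b c d e f k l m n o p q; do fdisk -l /dev/sd${d}; done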

     

    Also:

     

    # fgrep "hotswap" /var/log/disk.log

    # date

     

  6. 2 minutes ago, C-Fu said:
    9 minutes ago, flyride said:

    Then if at all possible, disconnect the "old" 10TB drive and replace it with the other "new" 10TB without powering down the system.  Please install the "new" drive to the same port.  If you have to power down and power up the system, please post that you did.

     

    So just to be clear, you want me to take out the power and sata cable from the current 10TB, and put them into the "bad" 10TB?

    I don't have to power down btw.

     

    Yes, exactly that.

  7. Unfortunately this is the same result.  We moved the 10TB from slot 10 to slot 11, and the array is still not readable.

     

    24 minutes ago, C-Fu said:

    Would reinserting the "bad" 10TB and switching it with the current 10TB, or adding it to the array, help?

     

    We don't want to add to the array.  But replacing the 10TB drive is the next (and probably last) step.  This goes back to this conversation here:

    "now you have determined that the "good" 10TB drive is failed and removed it from the system, which leaves the "bad" superblock-zeroed 10TB drive as the only path to recover /dev/md2.  It is important to be correct about that, as it is possible that the "bad" 10TB drive has corrupted data on it due to the failed resync"

     

    So here's how this needs to go.

     

    # vgchange -an

    # mdadm --examine /dev/sd?5

    # mdadm --stop /dev/md2

     

    Post the result of the examine.
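     
    (Alongside the full output, a condensed summary like the one below can make comparing the drives quicker; the field names assume the usual mdadm v1.2 --examine output.)
     
    # mdadm --examine /dev/sd?5 | egrep "^/dev|Device Role|Array State|Events"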

     

    Then if at all possible, disconnect the "old" 10TB drive and replace it with the other "new" 10TB without powering down the system.  Please install the "new" drive to the same port.  If you have to power down and power up the system, please post that you did.

     

    Assuming the drive is stable and not going up and down, then:

     

    # mdadm --examine /dev/sd?5

    # cat /proc/mdstat

     

    Post the result of both commands.
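     
    (A quick way to confirm the replacement drive is stable before running those two commands is to watch the hotswap log for a few minutes and make sure no new events appear for it:)
     
    # fgrep "hotswap" /var/log/disk.log | tail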

  8. OK, nothing new got added to dmesg. Because /dev/md2 is first in the vg, this tells us definitively that /dev/md2 is not good.

     

    We are starting to run low on options. If for some reason we have the position of the 10TB drive wrong (slot 10), we can try the other spot (slot 11). A more likely scenario is that the drive's data is corrupt and we are out of luck for this drive. But let's test the wrong-slot possibility:

     

    # vgchange -an

    # mdadm --stop /dev/md2
    # mdadm -Cf /dev/md2 -e1.2 -n13 -l5 --verbose --assume-clean /dev/sd[bcdefpqlmn]5 missing /dev/sdo5 /dev/sdk5 -u43699871:217306be:dc16f5e8:dcbe1b0d  
    # vgchange -ay
    # mount -t btrfs -o ro,nologreplay /dev/vg1/volume_1 /volume1

     

    Post results of all commands.
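     
    (If the mount succeeds this time, a quick read-only check such as the two commands below will show whether the volume contents are visible; nothing here writes to the volume.)
     
    # df -h /volume1
    # ls /volume1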
