NAS542 RAID 5 volume down


All Replies

  • BjoWis Posts: 33  Freshman Member
    Ok, thanks for your input - quite a bit to take in :)
    If you have pulled all disks except the 'ddrescued one', you can run 'mdadm --examine /dev/sd[abcd]3' on it, to see if it is indeed the problem one.
    I removed all disks except the ddrescued one (Disk3);
    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sdc3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 3e046eeb:3bac7e38:2c4e2408:d05d7a1d

        Update Time : Thu Nov 10 16:51:14 2022
           Checksum : 6084a10 - correct
             Events : 1148

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 3
       Array State : .A.A ('A' == active, '.' == missing)
    ~ #

    But it says "Device Role : Active device 3"

    you wrote:

    In that case its 'Device Role' is 'Active device 2'. In that case this shouldn't have happened.

    So now I don't know if that disk is the problem or not?



  • Mijzelf Posts: 2,625  Guru Member
    If all the info you have given me is correct, the 'ddrescued one' is not the problem:
    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.10 16:46:58 =~=~=~=~=~=~=~=~=~=~=~=
    <snip>
    ~ # mdadm --examine /dev/sd[abcd]3
    <snip>
    /dev/sdc3:
    <snip>
       Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
    <snip>
       Device Role : Active device 2
       Array State : AAAA ('A' == active, '.' == missing)

    Here mdadm reads the RAID header from the disk. Each header is unique, so there is only one disk in the world with that 'Array UUID' and that 'Device Role'. This disk is the one which was dropped; the 'Array State' tells that. (It shows a healthy array, as this header was no longer updated after the drop, and dmesg told the same.)
    The dump in your post is from another disk. It's also called sdc, but unfortunately (in this case) that doesn't mean much. On boot every disk gets a device name: the first one found gets sda, the second sdb, and so on. The order of discovery determines the device name, and that order can change. For that reason you have to look at the content (or the serial number) to identify a disk.
    In this case I'm surprised that the only disk in the box showed up as sdc, as I expected sda. Probably you have two USB sticks and/or SD cards? Or did you hotpull the disks?
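    For example, the serial number can be read over SMART (a sketch; the device name here is just an example):
    smartctl -i /dev/sda | grep -i serial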

    For this reason it's important to look at the 'Device Role' of every disk before executing 'mdadm --create', as that command takes the volatile device names as input and generates new headers, overwriting the old roles.
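    A quick way to collect that for all disks at once (a sketch, assuming all four disks are inserted):
    for d in /dev/sd[abcd]3; do echo "== $d =="; mdadm --examine "$d" | grep -E 'Device Role|Device UUID'; done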

  • BjoWis Posts: 33  Freshman Member
    Hmm... I hotpulled the disks :s maybe not the smartest way? - Sorry!

    After rebooting the NAS (with only the ddrescued disk 3 inserted) I get;
    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.11 11:46:11 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ su
    Password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 3e046eeb:3bac7e38:2c4e2408:d05d7a1d

        Update Time : Thu Nov 10 16:51:14 2022
           Checksum : 6084a10 - correct
             Events : 1148

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 3
       Array State : .A.A ('A' == active, '.' == missing)
    mdadm: cannot open /dev/sdb3: No such device or address
    mdadm: cannot open /dev/sdc3: No such device or address
    mdadm: cannot open /dev/sdd3: No such device or address
    ~ #

    Rebooting again with only disk 2 inserted;
    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.11 11:54:50 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ su
    Password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : f7c083fd:fe37f383:55424937:52ec4bd2

        Update Time : Thu Nov 10 16:51:14 2022
           Checksum : 4788153a - correct
             Events : 1148

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 1
       Array State : .AAA ('A' == active, '.' == missing)
    mdadm: cannot open /dev/sdb3: No such device or address
    mdadm: cannot open /dev/sdc3: No such device or address
    mdadm: cannot open /dev/sdd3: No such device or address
    ~ #
    Rebooting again with only disk 4 inserted
    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.11 12:01:35 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ su
    Password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 803b4d20:57570076:2d7c62d1:e4bf567a

        Update Time : Thu Nov 10 09:48:48 2022
           Checksum : 9e815985 - correct
             Events : 1148

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 2
       Array State : AAAA ('A' == active, '.' == missing)
    mdadm: cannot open /dev/sdb3: No such device or address
    mdadm: cannot open /dev/sdc3: No such device or address
    mdadm: cannot open /dev/sdd3: No such device or address
    ~ #
    Disk 2 is device 1
    Disk 3 is device 3
    Disk 4 is device 2


  • Mijzelf Posts: 2,625  Guru Member
    Hmm... I hotpulled the disks :s maybe not the smartest way? - Sorry!

    No problem. Both the disks and the NAS support hotplugging and -pulling. It's just that the device node names get a bit unpredictable.

    Disk 4 is device 2
    So that is the one which was dropped. The big question now is: is that disk also dying, or was this a sector which hadn't been accessed for a long time (maybe ever) and happened to have an invalid checksum? The 'Current_Pending_Sector' value in SMART can tell whether the second option is possible.
    smartctl -a /dev/sda
    (If you didn't change the disks meanwhile.)
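    If you only want the interesting counters, something like this will do (a sketch):
    smartctl -a /dev/sda | grep -E 'Reallocated|Pending|Uncorrect'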
  • BjoWis Posts: 33  Freshman Member
    Ok, here's the result;

    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.12 15:24:32 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ su
    Password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # smartctl -a /dev/sda
    smartctl 6.3 2014-07-26 r3976 [armv7l-linux-3.2.54] (local build)
    Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.14 (AF)
    Device Model:     ST3000DM001-1CH166
    Serial Number:    Z1F4GQ1T
    LU WWN Device Id: 5 000c50 065c152e8
    Firmware Version: CC27
    User Capacity:    3,000,592,982,016 bytes [3.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Sat Nov 12 15:26:32 2022 GMT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    See vendor-specific Attribute list for marginal Attributes.

    General SMART Values:
    Offline data collection status:  (0x00)    Offline data collection activity
                        was never started.
                        Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0)    The previous self-test routine completed
                        without error or no self-test has ever
                        been run.
    Total time to complete Offline
    data collection:         (  584) seconds.
    Offline data collection
    capabilities:              (0x73) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        No Offline surface scan supported.
                        Self-test supported.
                        Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003)    Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01)    Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine
    recommended polling time:      (   1) minutes.
    Extended self-test routine
    recommended polling time:      ( 339) minutes.
    Conveyance self-test routine
    recommended polling time:      (   2) minutes.
    SCT capabilities:            (0x3085)    SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   110   099   006    Pre-fail  Always       -       188790814
      3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9763
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       23202616
      9 Power_On_Hours          0x0032   030   030   000    Old_age   Always       -       61924
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       123
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   064   064   000    Old_age   Always       -       36
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
    189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
    190 Airflow_Temperature_Cel 0x0022   071   043   045    Old_age   Always   In_the_past 29 (Min/Max 26/47 #276)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       73
    193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       246965
    194 Temperature_Celsius     0x0022   029   057   000    Old_age   Always       -       29 (0 10 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       24
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       7390h+46m+24.741s
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4025112595
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       21008283939

    SMART Error Log Version: 1
    ATA Error Count: 36 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 36 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 78 ff ff ff 4f 00      23:13:11.442  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00      23:13:11.440  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00      23:13:11.440  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00      23:13:11.439  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00      23:13:11.439  READ FPDMA QUEUED

    Error 35 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 00 ff ff ff 4f 00      23:13:07.654  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:07.654  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:07.639  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:07.638  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:07.638  READ FPDMA QUEUED

    Error 34 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 00 ff ff ff 4f 00      23:13:03.937  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:03.936  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:03.919  READ FPDMA QUEUED
      60 00 70 ff ff ff 4f 00      23:13:03.918  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:03.918  READ FPDMA QUEUED

    Error 33 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00      23:13:00.189  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:00.189  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:00.189  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:00.188  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:00.188  READ FPDMA QUEUED

    Error 32 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 38 ff ff ff 4f 00      23:12:56.453  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:12:56.441  READ FPDMA QUEUED
      60 00 70 ff ff ff 4f 00      23:12:56.441  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:12:56.441  READ FPDMA QUEUED
      60 00 20 ff ff ff 4f 00      23:12:56.441  READ FPDMA QUEUED

    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

    ~ #




  • Mijzelf Posts: 2,625  Guru Member
    I wouldn't classify myself as a SMART expert, but this disk looks healthy to me. Not a single Reallocated_Sector_Ct, which is good. 24 Current_Pending_Sector, which could have caused the drop. The Raw_Read_Error_Rate and Seek_Error_Rate seem a bit high, but to be honest I don't know what sane numbers would be for a 7-year-old disk with 21008283939 Total_LBAs_Read.
    You can re-create the array as you did before. As 'Active device 0' isn't there, the command is
    mdadm --create <arguments>  /dev/md2 missing Disk2_3 Disk4_3 Disk3_3
    where you have to find the actual device names by reading the output of 'mdadm --examine /dev/sd[abcd]3'. The <arguments> are at the beginning of this thread.
    After that, first run e2fsck on /dev/md2 before rebooting or mounting the array.
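    Pieced together from the --examine output earlier in this thread, the whole sequence would look something like this (a sketch only: the device names sdX3/sdY3/sdZ3 are placeholders, and the parameters are read from the headers shown above — verify both before running anything):
    mdadm --stop /dev/md2
    mdadm --create /dev/md2 --level=5 --raid-devices=4 --metadata=1.2 --chunk=64K --layout=left-symmetric missing /dev/sdX3 /dev/sdY3 /dev/sdZ3
    e2fsck -f /dev/md2
    If e2fsck doesn't find a filesystem, stop and re-check the device order before writing anything else.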
    When the array is up again, you can't add the 4th disk directly: the same Current_Pending_Sectors will stop the rebuild. You have the option to create a backup first; then you can fill all unused sectors with zeros, as described in the thread I pointed to, to reset the Current_Pending_Sectors. It took 2 hours for 500GB, so it will take around 20 hours for your 4.5TB of free space. You can run it in screen, of course.
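    The zero-filling itself is in principle just writing a file full of zeros until the volume is full, and deleting it again (a sketch; '/path/to/volume' stands for wherever the data volume is mounted on your box):
    dd if=/dev/zero of=/path/to/volume/zerofill bs=1M
    rm /path/to/volume/zerofill
    dd will exit with 'No space left on device'; that is expected, as the point is to overwrite every free sector.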
    After that, Current_Pending_Sector should be zero (check the other original disk too). If not, there are one or more unreadable sectors in the data, and I think that will stop the rebuild again.
  • BjoWis Posts: 33  Freshman Member
    edited November 2022
    Hmm, a bit weird perhaps, but when I inserted the other two disks and rebooted the NAS, the volume was again identified as 'degraded', and the box started beeping. When I opened the Storage Manager, the web interface showed a degraded volume, just as it did last time, before I tried to repair the RAID with my latest disk.

    So this time I've started a backup job of my degraded Volume. It's currently being backed up to an external 5 TB USB-drive connected to the front USB port.

    Let's see if that goes well, and if all necessary data is successfully backed up, my plan is to create a new volume from scratch, instead of trying to repair the degraded one.

    And then restore the data back to the NAS from my backup drive.


    Would that work or is there something I'm missing here?
  • BjoWis Posts: 33  Freshman Member
    The backup job was completed last night at 2022-11-15 04:40.

    And when browsing my external USB disk I find 589 .dar files and a single .lst file.

  • Mijzelf Posts: 2,625  Guru Member
    Check that you can read the backups before destroying the volume. You wouldn't be the first one to find out he can't read the product of the BackupPlanner. Link.
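    For .dar archives, dar itself can test the slices, if you have it available on some machine (a sketch; 'backup' stands for the actual slice basename, i.e. the filename part before '.1.dar'):
    dar -t /mnt/usb/backup
    That reads all slices and verifies the embedded checksums.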
