NAS540 Shows Healthy but RAID degraded.

  • jahmon
    jahmon Posts: 15  Freshman Member
    There's a lot of info here. You are correct, my config is sda and sdd. I may be missing something, but this looks like it passes the self-test. Your help is appreciated!

    Here is the log from one drive only, as both together are too long.

    ~ # smartctl -a /dev/sdd
    smartctl 6.3 2014-07-26 r3976 [armv7l-linux-3.2.54] (local build)
    Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Hitachi Ultrastar A7K2000
    Device Model:     Hitachi HUA722020ALA331
    Serial Number:    YBK0JV2F
    LU WWN Device Id: 5 000cca 221ea85bb
    Firmware Version: JKAOA3NH
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 2.6, 3.0 Gb/s
    Local Time is:    Thu Dec  8 15:36:06 2022 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: FAILED!
    Drive failure expected in less than 24 hours. SAVE ALL DATA.
    See vendor-specific Attribute list for failed Attributes.

    General SMART Values:
    Offline data collection status:  (0x84) Offline data collection activity
                                            was suspended by an interrupting command from host.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (22624) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 377) minutes.
    SCT capabilities:              (0x003d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   083   083   016    Pre-fail  Always       -       131506
      2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       112
      3 Spin_Up_Time            0x0007   116   116   024    Pre-fail  Always       -       620 (Average 620)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2108
      5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1058
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   123   123   020    Pre-fail  Offline      -       34
      9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       6202
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       39
    192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       2110
    193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       2110
    194 Temperature_Celsius     0x0002   120   120   000    Old_age   Always       -       50 (Min/Max 21/61)
    196 Reallocated_Event_Count 0x0032   048   048   000    Old_age   Always       -       1127
    197 Current_Pending_Sector  0x0022   044   044   000    Old_age   Always       -       1201
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

    SMART Error Log Version: 1
    ATA Error Count: 19 (device log contains only the most recent five errors)
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 19 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 2c 8c bc 68 03  Error: UNC 44 sectors at LBA = 0x0368bc8c = 57195660

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 38 80 bc 68 e0 08  14d+16:29:55.130  READ DMA EXT
      27 00 00 00 00 00 e0 08  14d+16:29:55.084  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 08  14d+16:29:55.076  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 08  14d+16:29:55.075  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 e0 08  14d+16:29:55.074  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    Error 18 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 38 80 bc 68 03  Error: UNC 56 sectors at LBA = 0x0368bc80 = 57195648

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 38 80 bc 68 e0 08  14d+16:29:38.602  READ DMA EXT
      27 00 00 00 00 00 e0 08  14d+16:29:38.555  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 08  14d+16:29:38.547  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 08  14d+16:29:38.546  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 e0 08  14d+16:29:38.545  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    Error 17 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 38 80 bc 68 03  Error: UNC 56 sectors at LBA = 0x0368bc80 = 57195648

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 38 80 bc 68 e0 08  14d+16:29:22.061  READ DMA EXT
      27 00 00 00 00 00 e0 08  14d+16:29:21.261  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 08  14d+16:29:21.253  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 08  14d+16:29:21.252  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 e0 08  14d+16:29:21.251  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    Error 16 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 3f c1 bc 68 03  Error: UNC 63 sectors at LBA = 0x0368bcc1 = 57195713

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 48 b8 bc 68 e0 08  14d+16:28:37.109  READ DMA EXT
      27 00 00 00 00 00 e0 08  14d+16:28:37.063  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 08  14d+16:28:37.055  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 08  14d+16:28:37.054  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 e0 08  14d+16:28:37.053  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    Error 15 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 45 bb bc 68 03  Error: UNC 69 sectors at LBA = 0x0368bcbb = 57195707

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 48 b8 bc 68 e0 08  14d+16:28:12.391  READ DMA EXT
      27 00 00 00 00 00 e0 08  14d+16:28:12.344  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
      ec 00 00 00 00 00 a0 08  14d+16:28:12.336  IDENTIFY DEVICE
      ef 03 46 00 00 00 a0 08  14d+16:28:12.335  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 e0 08  14d+16:28:12.334  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%     58737         -
    # 2  Short offline       Completed without error       00%     57416         -
    # 3  Short offline       Completed without error       00%     57414         -
    # 4  Short offline       Completed without error       00%     53020         -
    # 5  Short offline       Completed without error       00%     53017         -
    # 6  Short offline       Completed without error       00%     53013         -
    # 7  Short offline       Completed without error       00%     53011         -
    # 8  Short offline       Completed without error       00%     53009         -
    # 9  Short offline       Completed without error       00%     45977         -
    #10  Short offline       Completed without error       00%     40631         -
    #11  Short offline       Completed without error       00%     38232         -
    #12  Short offline       Completed without error       00%     38228         -
    #13  Short offline       Completed without error       00%     38226         -
    #14  Short offline       Completed without error       00%     37252         -

    SMART Selective self-test log data structure revision number 1
    SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

  • Mijzelf
    Mijzelf Posts: 2,613  Guru Member
    Well, that is clear:

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: FAILED!
    Drive failure expected in less than 24 hours. SAVE ALL DATA.
    <snip>
      5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1058

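    So the problem is the Reallocated_Sector_Ct: its normalized value (001) has dropped below the threshold (005). I have no idea why that looks different in the GUI. The disk's error log holds the five most recent of its 19 errors, all at 5921 power-on hours (almost 300 hours ago, as the disk is now at 6202 hours), at sectors 57195660, 57195648 (twice), 57195713 and 57195707. So that 24 hours is a bit exaggerated. The disk reports 512-byte sectors, so those sectors are around 27 GiB from the start of the disk (with 4k sectors it would be around 218 GiB). Either way that should be well inside the data partition.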
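
The timing and offset estimates can be checked with a little arithmetic. Below is a minimal Python sketch that uses only numbers copied from the smartctl output in this thread. The helper name `lba_to_gib` is made up for illustration. Because the drive reports 512-byte logical sectors, the sketch prints both the 512-byte result and the 4k-sector result for comparison.

```python
# Values copied from the smartctl output posted in this thread.
LOGICAL_SECTOR_BYTES = 512      # "Sector Size: 512 bytes logical/physical"
POWER_ON_HOURS_NOW = 6202       # raw value of attribute 9 (Power_On_Hours)
ERROR_POWER_ON_HOURS = 5921     # "Error 19 occurred at disk power-on lifetime"
ERROR_LBAS = [57195660, 57195648, 57195648, 57195713, 57195707]  # errors 19..15


def lba_to_gib(lba: int, sector_bytes: int = LOGICAL_SECTOR_BYTES) -> float:
    """Return the byte offset of an LBA from the start of the disk, in GiB."""
    return lba * sector_bytes / 2**30


hours_since_errors = POWER_ON_HOURS_NOW - ERROR_POWER_ON_HOURS
distinct_lbas = sorted(set(ERROR_LBAS))

print(f"errors were logged {hours_since_errors} power-on hours ago")
for lba in distinct_lbas:
    print(
        f"LBA {lba}: {lba_to_gib(lba):.1f} GiB with 512-byte sectors, "
        f"{lba_to_gib(lba, 4096):.1f} GiB if sectors were 4 KiB"
    )
```

Where the data partition actually starts is not shown in this thread, so comparing these offsets with the partition table (for example from `cat /proc/partitions`) is left as a manual step.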

  • jahmon
    jahmon Posts: 15  Freshman Member
    So I looked through it more carefully. Right at the beginning of the "SMART Data Section" there is a line "SMART overall-health self-assessment test result:". Drive A shows "Passed", Drive D shows "Failed". As D is the RAID 5 parity drive, this explains why the array is degraded, why I can still access the data, and why the rebuild fails while processing. Have I got it?
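
For comparing drives without reading the whole dump, the two relevant pieces (the overall-health line and any attribute whose normalized VALUE is at or below its THRESH) can be pulled out of saved `smartctl -a` text. This is a rough Python sketch, not a replacement for smartctl's own verdict. The function names are made up for illustration, and the sample text is an excerpt of the output posted in this thread.

```python
import re

# Excerpt copied from the smartctl output posted in this thread.
SAMPLE = """\
SMART overall-health self-assessment test result: FAILED!
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1058
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       6202
197 Current_Pending_Sector  0x0022   044   044   000    Old_age   Always       -       1201
"""

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw.
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.+)$"
)


def overall_health(text):
    """Return the word after 'self-assessment test result:', or None if absent."""
    match = re.search(r"self-assessment test result:\s*(\w+)", text)
    return match.group(1) if match else None


def failing_attributes(text):
    """List (id, name, value, thresh) for rows whose VALUE is at or below THRESH."""
    failing = []
    for line in text.splitlines():
        match = ATTR_ROW.match(line)
        if not match:
            continue  # header lines and anything that is not an attribute row
        attr_id, name, value, _worst, thresh = match.group(1, 2, 3, 4, 5)
        # A threshold of 0 means the attribute can never trip.
        if int(thresh) > 0 and int(value) <= int(thresh):
            failing.append((int(attr_id), name, int(value), int(thresh)))
    return failing


print(overall_health(SAMPLE))
print(failing_attributes(SAMPLE))
```

Running the same two functions on the saved output for sda and sdd would show the difference between the drives at a glance.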

      ef 03 46 00 00 00 a0 08  14d+16:29:55.075  SET FEATURES [Set transfer mode]

      27 00 00 00 00 00 e0 08  14d+16:29:55.074  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

     

    Error 18 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)

      When the command that caused the error occurred, the device was active or idle.

     

      After command completion occurred, registers were:

      ER ST SC SN CL CH DH

      -- -- -- -- -- -- --

      40 51 38 80 bc 68 03  Error: UNC 56 sectors at LBA = 0x0368bc80 = 57195648

     

      Commands leading to the command that caused the error were:

      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

      -- -- -- -- -- -- -- --  ----------------  --------------------

      25 00 38 80 bc 68 e0 08  14d+16:29:38.602  READ DMA EXT

      27 00 00 00 00 00 e0 08  14d+16:29:38.555  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

      ec 00 00 00 00 00 a0 08  14d+16:29:38.547  IDENTIFY DEVICE

      ef 03 46 00 00 00 a0 08  14d+16:29:38.546  SET FEATURES [Set transfer mode]

      27 00 00 00 00 00 e0 08  14d+16:29:38.545  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

     

    Error 17 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)

      When the command that caused the error occurred, the device was active or idle.

     

      After command completion occurred, registers were:

      ER ST SC SN CL CH DH

      -- -- -- -- -- -- --

      40 51 38 80 bc 68 03  Error: UNC 56 sectors at LBA = 0x0368bc80 = 57195648

     

      Commands leading to the command that caused the error were:

      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

      -- -- -- -- -- -- -- --  ----------------  --------------------

      25 00 38 80 bc 68 e0 08  14d+16:29:22.061  READ DMA EXT

      27 00 00 00 00 00 e0 08  14d+16:29:21.261  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

      ec 00 00 00 00 00 a0 08  14d+16:29:21.253  IDENTIFY DEVICE

      ef 03 46 00 00 00 a0 08  14d+16:29:21.252  SET FEATURES [Set transfer mode]

      27 00 00 00 00 00 e0 08  14d+16:29:21.251  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

     

    Error 16 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)

      When the command that caused the error occurred, the device was active or idle.

     

      After command completion occurred, registers were:

      ER ST SC SN CL CH DH

      -- -- -- -- -- -- --

      40 51 3f c1 bc 68 03  Error: UNC 63 sectors at LBA = 0x0368bcc1 = 57195713

     

      Commands leading to the command that caused the error were:

      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

      -- -- -- -- -- -- -- --  ----------------  --------------------

      25 00 48 b8 bc 68 e0 08  14d+16:28:37.109  READ DMA EXT

      27 00 00 00 00 00 e0 08  14d+16:28:37.063  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

      ec 00 00 00 00 00 a0 08  14d+16:28:37.055  IDENTIFY DEVICE

      ef 03 46 00 00 00 a0 08  14d+16:28:37.054  SET FEATURES [Set transfer mode]

      27 00 00 00 00 00 e0 08  14d+16:28:37.053  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

     

    Error 15 occurred at disk power-on lifetime: 5921 hours (246 days + 17 hours)

      When the command that caused the error occurred, the device was active or idle.

     

      After command completion occurred, registers were:

      ER ST SC SN CL CH DH

      -- -- -- -- -- -- --

      40 51 45 bb bc 68 03  Error: UNC 69 sectors at LBA = 0x0368bcbb = 57195707

     

      Commands leading to the command that caused the error were:

      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name

      -- -- -- -- -- -- -- --  ----------------  --------------------

      25 00 48 b8 bc 68 e0 08  14d+16:28:12.391  READ DMA EXT

      27 00 00 00 00 00 e0 08  14d+16:28:12.344  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

      ec 00 00 00 00 00 a0 08  14d+16:28:12.336  IDENTIFY DEVICE

      ef 03 46 00 00 00 a0 08  14d+16:28:12.335  SET FEATURES [Set transfer mode]

      27 00 00 00 00 00 e0 08  14d+16:28:12.334  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

     

    SMART Self-test log structure revision number 1

    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

    # 1  Short offline       Completed without error       00%     58737         -

    # 2  Short offline       Completed without error       00%     57416         -

    # 3  Short offline       Completed without error       00%     57414         -

    # 4  Short offline       Completed without error       00%     53020         -

    # 5  Short offline       Completed without error       00%     53017         -

    # 6  Short offline       Completed without error       00%     53013         -

    # 7  Short offline       Completed without error       00%     53011         -

    # 8  Short offline       Completed without error       00%     53009         -

    # 9  Short offline       Completed without error       00%     45977         -

    #10  Short offline       Completed without error       00%     40631         -

    #11  Short offline       Completed without error       00%     38232         -

    #12  Short offline       Completed without error       00%     38228         -

    #13  Short offline       Completed without error       00%     38226         -

    #14  Short offline       Completed without error       00%     37252         -

     

    SMART Selective self-test log data structure revision number 1

    SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

        1        0        0  Not_testing

        2        0        0  Not_testing

        3        0        0  Not_testing

        4        0        0  Not_testing

        5        0        0  Not_testing

    Selective self-test flags (0x0):

      After scanning selected spans, do NOT read-scan remainder of disk.

    If Selective self-test is pending on power-up, resume after 0 minute delay.

  • Mijzelf
    Mijzelf Posts: 2,613  Guru Member
    First Anniversary 10 Comments Friend Collector First Answer
    edited December 2022 Answer ✓
    jahmon said:
    So I looked through it more carefully, right in the beginning of the "SMART Data Section" there is a line "SMART overall-health self-assessment test result:"  Drive A shows "Passed", Drive D shows "Failed".  As D is the RAID 5 parity drive, this explains why it's degraded, but I can still access the data and the rebuild fails while processing.  Have I got it?  
    More or less. There is no dedicated parity drive in RAID5; the parity blocks are distributed evenly over all disks. This is done to maximize the read speed (on a healthy raid array the parity blocks are not used for reading, so reserving a whole disk for parity would waste that disk and its bandwidth) and to minimize the penalty when a random disk fails.
    The raid manager is pretty dumb. When rebuilding the array it simply calculates the content of the 'new' disk from the total surface of the 3 others (the raid manager doesn't know about filesystems, and so doesn't know whether a particular sector is in use), and writes that to the disk. When a write error occurs the new disk is dropped, and the rebuild fails. And worse, if a read error occurs on one of the other disks, that disk is dropped, bringing the array down.
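    To make the above concrete, here is a toy Python sketch of a 4-disk RAID5: parity rotates over the disks, and a rebuild XORs every stripe of the 3 surviving disks, used or not. This is only an illustration, not how the NAS firmware or the Linux md driver is implemented. The function names, the `ReadError` class and the parity rotation order are all made up for the example.

```python
from functools import reduce


class ReadError(Exception):
    """Hypothetical error: a surviving disk returned an unreadable block."""


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe, n_disks):
    """Disk index holding the parity block of a stripe (assumed rotation order)."""
    return (n_disks - 1 - stripe) % n_disks


def build_array(data_blocks, n_disks):
    """Spread data blocks over n_disks, adding one rotating parity block per stripe.

    len(data_blocks) must be a multiple of n_disks - 1.
    """
    per_stripe = n_disks - 1
    disks = [[] for _ in range(n_disks)]
    for stripe, start in enumerate(range(0, len(data_blocks), per_stripe)):
        chunk = data_blocks[start:start + per_stripe]
        parity = xor_blocks(chunk)
        data_iter = iter(chunk)
        for d in range(n_disks):
            disks[d].append(parity if d == parity_disk(stripe, n_disks) else next(data_iter))
    return disks


def rebuild(disks, missing, bad_sectors=frozenset()):
    """Recompute every block of disk `missing` from all other disks.

    Every stripe is processed, because this layer knows nothing about files.
    bad_sectors is a set of (disk, stripe) pairs that fail when read; one
    such read aborts the whole rebuild.
    """
    rebuilt = []
    for stripe in range(len(disks[0])):
        survivors = []
        for d, disk in enumerate(disks):
            if d == missing:
                continue
            if (d, stripe) in bad_sectors:
                raise ReadError(f"disk {d} unreadable at stripe {stripe}")
            survivors.append(disk[stripe])
        rebuilt.append(xor_blocks(survivors))
    return rebuilt
```

    With 12 data blocks on 4 disks this gives 4 stripes, each with its parity on a different disk. Rebuilding any single disk from the other 3 gives back its original content, but one unreadable block on a surviving disk stops the rebuild, which matches the behaviour described above.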

  • jahmon
    jahmon Posts: 15  Freshman Member
    First Anniversary 10 Comments
    Thank you for the top notch support and patience!
