NAS542 RAID 5 volume down


All Replies

  • BjoWis Posts: 33  Freshman Member
    Ok, thanks for your input - quite a bit to take in :)
    If you have pulled all disks except the 'ddrescued one', you can run 'mdadm --examine /dev/sd[abcd]3' on it, to see if it is indeed the problem one.
    I removed all disks except the ddrescued one (Disk3);
    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sdc3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 3e046eeb:3bac7e38:2c4e2408:d05d7a1d

        Update Time : Thu Nov 10 16:51:14 2022
           Checksum : 6084a10 - correct
             Events : 1148

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 3
       Array State : .A.A ('A' == active, '.' == missing)
    ~ #

    But it says "Device Role : Active device 3"

    you wrote:

    In that case its 'Device Role' is 'Active device 2'. In that case this shouldn't have happened.

    So now I don't know if that disk is the problem or not?



  • Mijzelf Posts: 2,625  Guru Member
    If all the info you have given me is correct, the 'ddrescued one' is not the problem:
    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.10 16:46:58 =~=~=~=~=~=~=~=~=~=~=~=
    <snip>
    ~ # mdadm --examine /dev/sd[abcd]3
    <snip>
    /dev/sdc3:
    <snip>
       Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
    <snip>
       Device Role : Active device 2
       Array State : AAAA ('A' == active, '.' == missing)

    Here mdadm reads the RAID header from the disk. Each header is unique, so there is only one disk in the world with that 'Array UUID' and that 'Device Role'. This disk is the one which was dropped; the 'Array State' tells that. (It shows a healthy array, as this header was no longer updated after the drop, and dmesg told the same.)
    The dump in your post is from another disk. It's also called sdc, but unfortunately (in this case) that doesn't mean much. On boot every disk gets a device name: the first one found gets sda, the second sdb, and so on. The order of discovery determines the device name, and that order can change. For that reason you have to look at the content (or the serial number) to identify a disk.
    In this case I'm surprised that the only disk in the box showed up as sdc, as I expected sda. Probably you have two USB sticks and/or SD cards? Or did you hotpull the disks?
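    For example, the serial number can be read over SMART (a sketch; the device name here is just an example):
    smartctl -i /dev/sda | grep -i serial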

    For this reason it's important to look at the 'Device Role' of every disk before executing 'mdadm --create', as that command takes the volatile device names as input and generates new headers, overwriting the old roles.
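    A quick way to collect that for all disks at once (a sketch, assuming all four disks are inserted):
    for d in /dev/sd[abcd]3; do echo "== $d =="; mdadm --examine "$d" | grep -E 'Device Role|Device UUID'; done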

  • BjoWis Posts: 33  Freshman Member
    Hmm... I hotpulled the disks :s maybe not the smartest way? - Sorry!

    After rebooting the NAS (with only the ddrescued disk 3 inserted) I get;
    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.11 11:46:11 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ su
    Password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 3e046eeb:3bac7e38:2c4e2408:d05d7a1d

        Update Time : Thu Nov 10 16:51:14 2022
           Checksum : 6084a10 - correct
             Events : 1148

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 3
       Array State : .A.A ('A' == active, '.' == missing)
    mdadm: cannot open /dev/sdb3: No such device or address
    mdadm: cannot open /dev/sdc3: No such device or address
    mdadm: cannot open /dev/sdd3: No such device or address
    ~ #

    Rebooting again with only disk 2 inserted;
    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.11 11:54:50 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ su
    Password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : f7c083fd:fe37f383:55424937:52ec4bd2

        Update Time : Thu Nov 10 16:51:14 2022
           Checksum : 4788153a - correct
             Events : 1148

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 1
       Array State : .AAA ('A' == active, '.' == missing)
    mdadm: cannot open /dev/sdb3: No such device or address
    mdadm: cannot open /dev/sdc3: No such device or address
    mdadm: cannot open /dev/sdd3: No such device or address
    ~ #
    Rebooting again with only disk 4 inserted
    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.11 12:01:35 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ su
    Password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 803b4d20:57570076:2d7c62d1:e4bf567a

        Update Time : Thu Nov 10 09:48:48 2022
           Checksum : 9e815985 - correct
             Events : 1148

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 2
       Array State : AAAA ('A' == active, '.' == missing)
    mdadm: cannot open /dev/sdb3: No such device or address
    mdadm: cannot open /dev/sdc3: No such device or address
    mdadm: cannot open /dev/sdd3: No such device or address
    ~ #
    Disk 2 is device 1
    Disk 3 is device 3
    Disk 4 is device 2


  • Mijzelf Posts: 2,625  Guru Member
    Hmm... I hotpulled the disks :s maybe not the smartest way? - Sorry!

    No problem. Both the disks and the NAS support hotplugging and -pulling. It's just that the device node names get a bit unpredictable.

    Disk 4 is device 2
    So that is the one which was dropped. The big question now is: is that disk also dying, or was this a sector which hadn't been accessed for a long time (maybe ever) and happened to have an invalid checksum? The 'Current_Pending_Sector' value in SMART can tell whether the second option is possible.
    smartctl -a /dev/sda
    (If you didn't change the disks meanwhile.)
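    If you only want the interesting counters, something like this will do (a sketch):
    smartctl -a /dev/sda | grep -E 'Reallocated|Pending|Uncorrect'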
  • BjoWis Posts: 33  Freshman Member
    Ok, here's the result;

    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.12 15:24:32 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ su
    Password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # smartctl -a /dev/sda
    smartctl 6.3 2014-07-26 r3976 [armv7l-linux-3.2.54] (local build)
    Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.14 (AF)
    Device Model:     ST3000DM001-1CH166
    Serial Number:    Z1F4GQ1T
    LU WWN Device Id: 5 000c50 065c152e8
    Firmware Version: CC27
    User Capacity:    3,000,592,982,016 bytes [3.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Sat Nov 12 15:26:32 2022 GMT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    See vendor-specific Attribute list for marginal Attributes.

    General SMART Values:
    Offline data collection status:  (0x00)    Offline data collection activity
                        was never started.
                        Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0)    The previous self-test routine completed
                        without error or no self-test has ever
                        been run.
    Total time to complete Offline
    data collection:         (  584) seconds.
    Offline data collection
    capabilities:              (0x73) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        No Offline surface scan supported.
                        Self-test supported.
                        Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003)    Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01)    Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine
    recommended polling time:      (   1) minutes.
    Extended self-test routine
    recommended polling time:      ( 339) minutes.
    Conveyance self-test routine
    recommended polling time:      (   2) minutes.
    SCT capabilities:            (0x3085)    SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   110   099   006    Pre-fail  Always       -       188790814
      3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   091   091   020    Old_age   Always       -       9763
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       23202616
      9 Power_On_Hours          0x0032   030   030   000    Old_age   Always       -       61924
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       123
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   064   064   000    Old_age   Always       -       36
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
    189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
    190 Airflow_Temperature_Cel 0x0022   071   043   045    Old_age   Always   In_the_past 29 (Min/Max 26/47 #276)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       73
    193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       246965
    194 Temperature_Celsius     0x0022   029   057   000    Old_age   Always       -       29 (0 10 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       24
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       7390h+46m+24.741s
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4025112595
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       21008283939

    SMART Error Log Version: 1
    ATA Error Count: 36 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 36 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 78 ff ff ff 4f 00      23:13:11.442  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00      23:13:11.440  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00      23:13:11.440  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00      23:13:11.439  READ FPDMA QUEUED
      60 00 00 ff ff ff 4f 00      23:13:11.439  READ FPDMA QUEUED

    Error 35 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 00 ff ff ff 4f 00      23:13:07.654  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:07.654  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:07.639  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:07.638  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:07.638  READ FPDMA QUEUED

    Error 34 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 00 ff ff ff 4f 00      23:13:03.937  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:03.936  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:03.919  READ FPDMA QUEUED
      60 00 70 ff ff ff 4f 00      23:13:03.918  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:03.918  READ FPDMA QUEUED

    Error 33 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 08 ff ff ff 4f 00      23:13:00.189  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:00.189  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:00.189  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:00.188  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:13:00.188  READ FPDMA QUEUED

    Error 32 occurred at disk power-on lifetime: 61890 hours (2578 days + 18 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 00 38 ff ff ff 4f 00      23:12:56.453  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:12:56.441  READ FPDMA QUEUED
      60 00 70 ff ff ff 4f 00      23:12:56.441  READ FPDMA QUEUED
      60 00 08 ff ff ff 4f 00      23:12:56.441  READ FPDMA QUEUED
      60 00 20 ff ff ff 4f 00      23:12:56.441  READ FPDMA QUEUED

    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

    ~ #




  • Mijzelf Posts: 2,625  Guru Member
    I wouldn't classify myself as a SMART expert, but this disk looks healthy to me. Not a single Reallocated_Sector_Ct, which is good. 24 Current_Pending_Sector, which could have caused the drop. The Raw_Read_Error_Rate and Seek_Error_Rate seem a bit high, but to be honest I don't know what sane numbers would be for a 7-year-old disk with 21008283939 Total_LBAs_Read.
    You can re-create the array as you did before. As 'Active device 0' isn't there, the command is
    mdadm --create <arguments>  /dev/md2 missing Disk2_3 Disk4_3 Disk3_3
    where you have to find the actual device names by reading the output of 'mdadm --examine /dev/sd[abcd]3'. The <arguments> are at the beginning of this thread.
    After that, first run e2fsck on /dev/md2 before rebooting or mounting the array.
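    Pieced together from the --examine output earlier in this thread, the whole sequence would look something like this (a sketch only: the device names sdX3/sdY3/sdZ3 are placeholders, and the parameters are read from the headers shown above — verify both before running anything):
    mdadm --stop /dev/md2
    mdadm --create /dev/md2 --level=5 --raid-devices=4 --metadata=1.2 --chunk=64K --layout=left-symmetric missing /dev/sdX3 /dev/sdY3 /dev/sdZ3
    e2fsck -f /dev/md2
    If e2fsck doesn't find a filesystem, stop and re-check the device order before writing anything else.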
    When the array is up again, you can't add the 4th disk directly: the same Current_Pending_Sectors will stop the rebuild. You have the option to create a backup first; then you can fill all unused sectors with zeros, as described in the thread I pointed to, to reset the Current_Pending_Sectors. It took 2 hours for 500GB, so it will take around 20 hours for your 4.5TB of free space. You can run it in screen, of course.
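    The zero-filling itself is in principle just writing a file full of zeros until the volume is full, and deleting it again (a sketch; '/path/to/volume' stands for wherever the data volume is mounted on your box):
    dd if=/dev/zero of=/path/to/volume/zerofill bs=1M
    rm /path/to/volume/zerofill
    dd will exit with 'No space left on device'; that is expected, as the point is to overwrite every free sector.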
    After that, Current_Pending_Sector should be zero (check the other original disk too). If not, there are one or more unreadable sectors in the data, and I think that will stop the rebuild again.
  • BjoWis Posts: 33  Freshman Member
    edited November 2022
    Hmm, a bit weird perhaps, but when I inserted the other two disks and rebooted the NAS, the volume was again identified as 'degraded', and the box started beeping. When I opened the Storage Manager, the web interface showed a degraded volume, just as it did last time, before I tried to repair the RAID with my latest disk.

    So this time I've started a backup job of my degraded Volume. It's currently being backed up to an external 5 TB USB-drive connected to the front USB port.

    Let's see if that goes well, and if all necessary data is successfully backed up, my plan is to create a new volume from scratch, instead of trying to repair the degraded one.

    And then restore the data back to the NAS from my backup drive.


    Would that work or is there something I'm missing here?
  • BjoWis Posts: 33  Freshman Member
    The backup job was completed last night at 2022-11-15 04:40.

    And when browsing my external USB disk I find 589 .dar files and a single .lst file.

  • Mijzelf Posts: 2,625  Guru Member
    Check that you can read the backups before destroying the volume. You wouldn't be the first one to find out he can't read the product of the BackupPlanner. Link.
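    For .dar archives, dar itself can test the slices, if you have it available on some machine (a sketch; 'backup' stands for the actual slice basename, i.e. the filename part before '.1.dar'):
    dar -t /mnt/usb/backup
    That reads all slices and verifies the embedded checksums.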
