How do I recover a volume after the repair process silently failed?
All Replies
-
Maybe. Did you reboot already? You can check if the new headers describe the same array type, offset, and blocksize as before:

mdadm --examine /dev/sd[abcd]3

Further, don't trust the firmware. Have a look at the kernel's view of the array:

cat /proc/mdstat
-
Oops! My apologies, Mijzelf. Following the reboot, here are the mdadm outputs. I'm partially back up with the degraded volume.
~ # mdadm --examine /dev/sda3
/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
           Name : NAS540:2  (local to host NAS540)
  Creation Time : Tue Dec  1 11:31:20 2020
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
  Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 5b44e4bd:e37142b2:23d26a4d:cb281462
    Update Time : Wed Dec  2 13:18:45 2020
       Checksum : 61fc7575 - correct
         Events : 122
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 0
    Array State : A.AA ('A' == active, '.' == missing)
~ # mdadm --examine /dev/sdb3
/dev/sdb3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : ad82b6f7:6aacc5f3:c7a86a8b:25240df4
           Name : NAS540:2  (local to host NAS540)
  Creation Time : Thu Jul 27 14:12:32 2017
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
  Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 0a5c35b6:3bd8a182:5030b8be:51bbe238
    Update Time : Thu Oct 22 22:21:39 2020
       Checksum : 77ae1fd - correct
         Events : 47
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 1
    Array State : AAAA ('A' == active, '.' == missing)
~ # mdadm --examine /dev/sdc3
/dev/sdc3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
           Name : NAS540:2  (local to host NAS540)
  Creation Time : Tue Dec  1 11:31:20 2020
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
  Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : ebcb668f:f6c07008:9efd8d2f:ad7314ad
    Update Time : Wed Dec  2 13:18:45 2020
       Checksum : b6d0c276 - correct
         Events : 122
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 2
    Array State : A.AA ('A' == active, '.' == missing)
~ # mdadm --examine /dev/sdd3
/dev/sdd3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
           Name : NAS540:2  (local to host NAS540)
  Creation Time : Tue Dec  1 11:31:20 2020
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
  Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 460883bd:f423662f:25ca9304:9fe9a52e
    Update Time : Wed Dec  2 13:18:50 2020
       Checksum : 6279a450 - correct
         Events : 124
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 3
    Array State : A.AA ('A' == active, '.' == missing)
and

~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : inactive sdb3[1](S)
      3902886912 blocks super 1.2
md2 : active raid5 sda3[0] sdd3[3] sdc3[2]
      11708660160 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
md1 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[4]
      1998784 blocks super 1.2 [4/4] [UUUU]
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[4]
      1997760 blocks super 1.2 [4/4] [UUUU]
unused devices: <none>
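A quick way to see from the listings above why sdb3 ended up as an inactive spare is to compare the Events counters: sdb3 stopped at 47 (and its Update Time and Array UUID belong to the old array from October), while the other three members are at 122-124. A minimal sketch over the values as posted:

```shell
# Events counters copied from the --examine output above;
# a member far behind the others is stale and gets kicked out
printf '%s\n' 'sda3 122' 'sdb3 47' 'sdc3 122' 'sdd3 124' |
awk '{ print $1, ($2 >= 122 ? "current" : "stale") }'
```

This prints `sdb3 stale` while the other three members show as current, matching the `[U_UU]` state in /proc/mdstat.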
It looks like the missing drive is the one I'm still hoping to recover; it shows `Lost` (the amelia-1 share).
Anything left to try?
I really appreciate your effort.
-
The headers look good and the array is up, and seeing that 'Lost' changed to 'Disabled', I think you only have to enter the Shares menu to enable them. As far as the firmware knows, you put in 3 disks containing a new volume, so the 'old' shares are no longer available.
-
I've been able to get a decent amount off the drive once I enabled them. (It's very, very slow and still obviously beeping.) After this completes, are there any diagnostics I can run to fully assess each disk and make sure all are in working order before resetting it?
-
It's very very slow and still obviously beeping.
About beeping
buzzerc -s && mv /sbin/buzzerc /sbin/buzzerc.old
will stop the buzzer and remove the possibility for the firmware to start it again, until the next reboot.

About slow: it shouldn't be much slower than before. The array is degraded, which means one out of three blocks has to be recalculated from the parity, but the NAS can do that at 1 GB/sec, so it's hardly noticeable. The box should do 75-100 MB/sec for big files. (Which, of course, is still 37 hours for 10 TB.)
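The 37-hour figure follows directly from the quoted lower bound; a quick sanity check of the arithmetic, assuming 10 TB is roughly 10,000,000 MB:

```shell
# Time to move 10 TB at the degraded array's ~75 MB/s lower bound
secs=$(( 10 * 1000 * 1000 / 75 ))   # 10,000,000 MB / 75 MB/s = ~133,333 s
echo "about $(( secs / 3600 )) hours"
```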
After this completes, are there any diagnostics I can run to fully assess each disk

That's complicated. The SMART values of the disks can tell if the disks themselves 'feel healthy'.
It is possible that SMART says the disks are completely healthy, while a disk will still be dropped if you try to add a 4th disk. The reason is aging of the data. A modern hard disk has a very high data density. A single bit is a few square nm, and so only a few dozen magnetic atoms. Ideally those atoms are all oriented the same way, so a clear 0 or 1 can be read. But due to thermal noise, over time some atoms can lose their orientation, blurring the signal. At some moment it's no longer possible to tell if it's a 0 or a 1. Because of this the sector has some extra bits, to be able to restore a few unreadable bits, but sometimes that is not enough, and the sector is unreadable. The disk will try several times to read the sector, because the positioning of the head is not 100% reproducible, so a new read will pull in some other, maybe readable, atoms. Finally the disk will report an I/O error: the sector is not readable. The raid manager will drop the disk. But this disk is perfectly healthy. One sector is not readable, but you can simply write new data to it. If only you knew what to write.
The solution is to 'resilver' the disk, which means reading each sector and writing it back. This way all atoms are oriented again, and ready for years. (It is possible that the sector which caused your problem hasn't been written to since the factory. If you succeed in copying all data, you have proved the problem sector is not in use by the filesystem.) Modern filesystems like ZFS and Btrfs have built-in functions for this, but unfortunately the software raid used here hasn't, AFAIK.

For really unusable sectors the disk has a number of spare sectors, which can replace them. In the SMART values there is an entry for that: "Reallocated Sectors Count". The raw value is the number of replaced sectors; the percentage is the amount of spare sectors left.

To find out if a disk is still trustworthy, you should make a note of the raw value and percentage, overwrite the complete disk, and look if the values didn't change much. If not, all sectors are readable again (you just re-oriented all atoms) and there was not a significant number of hard-failing sectors. Unfortunately this will kill your data. I'm not aware of any way to resilver the data on the disk reliably. A naive way is

dd if=/dev/md2 of=/dev/md2 bs=16M

This will copy all data from md2 to md2 in blocks of 16M. That should resilver the whole surface, but unfortunately it will stop at the first read error. And it's dangerous to do while the filesystem is mounted, as you could overwrite pending changes with older data.

dd if=/dev/zero of=/dev/md2 bs=16M

will write zeros to md2. It will also overwrite pending changes, but as the filesystem is destroyed anyway, that doesn't matter. But make sure to read the SMART data before and after.
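For the before/after bookkeeping, the attribute appears in `smartctl -A` output as one row per attribute, where column 4 is the normalized (percentage-style) value and the last column is the raw count. A minimal parsing sketch over a made-up attribute line (on the NAS you would read the real one with e.g. `smartctl -A /dev/sda`):

```shell
# Hypothetical Reallocated_Sector_Ct row in the smartctl -A attribute table;
# the field layout matches smartctl, the values here are examples only
line='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0'
echo "$line" | awk '{print "normalized=" $4, "raw=" $NF}'
```

A raw count that stays near its pre-overwrite value (and a normalized value well above the threshold) after the full-disk write is the "didn't change much" signal described above.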