NAS542 RAID Recovery Loop

bradcornford Posts: 2
edited April 2021 in Personal Cloud Storage
My NAS542, running a RAID 5 array of four 4TB drives, went into degraded mode. I then shut down the NAS and replaced the disk reporting issues. After that I started the NAS and began the recovery via the web interface. However, it keeps failing at around 70% recovery: the unit beeps constantly, the web interface gives a 500 Internal Server Error, and the NAS mount becomes unreadable.

After a subsequent restart, the unit then re-tries recovery, and the loop happens again.

Any help in fixing this issue would be appreciated.

Here's the output from /proc/mdstat during recovery:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md2 : active raid5 sda3[0] sdb3[5] sdd3[3] sdc3[4]
      11708660736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
      [>....................]  recovery =  0.3% (13922936/3902886912) finish=2237.5min speed=28966K/sec
      
md1 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[4]
      1998784 blocks super 1.2 [4/4] [UUUU]
      
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[4]
      1997760 blocks super 1.2 [4/4] [UUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
Here's the output from /proc/mdstat when the unit beeps and the web ui errors:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md2 : active raid5 sda3[0] sdb3[5](S) sdd3[3] sdc3[4](F)
      11708660736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/2] [U__U]
      
md1 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[4]
      1998784 blocks super 1.2 [4/4] [UUUU]
      
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[4]
      1997760 blocks super 1.2 [4/4] [UUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
Here's the output from mdadm --examine /dev/sd[abcd]3 during recovery:
/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b5920e6d:0a82a92c:f28da4e8:ae1074be
           Name : NAS542:2
  Creation Time : Tue May  1 17:36:23 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660736 (11166.25 GiB 11989.67 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : baf3f0ff:f8cf59e3:8f9c725e:c9d3433e

    Update Time : Fri Apr 23 09:17:20 2021
       Checksum : 2f473016 - correct
         Events : 138038

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sdb3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x2
     Array UUID : b5920e6d:0a82a92c:f28da4e8:ae1074be
           Name : NAS542:2
  Creation Time : Tue May  1 17:36:23 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660736 (11166.25 GiB 11989.67 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
Recovery Offset : 0 sectors
          State : clean
    Device UUID : de124198:9c286ee7:774140a3:fad7e15b

    Update Time : Fri Apr 23 09:17:20 2021
       Checksum : 2e1750fe - correct
         Events : 138038

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 1
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sdc3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b5920e6d:0a82a92c:f28da4e8:ae1074be
           Name : NAS542:2
  Creation Time : Tue May  1 17:36:23 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660736 (11166.25 GiB 11989.67 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 43040323:9f443330:80a71650:861cf46b

    Update Time : Fri Apr 23 09:17:20 2021
       Checksum : be871176 - correct
         Events : 138038

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sdd3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b5920e6d:0a82a92c:f28da4e8:ae1074be
           Name : NAS542:2
  Creation Time : Tue May  1 17:36:23 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660736 (11166.25 GiB 11989.67 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 8c3bec27:8c41d54b:00be3528:e441668f

    Update Time : Fri Apr 23 09:17:20 2021
       Checksum : daa37909 - correct
         Events : 138038

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 3
   Array State : AAAA ('A' == active, '.' == missing)


Here's the output from mdadm --examine /dev/sd[abcd]3 when the unit beeps and the web ui errors:
/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b5920e6d:0a82a92c:f28da4e8:ae1074be
           Name : NAS542:2
  Creation Time : Tue May  1 16:36:23 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660736 (11166.25 GiB 11989.67 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : baf3f0ff:f8cf59e3:8f9c725e:c9d3433e

    Update Time : Fri Apr 23 07:27:07 2021
       Checksum : 2f4623f7 - correct
         Events : 137951

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : A..A ('A' == active, '.' == missing)
/dev/sdb3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b5920e6d:0a82a92c:f28da4e8:ae1074be
           Name : NAS542:2
  Creation Time : Tue May  1 16:36:23 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660736 (11166.25 GiB 11989.67 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : de124198:9c286ee7:774140a3:fad7e15b

    Update Time : Fri Apr 23 07:27:07 2021
       Checksum : 2e1644dd - correct
         Events : 137951

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : spare
   Array State : A..A ('A' == active, '.' == missing)
/dev/sdc3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b5920e6d:0a82a92c:f28da4e8:ae1074be
           Name : NAS542:2
  Creation Time : Tue May  1 16:36:23 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660736 (11166.25 GiB 11989.67 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 43040323:9f443330:80a71650:861cf46b

    Update Time : Fri Apr 23 03:30:57 2021
       Checksum : be86cdf1 - correct
         Events : 137936

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sdd3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b5920e6d:0a82a92c:f28da4e8:ae1074be
           Name : NAS542:2
  Creation Time : Tue May  1 16:36:23 2018
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660736 (11166.25 GiB 11989.67 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 8c3bec27:8c41d54b:00be3528:e441668f

    Update Time : Fri Apr 23 07:27:07 2021
       Checksum : daa26cea - correct
         Events : 137951

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 3
   Array State : A..A ('A' == active, '.' == missing)




All Replies

  • Mijzelf
    Mijzelf Posts: 2,600  Guru Member
    It's amazing that this is reproducible. I think your sdc disk has a hardware problem around the 70% mark. That can be in an unused part of the disk, which has to be read anyway to rebuild the redundancy.

    It should not be reproducible. First one of the disks failed, leaving you with a degraded array. Then you supplied a new disk, and the array started recovering. One of the old disks failed at 70%, the array went down, and that was written to the headers.
    So after a reboot, the headers still say the array is down, and it shouldn't start recovering. If you look at the states of the disks in your last mdadm dump, you get:
     
    partition  stamp                     role             state
    sda3       Fri Apr 23 07:27:07 2021  Active device 0  A..A
    sdb3       Fri Apr 23 07:27:07 2021  spare            A..A
    sdc3       Fri Apr 23 03:30:57 2021  Active device 2  AAAA
    sdd3       Fri Apr 23 07:27:07 2021  Active device 3  A..A

    Sdc is the failing disk. You can see that it has a diverging timestamp. The array manager didn't touch this disk after it revealed its problem. Sdb is the new disk. As it was not yet completely filled, it cannot be part of the array. Yet there is nothing wrong with it, and it's assigned to be part of the array, so it's a spare now. And its header reflects the state of the array.
    On boot the array manager should ignore sdc3, as it's out of sync according to its header. The other three members agree about the state of the array.
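    The divergence is easy to see in the Events counter that mdadm stores in every member's superblock; on a live system you'd check it with `mdadm --examine /dev/sd[abcd]3 | grep -E '/dev/|Events'`. A minimal sketch, using the counter values captured in the dump above:

    ```shell
    #!/bin/sh
    # Each md member records an Events counter in its superblock; members
    # whose counter lags behind the others are out of sync and should not
    # be re-added as healthy. Sample values taken from the dump above.
    examine_output='/dev/sda3: Events : 137951
    /dev/sdb3: Events : 137951
    /dev/sdc3: Events : 137936
    /dev/sdd3: Events : 137951'

    # Print each member with its event count; the odd one out (sdc3,
    # 15 events behind) is the member md should refuse to trust.
    echo "$examine_output" | awk -F' : ' '{print $1, $2}'
    ```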

    So how to proceed? If your rebuild/fail loop is reproducible, I guess it will boot with a proper degraded array when you pull sdb. In that case you can back up your data. If the hardware problem on sdc is in an unused part of the disk, you may be able to complete the backup without problems.
    If that is not an option, you can try to create a bitwise copy of sdc using ddrescue or something like that. After that you'll have a healthy disk with one or more 'holes' filled with zeroes on the surface. Files which contain such a hole will be corrupted.
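    A minimal sketch of that cloning step, run from another Linux machine with both disks attached. The device names /dev/sdX (failing disk) and /dev/sdY (replacement) are placeholders; verify them with lsblk before running anything. The DRY_RUN guard is an added safety convention for illustration, not part of ddrescue itself:

    ```shell
    #!/bin/sh
    # Sketch: bitwise-copy a failing disk with GNU ddrescue.
    # SRC/DST are hypothetical placeholders -- verify with lsblk first!
    SRC=/dev/sdX   # failing disk (the old sdc)
    DST=/dev/sdY   # replacement disk
    LOG=rescue.map # ddrescue map file; allows resuming after interruption

    DRY_RUN=${DRY_RUN:-1}  # default: print the commands instead of running them

    run() {
        if [ "$DRY_RUN" -eq 1 ]; then echo "would run: $*"; else "$@"; fi
    }

    # Pass 1: copy everything readable, skipping bad areas quickly (-n).
    run ddrescue -f -n "$SRC" "$DST" "$LOG"
    # Pass 2: retry the bad areas a few times (-r3); unreadable sectors
    # stay as zero-filled holes on the copy.
    run ddrescue -f -r3 "$SRC" "$DST" "$LOG"
    ```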
  • bradcornford
    bradcornford Posts: 2
    edited April 2021
    Thanks for the reply Mijzelf. So should I get another new disk to replace sdc as well? Or, as I still have the old sdb disk, could I put that back and replace sdc with my new disk instead, hoping it rebuilds sdc? Although I'm not sure whether it already tried to rebuild sdb.

    For reference, my array is 90% full, if that makes any difference.

  • Mijzelf
    Mijzelf Posts: 2,600  Guru Member
    edited April 2021
    So should I get another new disk, to replace sdc also?

    Depends on the value of your data, and the SMART values. It is possible that the sector blocking the rebuild is healthy but just lost its data, causing a sector checksum error. In that case the disk will be fine if you just overwrite the sector. If it's a hard error, i.e. the sector died, it can mean the disk is dying. Ideally you keep track of the number of dead sectors, to see if it's increasing. It's normal for a TB-size disk to have some (<~10) dead sectors, but the number shouldn't grow (fast).
    The command to look at this is
    smartctl -a /dev/sdc
    The attribute name for dead sectors is Reallocated_Sector_Ct. It tells how many sectors have been replaced by a spare sector. If this number (the raw value) is growing, the disk should be replaced. Of course it's possible that there are dead sectors the disk isn't aware of. They only pop up when the sector is accessed. On a healthy system you can force that with
    cat /dev/sdc >/dev/null
    which copies the content of the whole surface to the universal recycle bin. It takes hours. Don't do that on a degraded array; odds are that a wonky disk will die under that load. And always read Reallocated_Sector_Ct first, as you want to know if it increases.
    A soft error shows up as Current_Pending_Sector: the number of sectors the disk knows have lost their data, but which are otherwise fine.
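    A small sketch for pulling just these two attributes out of the smartctl output. On the NAS you'd pipe the real thing, `smartctl -a /dev/sdc | awk ...`; the sample text and its raw values below are hypothetical, for illustration only:

    ```shell
    #!/bin/sh
    # Extract the two SMART attributes discussed above. Sample lines are
    # hypothetical stand-ins for real `smartctl -a /dev/sdc` output.
    sample='  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       3
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8'

    # The last column of a SMART attribute line is the raw value; that is
    # the number to watch over time.
    echo "$sample" | awk '/Reallocated_Sector_Ct|Current_Pending_Sector/ {print $2, $NF}'
    ```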
    Or as I still have the old sdb disk, i could put that back, then replace sdc with my new disk instead and hope it rebuilds sdc?
    As said, I don't understand why the rebuild cycle is reproducible for you. The array should be down, with only two healthy members left. It shouldn't accept sdc as a healthy member on reboot, as its timestamp is older than the other members' timestamps. The same is true for sdb. It has an even further diverging timestamp, and it can even contain other data: your array can be written to as a degraded array, making sdb no longer 'fit'. (For sdc that is not the case; as soon as it was dropped from the array, the array went down, so no further writing was possible.)
    But I don't know why your sd[acd] array still assembles, so I can't say whether sd[abd] will.

    I can't look in your wallet, but I would replace sdc with a bitwise copy made on another system (the NAS doesn't have ddrescue), and examine sd[ad] for health. Especially if you don't have a backup. (Which you should have. Disks are sensitive devices that will die, and if they die (semi-)simultaneously you won't have time to replace them.)
    For reference, my array is 90% full is that makes any difference.
    Then the odds that the problem sector is not in use by the filesystem are only 10%.


