NAS 542 raid 5 volume down

All Replies

  • Mijzelf
    Mijzelf Posts: 2,790  Guru Member
    250 Answers 2500 Comments Friend Collector Seventh Anniversary
    Hmm... are we back at square 1?
    Oh right. Forgot about that, but now I look back, yes. The unreliable disk is exchanged, but the raid array is still the same.
    Your original array had
      Creation Time : Tue Apr 28 15:12:52 2015
     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
             Layout : left-symmetric
         Chunk Size : 64K
    The new array
      Creation Time : Wed Oct  5 13:17:25 2022
     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405312 (8371.74 GiB 8989.09 GB)
      Used Dev Size : 5852270208 (2790.58 GiB 2996.36 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
             Layout : left-symmetric
         Chunk Size : 64K
    So indeed there is a difference in array size. The old one has 8778405888 blocks (of 1 KiB), the new one 8778405312. That's a difference of 576 KiB.
    [  533.041160] EXT4-fs (md2): bad geometry: block count 2194601472 exceeds size of device (2194601328 blocks)
    These are blocks of 4 KiB, and the two counts match the sizes of the two arrays (old and new). The problem is here:
     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
      Used Dev Size : 5852270208 (2790.58 GiB 2996.36 GB)
    Each array member has 384 blocks (of 512 bytes, this time) not in use. With 3 data members in a 4-disk raid5 that is 1152 blocks in total, which is 576 KiB. So the math fits. I don't know why these blocks are not used. A quick google didn't reveal much. You can first try
    mdadm --grow /dev/md2 --size=max
    to reclaim the unused blocks.
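
    The size arithmetic above can be re-checked with a few lines of Python. This is only a sketch: the figures are copied from the mdadm and kernel output quoted in this post, and the variable names are made up for illustration.

```python
# Figures copied from the mdadm --examine output and the kernel log quoted above.
SECTOR = 512   # bytes; mdadm reports device sizes in 512-byte sectors
KIB = 1024

old_array_kib = 8778405888      # Array Size of the original array (1 KiB blocks)
new_array_kib = 8778405312      # Array Size of the re-created array
avail_dev_sectors = 5852270592  # Avail Dev Size per member
used_dev_sectors = 5852270208   # Used Dev Size per member
data_members = 4 - 1            # 4-disk raid5: one member's worth of space holds parity

array_gap_kib = old_array_kib - new_array_kib
unused_per_member = avail_dev_sectors - used_dev_sectors
unused_total_kib = unused_per_member * data_members * SECTOR // KIB

# ext4 block counts from the "bad geometry" message, in 4 KiB blocks
fs_blocks = 2194601472
dev_blocks = 2194601328

print("array size gap:", array_gap_kib, "KiB")
print("unused sectors per member:", unused_per_member)
print("unused capacity across data members:", unused_total_kib, "KiB")
print("filesystem matches old array:", fs_blocks * 4 == old_array_kib)
print("device matches new array:", dev_blocks * 4 == new_array_kib)
```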

  • BjoWis
    BjoWis Posts: 33  Freshman Member
    First Comment Friend Collector
    Tried that... and that seems to have turned things in a new direction =)
    ~ # mdadm --grow /dev/md2 --size=max
    mdadm: component size of /dev/md2 has been set to 2926135296K

    I then tried to mount it manually

    ~ # mkdir -p /tmp/mountpoint
    ~ # mount /dev/md2 /tmp/mountpoint
    ~ #
    Then I rebooted the NAS and opened up the WUI;




    Tonight my delivery from Amazon Prime arrives with a brand new Seagate Ironwolf 3 TB disk. I will insert that as Disk1 and hopefully it should be possible to repair the degraded Volume.

    Any other suggestions?

  • Mijzelf
    BjoWis said:

    Any other suggestions?

    Yes, two. First, make sure you have backups in the future. You have seen that raid is not a backup. Second, you lost 40MB of data in the ddrescue copy. It is possible that there is now an inconsistency in the filesystem. The problem is that inconsistencies can grow, causing data loss in the future. So you should run e2fsck on it. If I remember correctly, the web interface has an option for that. If not, it gets complicated, as you can't repair a mounted filesystem, and the firmware doesn't want you to unmount it. But you can run
    e2fsck -nf /dev/md2
    (which doesn't actually change anything) on the mounted filesystem, to see *if* there are any problems, before bullying the firmware. Do not use the filesystem while e2fsck is running; it doesn't like changes on the fly.

  • BjoWis
    Ok, here's the result, tried it 3 times;

    1st try;
    ~ # e2fsck -nf /dev/md2
    e2fsck 1.42.12 (29-Aug-2014)
    Warning!  /dev/md2 is mounted.
    Warning: skipping journal recovery because doing a read-only filesystem check.
    Pass 1: Checking inodes, blocks, and sizes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    Error reading block 1067450378 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps.  Ignore error? no

    e2fsck: Can't read a block bitmap while retrying to read bitmaps for /dev/md2
    e2fsck: aborted

    2nd try;

    ~ # e2fsck -nf /dev/md2
    e2fsck 1.42.12 (29-Aug-2014)
    Warning!  /dev/md2 is mounted.
    Warning: skipping journal recovery because doing a read-only filesystem check.
    Pass 1: Checking inodes, blocks, and sizes
    Error reading block 650 (Attempt to read block from filesystem resulted in short read).  Ignore error? no

    Error while iterating over blocks in inode 7: Attempt to read block from filesystem resulted in short read
    e2fsck: aborted

    3rd try;
    ~ # e2fsck -nf /dev/md2
    e2fsck 1.42.12 (29-Aug-2014)
    Warning!  /dev/md2 is mounted.
    Warning: skipping journal recovery because doing a read-only filesystem check.
    Pass 1: Checking inodes, blocks, and sizes
    Error reading block 650 (Attempt to read block from filesystem resulted in short read).  Ignore error? no

    Error while iterating over blocks in inode 7: Attempt to read block from filesystem resulted in short read
    e2fsck: aborted


  • Mijzelf
    Sorry, can't interpret that. In most cases e2fsck would yell, but ultimately do its job and tell you whether the filesystem is consistent. But apparently the filesystem is too busy. (I hope. 'Error reading block 650' is scary, but I can't imagine it's real, because in that case the disk would have been dropped from the array, and you would have mentioned that.)
    So you have the choice to ignore possible inconsistencies, or to unmount the filesystem to be able to scan it. Unmounting is not a trivial task, because the system volume is in use in several ways. But you can hook the shutdown script to let it do the task for you. I wrote about that here. (Sorry for the layout. You'll have to read between the html tags. $%@! forum software.)
  • BjoWis
    Ok, I've inserted the new disk now and repaired the RAID via the WUI. I started it last night and this morning the re-synching had reached 40 %.

    But now almost 12 hours later, the WUI doesn't respond when I try to log on.

    But I can log on via Putty...

    =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2022.11.10 16:46:58 =~=~=~=~=~=~=~=~=~=~=~=
    login as: admin
    admin@192.168.1.157's password:


    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ $ cat /proc/mdstat
    Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid5 sdd3[4](S) sda3[1] sdb3[3] sdc3[2](F)
          8778405888 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/2] [_U_U]
          
    md1 : active raid1 sdd2[5] sda2[6] sdb2[4] sdc2[2]
          1998784 blocks super 1.2 [4/4] [UUUU]
          
    md0 : active raid1 sdd1[7] sda1[4] sdc1[5] sdb1[6]
          1997760 blocks super 1.2 [4/4] [UUUU]
          
    unused devices: <none>
    ~ $ su
    Password:
    BusyBox v1.19.4 (2022-08-11 15:13:21 CST) built-in shell (ash)
    Enter 'help' for a list of built-in commands.

    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : f7c083fd:fe37f383:55424937:52ec4bd2

        Update Time : Thu Nov 10 16:44:31 2022
           Checksum : 478913a1 - correct
             Events : 1146

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 1
       Array State : .A.A ('A' == active, '.' == missing)
    /dev/sdb3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 3e046eeb:3bac7e38:2c4e2408:d05d7a1d

        Update Time : Thu Nov 10 16:44:31 2022
           Checksum : 608487b - correct
             Events : 1146

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 3
       Array State : .A.A ('A' == active, '.' == missing)
    /dev/sdc3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 803b4d20:57570076:2d7c62d1:e4bf567a

        Update Time : Thu Nov 10 09:48:48 2022
           Checksum : 9e815971 - correct
             Events : 1128

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 2
       Array State : AAAA ('A' == active, '.' == missing)
    /dev/sdd3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 555ccb7e:e9b29adc:2b39eea0:9329542f
               Name : NAS542:2  (local to host NAS542)
      Creation Time : Wed Oct  5 13:17:25 2022
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 5852270592 (2790.58 GiB 2996.36 GB)
         Array Size : 8778405888 (8371.74 GiB 8989.09 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : bbb0ea76:a788ca76:4b8dced1:3fac6750

        Update Time : Thu Nov 10 16:44:31 2022
           Checksum : cc685ef3 - correct
             Events : 1146

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : spare
       Array State : .A.A ('A' == active, '.' == missing)
    ~ #

    Is it safe to reboot the box? Or is there some other way to see how the re-sync is progressing?
  • Mijzelf
    You have not been very lucky lately. The array is down. This morning at 09:48:48 (UTC, I think) disk sdc was dropped from the array. And as the new disk was not yet fully built, it is now 'spare'.
    I suppose you'll see an I/O error on sdc, when you run dmesg now.
    Normally you can see the rebuild percentage in /proc/mdstat, and I think the web interface also reads and shows that. Rebooting now won't do any more damage.
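
    The dropped member shows up in the 'mdadm --examine' output as the one whose Events counter and Update Time lag behind the others. A minimal Python sketch of that comparison follows. The sample text is abridged from the output posted above, and the helper names (parse_examine, stale_members) are hypothetical, not part of mdadm.

```python
import re

# Abridged from the mdadm --examine output posted above (only the fields used here).
EXAMINE = """\
/dev/sda3:
    Update Time : Thu Nov 10 16:44:31 2022
         Events : 1146
    Device Role : Active device 1
/dev/sdb3:
    Update Time : Thu Nov 10 16:44:31 2022
         Events : 1146
    Device Role : Active device 3
/dev/sdc3:
    Update Time : Thu Nov 10 09:48:48 2022
         Events : 1128
    Device Role : Active device 2
/dev/sdd3:
    Update Time : Thu Nov 10 16:44:31 2022
         Events : 1146
    Device Role : spare
"""

def parse_examine(text):
    """Return {device: {field: value}} for an mdadm --examine dump."""
    members, current = {}, None
    for line in text.splitlines():
        header = re.match(r"^\s*(/dev/\S+):\s*$", line)
        if header:
            current = members.setdefault(header.group(1), {})
        elif current is not None and " : " in line:
            key, value = line.split(" : ", 1)
            current[key.strip()] = value.strip()
    return members

def stale_members(members):
    """Members whose event counter lags behind the highest one seen."""
    newest = max(int(m["Events"]) for m in members.values())
    return {dev: m for dev, m in members.items() if int(m["Events"]) < newest}

members = parse_examine(EXAMINE)
for dev, info in stale_members(members).items():
    print(dev, "stopped at event", info["Events"], "on", info["Update Time"])
```

    On this sample only /dev/sdc3 is reported, which fits the 09:48:48 timestamp mentioned above.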
  • BjoWis
    OMG - this never ends...

    I ran the dmesg command - you'll find it attached. So another of the old disks is gone?
    ~ # /proc/mdstat
    sh: /proc/mdstat: Permission denied
    So I need to reboot, to see what's wrong with the array?
  • BjoWis
    I've rebooted the NAS and now I can reach the WUI again;

    The array is down, all disks are identified as hot spares.

    What about removing Disk 1, 2 and 4 and then try to run e2fsck on Disk 3 (the one I ddrescued)?



  • Mijzelf
    So indeed disk sdc was dropped:
    [83596.572175] end_request: I/O error, dev sdc, sector 2854795984
    [83596.578040] md/raid:md2: read error not correctable (sector 2846796496 on sdc3).
    [83596.585458] md/raid:md2: Disk failure on sdc3, disabling device.
    [83596.585462] md/raid:md2: Operation continuing on 2 devices.
    Sector 2846796496 is at around 1.46TB, so on a 3TB disk that is a bit under 50%. 'Operation continuing on 2 devices': at that moment the array was down, and the filling of the new disk stopped. The frozen web interface is a bug; I suppose the backend can't handle a failed array rebuild.
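
    As a rough check of where that sector sits on the member, here is a small Python sketch. It assumes 512-byte sectors and uses the Avail Dev Size from the 'mdadm --examine' output earlier in the thread as the member size. The variable names are illustrative only.

```python
SECTOR_BYTES = 512            # assumed logical sector size
failed_sector = 2846796496    # from the md/raid:md2 log line (offset on sdc3)
member_sectors = 5852270592   # Avail Dev Size from mdadm --examine

offset_tb = failed_sector * SECTOR_BYTES / 1e12   # decimal terabytes
fraction = failed_sector / member_sectors         # position within the member

print(f"offset: {offset_tb:.2f} TB, {fraction:.0%} into the member")
```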
    ~ # /proc/mdstat
    sh: /proc/mdstat: Permission denied
    /proc/mdstat is a (virtual) file (like all files in /proc/, it gives a live peek into kernel structures). It is not a command, so you can't execute it. To read it you need to view the content: 'cat /proc/mdstat'.
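
    Because it is just text, the content is also easy to read from a script. The sketch below parses the md2 entry exactly as it was posted earlier in this thread. The function name summarize and the FLAGS table are made up for illustration, and the 'at least 3 of 4' rule applies to this 4-disk raid5 only.

```python
import re

# The md2 entry as it appeared in the /proc/mdstat output posted earlier.
MDSTAT = """\
md2 : active raid5 sdd3[4](S) sda3[1] sdb3[3] sdc3[2](F)
      8778405888 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/2] [_U_U]
"""

FLAGS = {"S": "spare", "F": "faulty"}

def summarize(mdstat_text):
    """Pull member flags and the [total/working] counters out of one array entry."""
    members = {}
    for name, flag in re.findall(r"(\w+)\[\d+\](?:\((\w)\))?", mdstat_text):
        members[name] = FLAGS.get(flag, "active")
    total, working = map(int, re.search(r"\[(\d+)/(\d+)\]", mdstat_text).groups())
    return members, total, working

members, total, working = summarize(MDSTAT)
print(members)
print(f"{working} of {total} members working")
# A 4-disk raid5 needs at least 3 working members to serve data.
print("array usable:", working >= total - 1)
```

    On the NAS itself the same function could be fed the live text with open("/proc/mdstat").read(), provided a Python interpreter is available there, which is not a given on this firmware.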
    What about removing Disk 1, 2 and 4 and then try to run e2fsck on Disk 3 (the one I ddrescued)?
    You can't. e2fsck checks the filesystem. There is no filesystem on a single array member. The array looks like ABCdABcDAbCDaBCDABCd, where each character is a 64kB chunk, the A chunks are from disk 1, the B from disk 2, ... . The capitals are 'plain data', the lower case ones are parity over the other 3. The filesystem (basically a database with files) is put on top of this array (on the capitals; when the array is degraded it has to calculate the missing capitals from the other 2 and the parity). What you are proposing now is to lift out all the C's and try to repair the filesystem on them.
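
    To make that picture concrete, here is a toy Python sketch. It reproduces the letter pattern (which disk holds parity in each stripe) and shows how a missing chunk is recalculated by XOR from the other chunks plus parity. It is an illustration only: it uses 4-byte chunks instead of 64kB, and it ignores the order in which the left-symmetric layout numbers the data chunks within a stripe.

```python
from functools import reduce

DISKS = "ABCD"

def layout(stripes, disks=DISKS):
    """Parity rotates from the last disk towards the first, one stripe at a time."""
    n = len(disks)
    out = []
    for stripe in range(stripes):
        parity_disk = (n - 1) - (stripe % n)
        out.append("".join(d.lower() if i == parity_disk else d
                           for i, d in enumerate(disks)))
    return "".join(out)

def xor(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

print(layout(5))

# One stripe: three data chunks plus their parity (toy 4-byte chunks, not 64 kB).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor(data)

# Lose the second chunk and rebuild it from the two survivors and the parity.
rebuilt = xor([data[0], data[2], parity])
print(rebuilt == data[1])
```

    With two chunks of a stripe missing there is nothing left to XOR against, which is why the array stops as soon as a second member drops.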
    If you have pulled all disks except the 'ddrescued' one, you can run 'mdadm --examine /dev/sd[abcd]3' on it, to see if it is indeed the problem one. In that case its 'Device Role' is 'Active device 2'. In that case this shouldn't have happened.
    It happens often that a raid5 array gets degraded, and then goes down while rebuilding with a new member. The reason is that raid is pretty dumb. It doesn't know about the filesystem, but simply reads the whole surface to calculate the content of the 4th member. When a read error occurs, that disk is dropped from the array. When the array was already degraded, that is in most cases not a good action. The raid manager doesn't even know if that data is used at all.
    As (in your case) the array is only 50% used, there is a chance that the unreadable sector has never been touched. So a bad sector only gets detected when you are trying to rebuild your degraded array.
    When sdc is an 'old' disk, this can happen. When you create a raid array on 4 new disks, put a filesystem on it and use that for 7 years, and the array is never full, then there are millions of sectors which have never been used. But if it is the new disk, this shouldn't happen. We know all sectors have been written, as it is a bit-by-bit copy of another disk. And so all sectors should be readable for at least some years.
    When it's an old disk, it could be a Current Pending Sector problem. On hardware level that is nothing serious, just very inconvenient when it pops up during a raid rebuild. When it's the new disk, a sector you wrote last week is now unreadable. That is a serious error, and that disk can't be trusted anymore.
    If the new disk is unreliable, you have 2 options. You can re-create the array with the 3 members you had before you added the 4th disk. Have a look at mdadm --examine to find out the Device Role of each partition; the command at the beginning of this thread might not fit, as the device nodes (sda, sdb, ...) may have changed. When the array is up again, back up all data to an external disk, in the hope that the problem sector is not used and won't trigger a drop again.
    The other option is to ddrescue the disk. (This time it probably won't take that long, as the new disk is mainly, well, new.)
    After that, RMA the new disk.
    When the error is on an old disk, you can ddrescue it to the latest disk you got, then re-create the array, and when it's fine add the old disk as 4th member. (Or phase out the 2 remaining old disks. They are 7 years old, and two of them have already died completely.)
    When I re-read this I think my writing wouldn't get a high mark, but I hope you can sift the information I try to offer.
