Zyxel NSA325 v2 shows status degraded after HDD replacement

Hi,

One of my two 2TB HDDs (running in RAID1 mode) died a week ago, so I replaced it with another 2TB HDD.
Initially after the replacement all went well: the system started the repair, and after a while the volume status switched to "Healthy".

Now it shows "Degraded" again, even though both disks show up as "Healthy" in the S.M.A.R.T menu.

After rebooting the NSA, the volume status shows "Recovering" for a short while with an increasing percentage (the last value was 71.3%), but it switches to "Degraded" after a few minutes.

Selecting "Repair Volume" in the Volume menu leads to an error message in the bottom status line: "Disk capacity must be equal to or greater than the smallest disk in the RAID."

The 2 Disks currently in use are:
Seagate ST2000VN000 (old)

Seagate ST2000VN004 (new)

Both showing 1.82 Capacity and a Healthy status.

Thanks in advance for any help or useful hints.

BR
lintux

Best Answer

  • Mijzelf
    Mijzelf Posts: 1,796  Guru Member
    Accepted Answer
    lintux said:
    dmesg shows an over and over repeating message:
     --- wd:1 rd:2
     disk 0, wo:0, o:1, dev:sda2
     disk 1, wo:1, o:1, dev:sdb2
    Your system hit a bug, somewhere. This should be in the log only once per state change of the array, AFAIK.
    I *think* you have a 'pending read error' on your source disk. This is a (physically healthy) sector whose checksum doesn't match (some bit toggled), so it cannot be read. It is not considered a hardware fault, as the sector can be written normally, after which it can be read again. So S.M.A.R.T. doesn't fail the disk (although the pending read errors can be found in the details), but the rebuild of the array stops there, and as the sector is never written, the problem doesn't resolve itself over time.
    I don't know what would be a good strategy now. One option would be to back up everything and create a new volume, if you have enough external storage for that.
    A complicating factor is that ZyXEL in their wisdom decided to switch off the package server for the EOL NASses, so you can't reinstall any package.
    Another option is to remove the new disk and fill up the volume to the brim, in the hope that the pending read error is in a sector which is not in use by the filesystem.
    This can be done with

    dd if=/dev/zero of=/i-data/md0/admin/bigfile bs=16M

    After that, remove the 'bigfile' and insert the new disk again to let the array rebuild. With luck, the sector with the pending read error has been overwritten. A problem with this approach is that you are stressing the disk, while its twin brother already died. So what are the odds you kill this remaining disk?

    That is outside my cultural luggage, I'm afraid. Google was able to tell me you are speaking about some British television show I never heard of.
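
Note: the pending read error theory above could be checked from the shell, assuming `smartctl` (from smartmontools) is present on the box, which this thread does not confirm. A minimal sketch follows; the function name `pending_sectors` is made up for illustration, and it relies on the usual ten-column `smartctl -A` attribute table, where attribute 197 is Current_Pending_Sector.

```shell
# Print the raw value of SMART attribute 197 (Current_Pending_Sector).
# Input: the attribute table printed by "smartctl -A <device>" on stdin.
# Assumption: the standard table layout, where column 1 is the attribute
# ID and column 10 is RAW_VALUE.
pending_sectors() {
    awk '$1 == 197 { print $10 }'
}

# Hypothetical usage on the old (source) disk:
#   smartctl -A /dev/sda | pending_sectors
```

A value above zero on the old disk would be consistent with the explanation given; zero would not rule out other read problems.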

Answers

  • Mijzelf
    Mijzelf Posts: 1,796  Guru Member
    Can you log in over SSH or over the Telnet backdoor (as root, using the admin password) and post the output of

    cat /proc/mdstat
    cat /proc/partitions
    mdadm --examine /dev/sd[ab]2
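
Note: the first of these files already tells whether the kernel considers an array degraded. In `/proc/mdstat` the counter in square brackets is `[configured members/active members]`, so `[2/1]` means one of two members is missing or not yet in sync. As a rough sketch (the function name `degraded_arrays` is an invention for this example), this can be checked with plain awk:

```shell
# List md arrays that /proc/mdstat reports with fewer active members
# than configured, e.g. "[2/1] [U_]".
# Input: the contents of /proc/mdstat on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }              # remember the current array
        match($0, /\[[0-9]+\/[0-9]+\]/) {        # find the "[n/m]" counter
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name  # fewer active than configured
        }'
}

# Hypothetical usage on the NAS:
#   degraded_arrays < /proc/mdstat
```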

  • lintux
    lintux Posts: 4
    Sure, thanks for the reply ..

    cat /proc/mdstat:

    Personalities : [linear] [raid0] [raid1]
    md0 : active raid1 sda2[0] sdb2[2]
          1952996792 blocks super 1.2 [2/1] [U_]

    unused devices: <none>


    cat /proc/partitions:

    major minor  #blocks  name

       7        0     143360 loop0
       8        0 1953514584 sda
       8        1     514048 sda1
       8        2 1952997952 sda2
       8       16 1953514584 sdb
       8       17     514048 sdb1
       8       18 1952997952 sdb2
      31        0       1024 mtdblock0
      31        1        512 mtdblock1
      31        2        512 mtdblock2
      31        3        512 mtdblock3
      31        4      10240 mtdblock4
      31        5      10240 mtdblock5
      31        6      48896 mtdblock6
      31        7      10240 mtdblock7
      31        8      48896 mtdblock8
       9        0 1952996792 md0


    mdadm --examine /dev/sd[ab]2:


    /dev/sda2:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : dfcd2b59:6c86c85e:7ca3a57f:6afb15af
               Name : NSA325-v2:0
      Creation Time : Thu Feb 19 19:54:55 2015
         Raid Level : raid1
       Raid Devices : 2

     Avail Dev Size : 1952996928 (1862.52 GiB 1999.87 GB)
         Array Size : 1952996792 (1862.52 GiB 1999.87 GB)
      Used Dev Size : 1952996792 (1862.52 GiB 1999.87 GB)
        Data Offset : 2048 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : cf9b8394:f0f82f74:969d3c47:6caf059e

        Update Time : Fri Oct  8 15:57:58 2021
           Checksum : af584a48 - correct
             Events : 221724


       Device Role : Active device 0
       Array State : AA ('A' == active, '.' == missing)
    /dev/sdb2:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x2
         Array UUID : dfcd2b59:6c86c85e:7ca3a57f:6afb15af
               Name : NSA325-v2:0
      Creation Time : Thu Feb 19 19:54:55 2015
         Raid Level : raid1
       Raid Devices : 2

     Avail Dev Size : 1952996928 (1862.52 GiB 1999.87 GB)
         Array Size : 1952996792 (1862.52 GiB 1999.87 GB)
      Used Dev Size : 1952996792 (1862.52 GiB 1999.87 GB)
        Data Offset : 2048 sectors
       Super Offset : 8 sectors
    Recovery Offset : 2830471168 sectors
              State : clean
        Device UUID : 85b632ed:e8a6d661:cb6e0426:0a1ddc0e

        Update Time : Fri Oct  8 15:57:58 2021
           Checksum : ee01e1cd - correct
             Events : 221724


       Device Role : Active device 1
       Array State : AA ('A' == active, '.' == missing)



    To me it looks not really bad?!?

    BR
    lintux

  • Mijzelf
    Mijzelf Posts: 1,796  Guru Member
    lintux said:
    To me it looks not really bad?!?
    Well, the bad thing is that we shouldn't be seeing this. Both RAID members agree that they are in a healthy array, and both were last updated at the same time, today around 4PM. That will be UTC, so that is around the time you posted. Yet the RAID manager in the kernel reports that the array is degraded.
    I don't think you need to worry about the weird size message from the firmware. The firmware has the same info, and doesn't know what is happening either.
    Maybe the kernel log has some interesting info (dmesg), or the array itself. (mdadm --detail /dev/md0)
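
Note: one detail in the posted `mdadm --examine` output may deserve a second look. /dev/sdb2 carries a `Recovery Offset` of 2830471168 sectors together with Feature Map 0x2, which normally indicates a member whose rebuild has not finished. Assuming the offset is counted in 512-byte sectors and the size in /proc/mdstat in 1 KiB blocks, the sketch below converts it into a percentage of the array. The function name `recovery_percent` is made up for illustration.

```shell
# Convert an mdadm "Recovery Offset" (assumed: 512-byte sectors) into a
# percentage of the array size from /proc/mdstat (assumed: 1 KiB blocks).
# awk computes in floating point, which sidesteps overflow on shells
# that only have 32-bit integer arithmetic.
recovery_percent() {
    # $1 = recovery offset in sectors, $2 = array size in 1 KiB blocks
    awk -v off="$1" -v blocks="$2" \
        'BEGIN { printf "%.1f\n", off * 100 / (blocks * 2) }'
}

# Values taken from the output posted in this thread.
recovery_percent 2830471168 1952996792
```

With the numbers from this thread the result is a little above 72%, close to the 71.3% last seen in the web interface before the status flipped back to "Degraded". That would fit a rebuild that stalls at roughly the same position every time.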

  • mMontana
    mMontana Posts: 424  Master Member
    Clarkson Mode ON.
    Some says... the firmware cannot understand mdadm?
  • Mijzelf
    Mijzelf Posts: 1,796  Guru Member
    Clarkson?

    All ZyXEL NASses understand mdadm, even the single disk ones.
  • lintux
    lintux Posts: 4
    dmesg shows an over and over repeating message:
     --- wd:1 rd:2
     disk 0, wo:0, o:1, dev:sda2
     disk 1, wo:1, o:1, dev:sdb2


    mdadm --detail /dev/md0
    /dev/md0:
            Version : 1.2
      Creation Time : Thu Feb 19 19:54:55 2015
         Raid Level : raid1
         Array Size : 1952996792 (1862.52 GiB 1999.87 GB)
      Used Dev Size : 1952996792 (1862.52 GiB 1999.87 GB)
       Raid Devices : 2
      Total Devices : 2
        Persistence : Superblock is persistent

        Update Time : Sat Oct  9 10:05:12 2021
              State : clean, degraded
     Active Devices : 1
    Working Devices : 2
     Failed Devices : 0
      Spare Devices : 1

               Name : NSA325-v2:0
               UUID : dfcd2b59:6c86c85e:7ca3a57f:6afb15af
             Events : 222350

        Number   Major   Minor   RaidDevice State
           0       8        2        0      active sync   /dev/sda2
           2       8       18        1      spare rebuilding   /dev/sdb2

    I won't attempt an interpretation this time ..
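
Note: the repeating kernel lines can be decoded to some extent. Going by the md RAID1 driver's configuration printout, `wd` is the number of working disks and `rd` the number of configured RAID disks, while `wo:1` marks a member that is not in sync and `o:1` a member that is operational (not faulty). This reading is an assumption based on the driver source, not something confirmed in the thread. A small sketch with a made-up function name, `decode_raid1_conf`:

```shell
# Describe each "disk N, wo:X, o:Y, dev:NAME" line from the kernel log.
# Input: kernel log text (e.g. from dmesg) on stdin.
# Assumed meaning: wo:1 = not in sync, o:0 = faulty.
decode_raid1_conf() {
    awk '
        /wo:[01], o:[01], dev:/ {
            gsub(/,/, "")                       # strip commas, re-split fields
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^wo:/)  wo  = substr($i, 4)
                if ($i ~ /^o:/)   op  = substr($i, 3)
                if ($i ~ /^dev:/) dev = substr($i, 5)
            }
            if (op == "0")      state = "faulty"
            else if (wo == "1") state = "not in sync"
            else                state = "in sync"
            print dev ": " state
        }'
}

# Hypothetical usage on the NAS:
#   dmesg | decode_raid1_conf
```

Applied to the lines posted here, this reports sda2 as in sync and sdb2 as not in sync, which matches the "spare rebuilding" state shown by `mdadm --detail`.
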
  • mMontana
    mMontana Posts: 424  Master Member
    edited October 9
    Mijzelf said:
    Clarkson?
    Jeremy Clarkson, Mister "High Gear no-more". A re-interpretation of the catchphrase about the Stig:
    "Some say..." "All we know is, he's called the Stig."
  • lintux
    lintux Posts: 4
    Thanks for the help. I'll try to back up all data, replace the old HDD with a new model (which might be wise anyway, so as not to have a pair of disks so different in age) and build the volumes from scratch (an approach I thought I could avoid with a RAID 1 in place).

    BR lintux