NAS542: HDD issues, but RAID status "healthy"

TomasMalina · October 2023

Hi,

I'm having some issues with my RAID on NAS542, 4x8TB (ST8000DM004).

The problem: Disk 2 is likely degraded, but still running. I would like to ask for some assistance resolving it.

Details:
I noticed that when I log in to mycloud.zyxel.me, the NAS status shows "RAID status Warning", and the used disk space on one of the two logical partitions is displayed incorrectly (it shows 350 GB used, while the data that I can access via WebDAV or the web interface is surely over 1 TB). However, when I log in to the NAS web interface, it only shows that the RAID is "healthy" (no warning). I am not sure whether the used-space readout for the other partition is correct, but it is probably at least in the right ballpark.

I searched the forum and the web for similar problems. Here is what I found so far:

-Opening tweaks and viewing the disk log, I get lots of errors similar to this
"[ 605.917511] EXT4-fs error (device dm-2): ext4_lookup:1047: inode #52101255: comm zyxel_file_moni: deleted inode referenced: 62128248"

-I logged in over ssh as root and ran "cat /proc/mdstat", which shows that the arrays are resyncing. However, after observing for several hours/days, the resync never finishes. It may reach about 10 % and then "crash" and start over (the estimated time to finish is 10,000-13,000 minutes; a few hours ago I saw it at 11+ %, now it shows 0.3 %, so it must have started over, probably after running into some errors).

-I ran "smartctl -a /dev/sdd" on the problematic disk, which gave me errors like this:

"Error 55 occurred at disk power-on lifetime: 31491 hours (1312 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH

40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

25 00 00 ff ff ff ef 00 3d+05:23:11.286 READ DMA EXT
ef 10 02 00 00 00 a0 00 3d+05:23:11.268 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 e0 00 3d+05:23:11.241 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 a0 00 3d+05:23:11.238 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 3d+05:23:11.226 SET FEATURES [Set transfer mode]"

The SMART check, however, displays
"SMART overall-health self-assessment test result: PASSED"

-All of the values in SMART read either "old-age" or "pre-fail".

-I should probably try to run e2fsck (or so I found), but I don't really know how (the drive/device needs to be unmounted and I don't know if I can do that when the RAID is trying to sync). I can access (I think) all of the data on the logical partition for now, but I'm afraid that the drive might fail soon.

-I tried to set up the "smartctl -t short" test on the problematic drive, but I am not sure if it did something or how to view the result, or even if it can repair any errors. (I did reboot the NAS after setting up the test, hoping it will run on startup.)

-Once before I had to replace a failed drive, and the arrays then got rebuilt (I suppose so, since I was able to access the NAS again with all the data after about 10 days of rebuilding). Thus, while sd[cdf] all have ~32k hours of running time, sde has "only" 18k hours.

-Probably unrelated issues:

-I commonly have trouble logging in as admin into the web interface - I get an error for wrong credentials. Rebooting from root over ssh "resets" this issue, sometimes it helps just to log in and log out as regular user into the web interface.

-I started having trouble copying files over webdav - I get various errors in windows file manager resulting in no files being copied. I can, however, connect an external drive to the NAS USB ports and transfer the files using cp command from powershell. This problem doesn't occur always, typically it occurs when large files are involved (100MB+).

-One of the LAN connections shows only as 100MBit instead of 1GBit. The other port works faster.

What would be the best course of action now? I am not experienced with Linux commands at all, so if the guidance can be as step-by-step as possible, it would be much appreciated.

Thank you.

Mijzelf · October 2023

Reallocated Sector Count went from 0 to 7184

Apart from everything else, this is bad. The RSC may be nonzero even on a brand new disk, but a jump of a few thousand in a few days is a death announcement.

In the webinterface it however allowed me to rebuild it on the same disk
- when I had a disk failure before, it didn't allow me to do that until
I replaced the failed drive.

That only means the 1st time the disk had actually died, while the disk now was 'only' dropped because of read/write failure during operation. It's a design decision to allow that disk to be added again. And to be fair, in some cases (Current Pending Sector errors) that is justified.

So, now I'm running a rebuild on the array, suspiciously, it is about
twice as fast as the resync before, so after 1 day I am almost half way
through.

That is still the same disk? I highly recommend you to exchange it.

On another note, after the drive was dropped (even with the RAID rebuilding on the failing disk right now),
now the volumes report the correct used/available space again.

That means that the filesystem was getting invalid sectors from the bad disk. That shouldn't be possible, unless the disk is more unreliable than expected. Normally the filesystem reads a sector from the array, and the array passes that request to the right disk. The disk provides the sector, and it has checksums to verify that the sector holds the right data. If not, the disk reports that upstream, it gets dropped from the array, and the requested sector is calculated from the remaining RAID members. Now apparently the bad disk provided corrupted sectors without even knowing it.

Mijzelf · October 2023

-All of the values in SMART read either "old-age" or "pre-fail".

Those are not values, they are categories. 'old-age' attributes tell something about the aging of the disk, 'pre-fail' attributes something about the chance of an imminent failure. It is the value itself that matters, not the category. And because it's hard to interpret the raw values, the vendor also provides a 'VALUE' which counts down from 200 or 100 (very healthy) to zero (disk is basically dead), and also gives a THRESH, below which you should be concerned.
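To make the VALUE versus THRESH comparison concrete, here is a minimal sketch that reads the attribute table printed by "smartctl -A" and marks every attribute whose normalized VALUE is at or below its THRESH. The function name smart_summary and the status labels are made up for this example, and it assumes the usual ten-column table layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE).

```shell
#!/bin/sh
# smart_summary (hypothetical helper): read "smartctl -A" output on stdin and
# print one line per attribute with its normalized VALUE, THRESH and raw value.
# Assumes the standard ten-column attribute table; only the first token of
# RAW_VALUE is shown.
smart_summary() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 {          # attribute rows start with a numeric ID
            value  = $4 + 0                    # "083" becomes 83
            thresh = $6 + 0
            status = (thresh > 0 && value <= thresh) ? "AT_OR_BELOW_THRESH" : "ok"
            print $2, "value=" value, "thresh=" thresh, "raw=" $10, status
        }
    '
}
```

On the NAS it could be used as "smartctl -A /dev/sdd | smart_summary". A line marked AT_OR_BELOW_THRESH means the vendor's own limit has been reached; an "ok" line does not prove the disk is healthy, as the raw values may still be worth comparing between disks.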

-I should probably try to run e2fsck (or so I found), but I don't really
know how (the drive/device needs to be unmounted and I don't know if I
can do that when the RAID is trying to sync).

Yes, you can. The filesystem is independent from the raid array. Having the filesystem unmounted is not an easy task though, I have written about that in other threads. Basically you need to intercept the shutdown script to be able to get a shell after the firmware unmounted the filesystems.

But seeing the other symptoms I'd guess that one disk is acting up, and needs to be replaced. The raid array shows healthy in the webinterface. Is it possible that the array you see resyncing in /proc/mdstat is not the data array, but the internal firmware array?
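One way to answer that question is to check which md device the progress line in /proc/mdstat belongs to. The sketch below is only an illustration: the function name resyncing_arrays is made up, and it assumes the common mdstat layout where each array block starts with "mdN :" and a running resync or recovery shows a line such as "resync = 0.3% ...".

```shell
#!/bin/sh
# resyncing_arrays (hypothetical helper): print "<array> <percent>" for every
# md array that currently shows a resync or recovery progress line.
# Reads /proc/mdstat by default; another file can be passed for testing.
resyncing_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember which array this block describes
        / (resync|recovery) *=/ {              # progress line of a running resync/recovery
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i  # the field ending in % is the progress
        }
    ' "${1:-/proc/mdstat}"
}
```

Running "resyncing_arrays" a few times over a day shows whether the percentage keeps climbing or falls back to a low value, and which array it concerns.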

TomasMalina · October 2023

Thanks for your reply.

Ok, to give better context, the SMART values from the "faulty" drive are:

I've tried to follow one of your posts for running e2fsck, but I guess I did not do something right, I wasn't allowed to run e2fsck anyways. I followed your post here from Oct 28 2019:

https://community.zyxel.com/en/discussion/8090/nas542-lost-folders-files-inside-share

I checked this, the reply was 0 (no HW issues, I hope):

cat /sys/block/md2/md/mismatch_cnt

I did the setup for e2fsck:

cp /usr/sbin/e2fsck /sbin/
cp -a /usr/lib/libext2fs.so* /lib/
cp -a /usr/lib/libcom_err.so* /lib/
cp -a /usr/lib/libpthread.so* /lib/
cp -a /usr/lib/libe2p.so* /lib/
touch /tmp/wait

vi /etc/init.d/rc.shutdown

#inserted the 5 lines below just above "mdadm -Ss"

telnetd
while [ -f /tmp/wait ]
do
sleep 5
done

then rebooted (reboot command), waited for the NAS to start up (beep) and logged in over ssh as root.

Then, I still got a message

e2fsck 1.42.12 (29-Aug-2014)
/dev/md2 is in use.
e2fsck: Cannot continue, aborting.

When I checked the rc.shutdown file, it no longer had the 5 lines inserted above mdadm -Ss; it must have been overwritten on reboot.

I then tried to repeat the steps and shut down the box using the power button, but it failed to shut down (during this time, I was unable to log in over ssh, connection refused), so after about 15 minutes I pressed the power button long enough for 2 beeps. Then, after booting up, it again had the 5 lines missing in the rc.shutdown file.

The same result (with md2 being mounted and the 5 lines ignored/deleted) happens when I shut down the NAS from web interface.

Mijzelf · October 2023

The SMART values are hard to interpret. But the RAW_VALUEs of Seek_Error_Rate and Raw_Read_Error_Rate are high. I just checked another disk I have here, and they are both 0 there. Unfortunately the unit is not given, and it seems to be a status, not a counter, as in both cases WORST is lower than the current VALUE. In both cases it's still well above the vendor-provided THRESH.

Is this disk a special 'raid' disk? If not, the THRESH might be set too low. The raid manager can be more picky on errors than a filesystem would be. On a seek or read error the filesystem driver just retries, while the raid manager drops the disk from the array. Although your Linux software raid is less picky than a hardware raid controller. Anyway, if the other disks are the same brand/type, compare those values. If the others are significantly lower, you should replace this disk.

About e2fsck, the idea is that the box will not reboot, but during shutdown run a telnet daemon, and wait forever in the 'while ; do done' loop. Then you can login over telnet, and do what has to be done. If it reboots, possibly the file /tmp/wait didn't exist anymore?

That the 5 lines are gone after a reboot (and also the files you copied to /tmp, /lib and /sbin) is normal. The rootfs system is a ramdrive, which is created freshly on each boot.
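Since the edit has to be redone after every boot, it may be easier to script it than to repeat it in vi. This is only a sketch under assumptions: the function name insert_wait_hook is made up, and it assumes rc.shutdown contains a line with "mdadm -Ss", as described earlier in this thread. It inserts the same 5 lines just above the first such line.

```shell
#!/bin/sh
# insert_wait_hook (hypothetical helper): insert the telnetd/wait loop just
# above the first "mdadm -Ss" line of the given script.
insert_wait_hook() {
    file=$1
    grep -q '/tmp/wait' "$file" && return 0    # already patched, nothing to do
    awk '
        /mdadm -Ss/ && !patched {
            print "telnetd"
            print "while [ -f /tmp/wait ]"
            print "do"
            print "    sleep 5"
            print "done"
            patched = 1
        }
        { print }
    ' "$file" > "$file.new" || return 1
    cat "$file.new" > "$file"                  # overwrite in place so the file mode is kept
    rm -f "$file.new"
}
```

After each boot the preparation would then be "touch /tmp/wait" followed by "insert_wait_hook /etc/init.d/rc.shutdown", plus the cp commands for e2fsck and its libraries. If the file has no "mdadm -Ss" line, it is left unchanged.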

TomasMalina · October 2023

All the disks (the second one is the one with fs errors) read similar values, I'd say. They are not special NAS/RAID drives, just regular storage HDDs. All of them are the exact same model, though, ST8000DM004.

Ah, then I misunderstood the old thread.

Anyway, after allowing telnet in UPnP settings, I am still unable to login over telnet, getting "connection failed" error. When I run the rc.shutdown script, it indeed does hang, but returns a lot of errors in the process:

I am not able to get it to hang when shutting down from webinterface, using ssh reboot command, or by pressing the power button on the box.

Now the box is hanging (in the infinite loop), but I get

I did set up the UPnP telnet connection to port 23. When the box is running, I can connect over telnet just fine.

Mijzelf · October 2023

I agree all disks are equally healthy. (Although disk#3 is half the age of the others).

uPnP

You shouldn't need it. uPnP is a request to the router to forward these ports. Unless you want to access your NAS from outside your network, this is unnecessary. (And possibly even harmful, as open ports will attract malware and script kiddies. If your passwords are weak, this is dangerous.)

Normally the shutdown status calls rc.shutdown. Running rc.shutdown by hand won't shut down the box. But for its purpose here, killing all services so that the data array can be unmounted, it could be run manually.

The errors are normal. rc.shutdown calls /etc/init.d/zypkg_controller.sh to stop all running packages. And instead of looking which packages are running, it first tries to stop every known package, even if you haven't installed it.

Unfortunately this script also kills telnetd/ssh, so your connection is lost before the script reaches the place where you edited it. So error messages which can tell why it doesn't work are not visible.

You could try to put your added lines in any file, and run that as a script, to see if it errors out:

touch /tmp/wait

vi myfile
#insert the 5 lines

sh myfile

TomasMalina · October 2023

Ah, I messed up: I did insert the 5 lines, but without the spacing before the sleep. Now I was able to make the NAS hang mid-shutdown with the telnet daemon active, but nevertheless I got errors that it cannot unmount logical drives that are in use:

Now, when I try to run e2fsck, I still get a message that md2 is in use, unfortunately.

Mijzelf · October 2023

You have logical volumes on the raid array. Which means md2 doesn't contain a filesystem, but a logical volume database. I don't know what the device node that e2fsck needs is called (/dev/dm0?), but you can see that if you run

cat /proc/mounts

while up, which lists all devices and their mountpoint. (And some other mount info.) The mountpoint you are looking for is /e-data/<some-hex-string>, you should run e2fsck on the associated device node, in the 'shutdown telnet shell'.
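A minimal sketch of that lookup: it filters /proc/mounts down to the device nodes mounted under /e-data/. The function name data_mounts is made up, and matching /i-data/ as well is an assumption on my side, in case the firmware mounts internal volumes there.

```shell
#!/bin/sh
# data_mounts (hypothetical helper): print "<device> <mountpoint>" for every
# filesystem mounted under /e-data/ or /i-data/ (the latter is an assumption).
# Reads /proc/mounts by default; another file can be passed for testing.
data_mounts() {
    awk '$2 ~ /^\/(e|i)-data\/./ { print $1, $2 }' "${1:-/proc/mounts}"
}
```

Run "data_mounts" while the box is up and note the device in the first column. That device, not /dev/md2, is the one to pass to e2fsck later in the shutdown telnet shell.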

TomasMalina · October 2023

Thank you for being patient with me! I've started e2fsck on /dev/dm-2 that was listed among the errors in Tweaks and now it runs.

Thank you so much for your help, will report back once it finishes.

Edit:

So, I managed to run it, ran it a few times with the -y flag so that I don't have to click yes to repairing every inode. However, it runs for about 10 minutes and then it says "aborted". No other message that would explain why. When I run it again, it does sometimes pop up other inodes (different numbers than last time), sometimes nothing. I am not certain it is continuing from where it left off, seems to me it might just run out of memory at some point and crash. Is there a way to help it run all the way?

I found a similar question on Superuser https://superuser.com/questions/1248832/cant-run-fsck-e2fsck-aborted , where the solution was to create a config file for e2fsck

and then update e2fsck from sourceforge. I connected a 64 GB USB stick to the NAS, created a cache/e2fsck file on it

and linked it into the e2fsck.conf.

However, this results in the error above, e2fsck probably cannot access the external storage?

I didn't succeed in the second step either, command

wget http://downloads.sourceforge.net/project/e2fsprogs/e2fsprogs/v1.47.0/e2fsprogs-1.47.0.tar.gz

results in an error

Mijzelf · October 2023

Memory allocation failed. The box is out of memory. When checking and repairing large filesystems e2fsck can need a lot of memory. And I think there is no swap anymore, at that stage of shutdown.

You can try to add swap on your USB stick:

dd if=/dev/zero of=/e-data/<long-hex-code>/swapspace bs=1M count=4000
mkswap /e-data/<long-hex-code>/swapspace
swapon /e-data/<long-hex-code>/swapspace

That will create a 4GB swapfile on your stick, and enable it as swap. The command

cat /proc/swaps

shows if it is in use.
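The raw numbers in /proc/swaps are in KiB, which makes them a bit hard to read. As a small illustration (the function name swap_usage is made up, and swap file paths without spaces are assumed), they can be converted to MiB like this:

```shell
#!/bin/sh
# swap_usage (hypothetical helper): print size and usage of each active swap
# area in MiB. /proc/swaps lists Size and Used in KiB, after one header line.
# Reads /proc/swaps by default; another file can be passed for testing.
swap_usage() {
    awk 'NR > 1 { printf "%s %d MiB total, %d MiB used\n", $1, $3 / 1024, $4 / 1024 }' \
        "${1:-/proc/swaps}"
}
```

Calling "swap_usage" while e2fsck is running shows whether the swap file is actually being filled.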

TomasMalina · October 2023

I tried to do this, allocated a 12GB swap file on the exFAT USB stick, but got an error on the last command:

I'm reformatting the USB stick to ext4 to see if that helps.

Edit: Yes, ext4 is accepted by the swapon command, exFAT is not. Running the e2fsck command again, hopefully it will continue for more than 15 minutes.

..but, even with a swapfile like this (taken after 1 round of e2fsck, it seems to use only 33MB of it)

I still get the same memory allocation error after about 15 minutes of e2fsck running

(I did modify the config file so that it corresponds to the different hex code of the usb stick after formatting and reconnecting):

Edit2:

If this doesn't work, would running a loop on the e2fsck maybe help run it all the way? It always seems to fix something, maybe running it enough times could fix it all? Or will it need lots of memory even when there is nothing to repair? I'm thinking of maybe something like:

# run e2fsck up to 100 times in a row
counter=0
while [ $counter -lt 100 ]
do
    e2fsck -y /dev/dm-2
    counter=$((counter+1))
done

Is this syntax correct, and how would I go about running it on the NAS (as a superuser I suppose)?
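The retry idea can also be sketched as a small POSIX shell function that stops as soon as the command reports success, instead of always running a fixed number of times. The function name run_until_clean is made up for this example. It relies on the documented e2fsck exit status, where 0 means no errors were found and any other value is a sum of flags (1 and 2: errors corrected, 4: errors left uncorrected, 8: operational error). Whether repeated runs can get past an out-of-memory abort is not something this sketch can promise.

```shell
#!/bin/sh
# run_until_clean (hypothetical helper): run a command up to MAX times and
# stop early once it exits with status 0.
# usage: run_until_clean MAX command [args...]
run_until_clean() {
    max=$1
    shift
    attempt=1
    status=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then status=0; else status=$?; fi
        echo "attempt $attempt: exit status $status"
        [ "$status" -eq 0 ] && return 0        # clean pass, nothing left to repair
        attempt=$((attempt + 1))
    done
    return "$status"                           # still not clean after MAX attempts
}
```

In the shutdown telnet shell it could be saved to a file, loaded with ". ./thatfile", and started as "run_until_clean 100 e2fsck -y /dev/dm-2". You are already root in that shell, so no extra superuser step is needed.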
