NAS542: HDD issues, but RAID status "healthy"


All Replies

  • Mijzelf  Posts: 2,638  Guru Member

    You can read the kernel log by executing dmesg. Or 'dmesg | less' if you want to scroll.
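
    If you want to post the log here, you can also dump it to a file first, for example (the path is just an example):

    dmesg > /tmp/dmesg.txt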

  • TomasMalina  Posts: 33  Freshman Member
    edited October 2023

    Oh, right. So here is the log from the last crash (till now). There are some "Bad eraseblock" entries, and some "EXT4-fs (sda1): unable to read superblock" entries (but sda1 is in my case the connected USB stick with the swapfile). Then there is a

    md/raid:md2: not clean -- starting background reconstruction

    but that may have been fixed by the e2fsck already.

    I might have encountered another issue, though: I noticed that my second logical volume (16TB) is still mounted (EDIT: this is during the "hanging shutdown"), and "umount /etc/zyxel" results in an error saying it is being used.

    Trying to find out why it is busy ("lsof | grep /etc/zyxel") shows that "dservice" and some backup job are actively using the volume.

  • Mijzelf  Posts: 2,638  Guru Member

    Well, that doesn't help much. If there was a resync restart, the log is completely silent about it. You can see that at 55.92 it detects the array, sees it needs a resync, and starts resyncing. At 141.43 the buzzer is started; that is the beep at the end of the boot process. Then, at 1092.06 (around 18 minutes) the syncing is stopped and resumed with a larger minimum guaranteed speed. But that is not a restart.
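
    If you want to keep an eye on the resync yourself, you can watch it from the shell; something like this should work (assuming the data array is md2):

    cat /proc/mdstat                           # progress, speed and estimated finish of the resync
    cat /proc/sys/dev/raid/speed_limit_min     # global minimum guaranteed resync speed, in KB/s
    cat /proc/sys/dev/raid/speed_limit_max     # and the maximum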

    There are a few things I don't understand. swapfile1 is added twice, once at 817.92 and once at 9130.36 (2.5 hours after boot). I suppose you added that manually, but there is no trace of the swapfile being removed in between.

    And what happened at 134040.36 (37 hours after boot)? It seems to remount the firmware partition, and the loop device on it. Did you somehow initiate that?

    About /etc/zyxel: that is the mountpoint of an internal flash partition containing user settings. When it can't be unmounted because processes are still busy, that might be the cause of the corrupted settings which sometimes show up in this forum. But it has nothing to do with your raid problem. BTW, you could try to kill those processes:

    killall -9 dservice
    killall -9 schedule_
    

    (The latter name is probably truncated. Use ps or top to see its real name), and try to unmount /etc/zyxel.
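
    For example (whether the 'w' flag works depends on which ps is installed):

    ps w | grep sched      # wide output should show the full process name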

    There are some "Bad eraseblock" entries

    That is normal. There is 256MiB raw flash in the box, and there are always a handful of bad eraseblocks. The driver handles that.

    some "EXT4-fs (sda1): unable to read superblock" (but sda1 is in my case the connected USB stick
    with the swapfile).

    I suppose there is no ext4 filesystem on that stick?

    Then there is a md/raid:md2: not clean -- starting background reconstruction but that may have been fixed by the e2fsck already.

    Nope. Filesystem and raid array are not related. The raid array just supplies a virtual harddisk, and somebody decided to put a filesystem on it (on a logical volume, in a logical volume group). The array knows nothing about its content; it just stores and reads sectors. You can have a perfectly healthy raid array with a badly destroyed filesystem on it. The other way around is harder, of course.
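
    You can look at the array itself, independent of the filesystem, from the command line. A quick sketch, assuming the data array is md2 and one of its members is sdb3 (device names are examples):

    cat /proc/mdstat              # state and resync progress of all arrays
    mdadm --detail /dev/md2       # per-member status of the data array
    mdadm --examine /dev/sdb3     # raid header of a single member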

  • TomasMalina  Posts: 33  Freshman Member

    swapfile1 is added twice, once at 817.92, and once at 9130.36

    I think I did that, yes. I'm almost positive I did remove it in between; I was just testing whether the set priority might somehow affect its usefulness (I turned it on with priority 0 instead of -1 the second time). Why the removal is missing from the log, I don't know.
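
    Roughly what I did, from memory (the path is just a placeholder):

    swapoff /path/to/swapfile1        # removed the old swap
    swapon -p 0 /path/to/swapfile1    # re-added it with priority 0 instead of -1
    cat /proc/swaps                   # shows active swap areas and their priority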

    And what happened at 134040.36 (37 hours after boot)?

    At this instant, I was checking the progress of the raid sync or the smart test (mdstat or smartctl, don't remember exactly which one) and I got disconnected from the NAS after sending the command. I thought the NAS had crashed, but I was able to reconnect via telnet immediately. It did all the other things (mounting partitions) on its own.

    When I check the info about available storage, I can still see my 16TB partition mounted (and I can access the data on it). I thought unmounting /etc/zyxel/ would help, but it didn't.

    The partition is mounted on /dev/mapper/<some_code>, but it cannot be unmounted due to being busy. Viewing lsof of /dev/mapper doesn't show anything. Since the 16TB partition also spans the damaged disk, would it be wise to run e2fsck on it as well? How would I go about (unmounting it and) running fsck?

    I suppose there is no ext4 filesystem on that stick?

    Well, the stick is formatted to ext4, but "df" shows the bulk of the space is on /dev/sda5, not sda1, so sda1 is probably just some other volume that was created automatically on the stick (smartctl shows sda1 is indeed on the USB stick).
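
    To see what is actually on the stick, I can list its partitions, for example:

    cat /proc/partitions | grep sda    # all sda partitions the kernel knows about
    fdisk -l /dev/sda                  # partition table of the stick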

  • Mijzelf  Posts: 2,638  Guru Member

    The partition is mounted on /dev/mapper/<some_code>, but it cannot be unmounted due to being busy. Viewing lsof of /dev/mapper doesn't show anything. Since the 16TB partition also spans the damaged disk, would it be wise to run e2fsck on it as well? How would I go about (unmounting it and) running fsck?

    No, the mountpoint is /i-data/<some_code>. The /dev/mapper/ thing is the device node. So you should filter your lsof output with /i-data/<some_code>.
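
    So something along the lines of:

    lsof | grep /i-data/
    fuser -m /i-data/<some_code>    # alternative, if fuser is available: lists PIDs using that mountpoint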

    BTW, I think either /dev/dm-2 is a symlink to a /dev/mapper/ file, or the other way around. I can't imagine that the same mechanism yields two different types of device node filenames.

    And running e2fsck on that volume won't hurt, but it won't solve your array problems either. A damaged filesystem can be the result of a (virtual) disk problem, but never the cause.
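
    If you still want to check the filesystem at some point, the rough sequence would be something like this (the names are placeholders, and the services keeping the volume busy have to be stopped first):

    umount /i-data/<some_code>                 # only succeeds once nothing uses the mountpoint
    e2fsck -f -C 0 /dev/mapper/<some_code>     # forced check with a progress indicator
    mount /dev/mapper/<some_code> /i-data/<some_code>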

    My analysis of your situation: for some reason disk 2 was dropped from the array, and manually or automatically marked as needing a resync. During a resync the raid manager calculates the content of the disk from the other members of the array, so for every sector written to disk 2, three sectors are read from disks 1, 3 and 4. That syncing never finishes, but starts over. Normally, when one of the source sectors cannot be read, or the target sector cannot be written, the corresponding disk is dropped from the array, leaving you with a degraded array when it was the target disk, or a downed array (only 2 members left) when it was a source disk. For some reason in your case no disk is dropped, but the syncing is restarted. I don't know why. So far the kernel log hasn't told us anything, and I think the kernel log is the only source of information, as the resync process is a kernel process.

    It cannot be a reboot. By design the progress is regularly written to the raid header, so after a break it can continue where it left off (or shortly before that). An RW error during sync results in a dropped disk, but I wonder if an RW error on the metadata could be handled differently. I don't know how often the raid header is updated; not too often, as the random seek time would kill you, so let's say every 10 seconds. What if updating the header fails? Maybe that doesn't cause a drop, but a restart. If the header is updated every 10 seconds, an update is a read-write cycle, and the resyncing is restarted every 10 hours, that means the header is read about 3600 times (10 hours / 10 seconds) before a restart. Your Raw_Read_Error_Rate is a number of 9 digits, Total_LBAs_Read is a number of 12 digits. So if Raw_Read_Error_Rate is a counter, roughly one out of 1000 reads results in an error. That is in the same ballpark as that 3600. I don't know if this math fits, as I don't know the unit of Raw_Read_Error_Rate. (And I don't know how severe Raw_Read_Error_Rate is. It could be something handled inside the disk, automatically corrected by a second read, so from outside you'd never notice it, and the raid manager cannot react to it.)

    I think you should make a backup of your precious files, if you haven't done so yet. As long as disk 2 is resyncing there is no redundancy, so a failing disk will kill your data.
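
    A minimal sketch of such a backup to an external USB disk, assuming the disk shows up under /e-data/<usb_id> (placeholder) and rsync is present on the box:

    rsync -a /i-data/<some_code>/ /e-data/<usb_id>/backup/       # copy the whole data volume
    cp -a /i-data/<some_code> /e-data/<usb_id>/backup/           # fallback if rsync is missing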

  • TomasMalina  Posts: 33  Freshman Member
    edited October 2023

    Sorry for the long pause. I let it run over the weekend and, after reaching about 80-ish % on the resync (the highest I've seen it get so far), it crashed. This time, when I looked at it after the reboot, it showed a degraded array; disk 2 was indeed dropped. The web interface however allowed me to rebuild it on the same disk - when I had a disk failure before, it didn't allow me to do that until I replaced the failed drive. So now I'm running a rebuild on the array. Suspiciously, it is about twice as fast as the resync before, so after 1 day I am almost halfway through.

    Anyway, here is the current dmesg log. At 199 I apparently started the rebuild, and at 14500 and 17500 it recorded some errors. Checking smartctl again revealed the disk had indeed encountered some errors since I last checked (before the weekend): Reallocated Sector Count went from 0 to 7184, Runtime Bad Block went from 780 to 797, Current Pending Sectors and Offline Uncorrectable both went from 0 to 8, but Reported Uncorrected Errors stayed the same. I guess the errors were indeed an indication of an imminent drive failure, and trying to correct them only sped up the failing. (Now at 93300 I remembered to eject the unused USB stick that used to contain the swapfile.)
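
    (For reference, I'm reading the attributes with something like the following; the device name depends on which slot the disk is in:)

    smartctl -A /dev/sdb         # attribute table (Reallocated_Sector_Ct, Current_Pending_Sector, ...)
    smartctl -l error /dev/sdb   # SMART error log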

    On another note, after the drive was dropped (even with the RAID rebuilding on the failing disk right now), the volumes report the correct used/available space again.

  • Mijzelf  Posts: 2,638  Guru Member
    Answer ✓

    Reallocated Sector Count went from 0 to 7184

    Apart from everything else, this is bad. The RSC may be non-zero even on a brand new disk, but a jump of a few thousand in a few days is a death announcement.

    In the webinterface it however allowed me to rebuild it on the same disk
    - when I had a disk failure before, it didn't allow me to do that until
    I replaced the failed drive.

    That only means that the first time the disk had actually died, while this time the disk was 'only' dropped because of a read/write failure during operation. It's a design decision to allow that disk to be added again, and to be fair, in some cases (Current Pending Sector errors) that is justified.
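
    For reference, the command-line equivalent of that re-add is something along these lines (the partition name is an example; the web interface does this for you):

    mdadm /dev/md2 --remove /dev/sdb3   # take out the failed member (usually already done automatically)
    mdadm /dev/md2 --add /dev/sdb3      # add it back, which starts the rebuild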

    So, now I'm running a rebuild on the array, suspiciously, it is about
    twice as fast as the resync before, so after 1 day I am almost half way
    through.

    Is that still the same disk? I highly recommend you exchange it.

    On another note, after the drive was dropped (even with the RAID rebuilding on the failing disk right now),
    now the volumes report the correct used/available space again.

    That means that the filesystem used to get invalid sectors from the bad disk. That shouldn't be possible, unless the disk is more unreliable than expected. Normally the filesystem reads a sector from the array, and the array passes that request to the right disk. The disk provides the sector. The disk has checksums to check whether the sector has the right data; if not, it reports that upstream, gets dropped, and the requested sector is calculated from the remaining raid members. Now apparently the bad disk provided corrupted sectors without even knowing it.

  • TomasMalina  Posts: 33  Freshman Member

    Thank you for all your precious help. I'm already waiting for a new drive, and will exchange it as soon as it arrives.
