NAS542: HDD issues, but RAID status "healthy"

Mijzelf · October 2023

(I did modify the config file so that it corresponds to the different
hex code of the usb stick after formatting and reconnecting):

This is wrong. The swapspace is handled transparently by the OS, and e2fsck is not even supposed to know if it is using swapspace. Don't know what that cache is, but it can definitely not be handled by the same file.

If this doesn't work, would running a loop on the e2fsck maybe help run it all the way?

Don't know. AFAIK e2fsck creates a shadow node tree in memory, and when the filesystem is too big (too many files), it won't fit. Maybe it can repair some problems while building, but some problems can only be found when the whole map is built (lost clusters, doubly used sectors, …).

Is this syntax correct,

Almost. You need a space between 100 and ]. And there should be no spaces in the counter=0. This runs:

counter=0
while [ $counter -lt 100 ]
do
  echo $counter
  let counter=counter+1
done

But it can be smarter. The manpage of e2fsck says

EXIT CODE
    The exit code returned by e2fsck is the sum of the following conditions:
        0    - No errors
        1    - File system errors corrected
        2    - File system errors corrected, system should be rebooted
        4    - File system errors left uncorrected
        8    - Operational error
        16   - Usage or syntax error
        32   - E2fsck canceled by user request
        128  - Shared library error

So when all is done, e2fsck returns 0. In a shell script 0 evaluates to true and anything else to false, so

counter=0
while [ $counter -lt 100 ]
do
  if e2fsck -y /dev/dm-2
  then
    echo "Done!"
    exit 0
  fi
  let counter=counter+1
done

runs no more times than needed.
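Because the exit code is a sum of flags, a failed run can be decoded to see which conditions applied. The following is only a sketch: describe_e2fsck_status is a made-up helper name, and the short texts are paraphrased from the manpage table quoted above.

```shell
# Sketch: split an e2fsck exit status into the conditions from the manpage.
# describe_e2fsck_status is a hypothetical helper, not part of e2fsprogs.
describe_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # Each entry is "bit value:description"; test every bit in turn.
    for pair in "1:errors corrected" \
                "2:errors corrected, reboot needed" \
                "4:errors left uncorrected" \
                "8:operational error" \
                "16:usage or syntax error" \
                "32:canceled by user" \
                "128:shared library error"
    do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $((status & bit)) -ne 0 ]; then
            echo "$text"
        fi
    done
}
```

Inside the loop it could be called right after a failed e2fsck run, for example as describe_e2fsck_status $? on the line following the fi. A status of 12 would then print both "errors left uncorrected" and "operational error".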

as a superuser I suppose

That's called root in Linux. And yes, only root is allowed to run e2fsck.

TomasMalina · October 2023

Hm, so even though there is a swapfile for the system to use, it only uses a small portion of it and e2fsck still gives a Memory allocation error. (I've tried multiple combinations of 2-12GB files either designated as swapfiles and/or linked in the e2fsck.conf file, none of which seems to have an effect on e2fsck crashes so far, unfortunately.)

Any ideas, how to get it to run to completion? :/

I've copied the code into a .sh file, but when I try to run it, it says permission denied, so I'm unable to run it even partially, unfortunately.

Edit: Ah, it's 'su root file.sh'

Mijzelf · October 2023

Any ideas, how to get it to run to completion? :/

Unfortunately, no. If adding swap doesn't solve an out-of-memory error, then possibly it can't complete. I don't know, but if there exists a filesystem error which erroneously lets e2fsck think it needs a block of 100GB, then it will fail.

Edit: Ah, it's 'su root file.sh'

Don't know why that works. According to your screenshot you were already running as root. (The # prompt tells that. An ordinary user has a $). The error was that the script didn't have the executable flag set, and so you didn't have permission to execute it. The canonical solution is to either run it with sh, or to set the executable bit:

sh /tmp/e2loop.sh

or

chmod a+x /tmp/e2loop.sh
/tmp/e2loop.sh

I suppose su invokes the shell of the selected user to execute the given payload.

TomasMalina · October 2023

Unfortunately, no. If adding swap doesn't solve an out-of-memory error, then possibly it can't complete.

Anyway, thank you for your help.

Is the box 32-bit, or would adding a swap larger than 4GB help? I already tested that adding more 2/4GB swaps doesn't help, it can only use one when running e2fsck.

Edit: Or.. would removing the (faulty) disk from the array, connecting it to another (64-bit) PC and running e2fsck work? Is there a way to identify which bay the disk is in?

Mijzelf · October 2023

Is the box 32-bit, or would adding a swap larger than 4GB help?

It's an ARMv7 SoC, so it is 32-bit. That doesn't necessarily mean the processor cannot address more than 4GiB, but a single process cannot. So 4GB swap together with 1GB RAM should be sufficient to give e2fsck all the memory it can handle.
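To see how much swap is actually in use while e2fsck runs, the SwapTotal and SwapFree lines in /proc/meminfo can be compared from a second telnet session. A minimal sketch, assuming the usual meminfo layout with values in kB; swap_used_kb is a made-up name, and the optional file argument is only there so the parsing can be tried on a saved copy.

```shell
# Sketch: print the amount of swap currently in use, in kB.
# swap_used_kb is a hypothetical helper name.
# $1 (optional) = file in /proc/meminfo format, defaults to /proc/meminfo.
swap_used_kb() {
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}
```

If the number stays far below the size of the swapfile when e2fsck aborts, the limit being hit is probably not the total amount of memory plus swap.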

would removing the (faulty) disk from the array, connecting it to another (64-bit) PC and running e2fsck work?

No. The filesystem is on the array, not on the disk. You need at least 3 disks to bring up the filesystem. (And if you do so, the 4th disk is no longer part of the array.)

Is there a way to identify which bay the disk is in?

AFAIK the disks are sda, sdb, sdc and sdd from left to right, seen from the front. But if you stop the raid arrays in your telnet shell (mdadm -Ss, and check with 'cat /proc/mdstat' if it succeeded) you can simply pull the second disk, and check in the kernel log (dmesg) or the partition table (cat /proc/partitions) which one you pulled. If it's the wrong one, put it back before you boot the box again.
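The before/after comparison of /proc/partitions can be scripted. This is only a sketch: list_disks and missing_disks are made-up helper names, the /tmp file names in the usage note are arbitrary, and it assumes whole disks show up as sdX in the fourth column.

```shell
# Sketch: find out which disk disappeared after pulling one.
# list_disks and missing_disks are hypothetical helper names.

# Print whole-disk names (sda, sdb, ...) and skip partitions such as sda1.
# $1 (optional) = file in /proc/partitions format.
list_disks() {
    awk '$4 ~ /^sd[a-z]+$/ { print $4 }' "${1:-/proc/partitions}" | sort
}

# Print names listed in file $1 (before) that are absent from file $2 (after).
missing_disks() {
    grep -v -x -F -f "$2" "$1"
}
```

With the arrays stopped, run list_disks > /tmp/before.txt, pull the disk, wait a few seconds, run list_disks > /tmp/after.txt, and then missing_disks /tmp/before.txt /tmp/after.txt prints the name of the disk that was pulled.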

TomasMalina · October 2023

Thank you for the info!

Also, is there a way to safely stop the raid resync?

Even when the NAS is mid-shutdown, it still tries to sync (and fails after several hrs/days when it reaches the faulty part of the second drive, crashing the NAS). Also, it eats up a lot of processing power that I believe could be used for the e2fsck.

I found that I could set a "speed limit", effectively pausing the sync, but I'm not sure if it is a good idea or whether it might break something later.

echo 0 > /proc/sys/dev/raid/speed_limit_max

Mijzelf · October 2023

'mdadm -Ss' should stop and disassemble the raid arrays, and so also stop the sync process.

Your 'speed limit' has no future consequences. The /proc directory is a virtual directory, and none of the files in it really exist. They are a peek into kernel structures. (Just like /sys). So nothing is stored on disk or in flash.
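Since the setting only lives in the kernel, it is also easy to put back without a reboot. A minimal sketch, assuming the old value is kept in a file under /tmp; pause_resync, resume_resync and both variable names are made up for illustration.

```shell
# Sketch: pause the resync by writing 0 to the speed limit, and restore it later.
# Function and variable names are hypothetical; only the /proc path is from the thread.
RAID_LIMIT_FILE=${RAID_LIMIT_FILE:-/proc/sys/dev/raid/speed_limit_max}
SAVED_LIMIT_FILE=${SAVED_LIMIT_FILE:-/tmp/speed_limit_max.saved}

pause_resync() {
    # Save the current limit once, so a second call cannot overwrite it with 0.
    if [ ! -e "$SAVED_LIMIT_FILE" ]; then
        cat "$RAID_LIMIT_FILE" > "$SAVED_LIMIT_FILE"
    fi
    echo 0 > "$RAID_LIMIT_FILE"
}

resume_resync() {
    # Write the saved limit back and forget it; do nothing if nothing was saved.
    if [ -e "$SAVED_LIMIT_FILE" ]; then
        cat "$SAVED_LIMIT_FILE" > "$RAID_LIMIT_FILE"
        rm -f "$SAVED_LIMIT_FILE"
    fi
}
```

Run pause_resync before starting e2fsck and resume_resync afterwards. After a reboot the kernel starts with its default limit again, so a forgotten resume_resync does no lasting harm.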

TomasMalina · October 2023

Thank you very much for all your help.

The e2fsck finally finished today after many repeated attempts (every run it corrected a few errors, slowly reaching the end; now e2fsck terminates shortly after starting, saying "clean"). Likely, this was thanks to setting up the swap file.

SMART, however, still reports there had been (and likely still are?) errors. The short smartctl test didn't find any errors, though. I'm running the long test now, but I'm not sure whether it fixes any errors or just reports them.

Mijzelf · October 2023

AFAIK SMART never 'fixes' things, it just reports. Most errors cannot be fixed anyway, as they are hardware problems. I know of 2 exceptions. One is Current_Pending_Sector, which counts sectors that have lost their magnetization to the point where they cannot be read anymore. That can be fixed by rewriting them, although in most cases the original data is lost. The other is UDMA_CRC_Error_Count, which in most cases points to the SATA cable. Exchanging or re-plugging it can help to stop the counter from rising. You can't reset it.
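To keep an eye on just these two counters, their raw values can be picked out of the attribute table that 'smartctl -A' prints. A sketch under assumptions: smart_raw_value is a made-up helper name, and it relies on the usual table layout where the attribute name is the second column and the raw value the tenth.

```shell
# Sketch: print the raw value of one SMART attribute.
# smart_raw_value is a hypothetical helper name.
# $1 = attribute name, stdin = output of 'smartctl -A <device>'.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

For example, smartctl -A /dev/sdb | smart_raw_value Current_Pending_Sector, where /dev/sdb is only an assumed name for the second disk. Noting the two numbers every few days shows whether they are still rising.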

Anyway, your error is the same as last time, error 55 since new. And it was at 31491 hours, while 4 days ago you were already on hour 31575. So in the last 7 days no new disk errors have occurred, while the disk has been running constantly for the last 4 days. I wouldn't worry about that. Just keep an eye on the disks to see whether the values are changing fast.

That doesn't tell why the resyncing starts over and over again. Maybe the kernel log can be helpful, if you catch it shortly after it restarted.

TomasMalina · October 2023

Ok, I'll keep an eye on it, thank you.

That doesn't tell why the resyncing starts over and over again. Maybe the kernel log can be helpful, if you catch it shortly after it restarted.

I've looked around, but I cannot find a kernel log on the NAS, I assume it first has to be generated somehow (/sbin/kernelcheck maybe..?)? Running "find / -name kern*" doesn't seem to show anything log-like.
