NAS542: HDD issues, but RAID status "healthy"
All Replies
-
(I did modify the config file so that it corresponds to the different hex code of the usb stick after formatting and reconnecting):
This is wrong. Swap space is handled transparently by the OS, and e2fsck is not even supposed to know whether it is using swap. I don't know what that cache is, but it can definitely not be handled by the same file.
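For reference, a minimal sketch of how swap is normally added on such a box, assuming the USB stick is mounted at /mnt/usb (the path and size are just examples); once it is enabled, the kernel hands it out to any process that needs it, e2fsck included:
dd if=/dev/zero of=/mnt/usb/swapfile bs=1M count=2048   # create a 2GB file (size is an example)
mkswap /mnt/usb/swapfile                                # write a swap signature into it
swapon /mnt/usb/swapfile                                # enable it; no e2fsck configuration needed
cat /proc/swaps                                         # verify it is listed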
If this doesn't work, would running a loop on the e2fsck maybe help run it all the way?
Don't know. AFAIK e2fsck creates a shadow node tree in memory, and when the filesystem is too big (too many files), it won't fit. Maybe it can repair some problems while building it, but some problems can only be found when the whole map is built (lost clusters, doubly used sectors, …).
Is this syntax correct,
Almost. You need a space between 100 and ], and there should be no spaces around the = in counter=0. This runs:
counter=0
while [ $counter -lt 100 ]
do
echo $counter
let counter=counter+1
done
But it can be smarter. The manpage of e2fsck says:
EXIT CODE
The exit code returned by e2fsck is the sum of the following conditions:
0 - No errors
1 - File system errors corrected
2 - File system errors corrected, system should be rebooted
4 - File system errors left uncorrected
8 - Operational error
16 - Usage or syntax error
32 - E2fsck canceled by user request
128 - Shared library error
So when all is done, e2fsck returns 0. That evaluates to true in shell script, so
counter=0
while [ $counter -lt 100 ]
do
if e2fsck -y /dev/dm-2
then
echo "Done!"
exit 0
fi
let counter=counter+1
done
runs no more times than needed.
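Not from the manpage, just a sketch building on the exit codes above: the same loop, but it also prints which conditions were set after every pass, so you can see whether the repeated runs are making progress (the device /dev/dm-2 is taken from the posts above):
counter=0
while [ $counter -lt 100 ]
do
e2fsck -y /dev/dm-2
rc=$?
# exit code 0 means a clean pass, so we are finished
if [ $rc -eq 0 ]
then
echo "Done!"
exit 0
fi
# decode the bits from the manpage table
[ $((rc & 1)) -ne 0 ] && echo "pass $counter: errors were corrected"
[ $((rc & 4)) -ne 0 ] && echo "pass $counter: errors left uncorrected"
[ $((rc & 8)) -ne 0 ] && echo "pass $counter: operational error (e.g. out of memory)"
let counter=counter+1
done
echo "gave up after $counter passes"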
as a superuser I suppose
That's called root in Linux. And yes, only root is allowed to run e2fsck.
0 -
Hm, so even though there is a swapfile for the system to use, it only uses a small portion of it and e2fsck still gives a Memory allocation error. (I've tried multiple combinations of 2-12GB files, either designated as swapfiles and/or linked in the e2fsck.conf file, none of which seems to have an effect on the e2fsck crashes so far, unfortunately.)
Any ideas, how to get it to run to completion? :/
I've copied the code into a .sh file, but when I try to run it, it says permission denied, so I'm unable to run it even partially, unfortunately.
Edit: Ah, it's 'su root file.sh'
0 -
Any ideas, how to get it to run to completion? :/
Unfortunately, no. If adding swap doesn't solve an out-of-memory error, then possibly it can't complete. I don't know, but if there exists a filesystem error which erroneously lets e2fsck think it needs a block of 100GB, then it will fail.
Edit: Ah, it's 'su root file.sh'
Don't know why that works. According to your screenshot you were already running as root. (The # prompt tells you that; an ordinary user has a $.) The error was that the script didn't have the executable flag set, and so you didn't have permission to execute it. The canonical solution is to either run it in sh, or to set the executable bit:
sh /tmp/e2loop.sh
or
chmod a+x /tmp/e2loop.sh
/tmp/e2loop.sh
I suppose su invokes the shell of the selected user to execute the given payload.
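As a side note, not from the screenshot, just a way to check: ls -l shows whether the executable bit is set on the script:
ls -l /tmp/e2loop.sh
# -rw-r--r-- means no x bit, so only "sh /tmp/e2loop.sh" works
# -rwxr-xr-x means the x bit is set, so "/tmp/e2loop.sh" works too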
0 -
Unfortunately, no. If adding swap doesn't solve an out-of-memory error, then possibly it can't complete.
Anyway, thank you for your help.
Is the box 32-bit, or would adding a swap larger than 4GB help? I already tested that adding more 2/4GB swap files doesn't help; it only seems to use one of them when running e2fsck.
Edit: Or.. would removing the (faulty) disk from the array, connecting it to another (64-bit) PC and running e2fsck work? Is there a way to identify which bay the disk is in?
0 -
Is the box 32-bit, or would adding a swap larger than 4GB help?
It's an Armv7 SoC, so that is 32 bits. That doesn't necessarily mean the processor cannot address more than 4GiB, but a single process cannot. So 4GB of swap together with 1GB of RAM should be sufficient to give e2fsck all the memory it can handle.
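A quick way to see whether the swap is actually available, and how close a running e2fsck gets to that 32-bit per-process limit (standard commands, assuming e2fsck is running at that moment):
cat /proc/swaps                         # every active swap area and how much of it is used
free                                    # total/used memory and swap
grep Vm /proc/$(pidof e2fsck)/status    # VmSize/VmRSS of the e2fsck process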
would removing the (faulty) disk from the array, connecting it to another (64-bit) PC and running e2fsck work?
No. The filesystem is on the array, not on the disk. You need at least 3 disks to bring up the filesystem. (And if you do so, the 4th disk is no longer part of the array.)
Is there a way to identify which bay the disk is in?
AFAIK the disks are sda, sdb, sdc and sdd from left to right, seen from the front. But if you stop the raid arrays in your telnet shell (mdadm -Ss, and check with 'cat /proc/mdstat' if it succeeded) you can simply pull the second disk, and check in the kernel log (dmesg) or the partition table (cat /proc/partitions) which one you pulled. If it's the wrong one, put it back before you boot the box again.
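Put together as one sequence (a sketch of the steps above; the disk names will differ on your box):
mdadm -Ss               # stop and disassemble the raid arrays
cat /proc/mdstat        # verify: no active arrays should be listed anymore
cat /proc/partitions    # note which sdX disks are present right now
# ... pull the disk you suspect from its bay ...
dmesg | tail            # the kernel log shows which sdX device just went away
cat /proc/partitions    # that sdX is now missing from the list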
0 -
Thank you for the info!
Also, is there a way to safely stop the raid resync?
Even when the NAS is mid-shutdown, it still tries to sync (and fails after several hrs/days when it reaches the faulty part of the second drive, crashing the NAS). Also, it eats up a lot of processing power that I believe could be used for the e2fsck.
I found that I could set a "speed limit", effectively pausing the sync, but I'm not sure if it is a good idea, or whether it might break something in the future.
echo 0 > /proc/sys/dev/raid/speed_limit_max
0 -
'mdadm -Ss' should stop and disassemble the raid arrays, and so also stop the sync process.
Your 'speed limit' has no future consequences. The /proc directory is a virtual directory, and the files in it do not really exist; they are a peek into kernel structures (just like /sys). So nothing is stored on disk or in flash.
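So the change only lives in kernel memory. A small illustration (the value 200000 is a common default, but check what your box reports first):
cat /proc/sys/dev/raid/speed_limit_max             # read the current limit
echo 0 > /proc/sys/dev/raid/speed_limit_max        # throttle the resync to (almost) nothing
echo 200000 > /proc/sys/dev/raid/speed_limit_max   # put it back by hand ...
# ... or simply reboot: nothing was written to disk or flash, so the default comes back anyway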
0 -
Thank you very much for all your help.
The e2fsck finally finished today after many repeated attempts (every run it corrected a few errors, slowly getting closer to the end; now e2fsck terminates shortly after starting, saying the filesystem is clean). Likely this was thanks to setting up the swap file.
SMART, however, still reports there had been (and likely still are?) errors. The short smartctl test didn't find any errors, though. I'm running the long test now, but I'm not sure whether it fixes any errors or just reports on them.
0 -
AFAIK SMART never 'fixes' things, it just reports. Most errors cannot be fixed anyway, as they are hardware problems. I know of two exceptions: Current_Pending_Sector, which counts sectors that have lost their magnetism to the point they cannot be read anymore. That can be fixed by rewriting them, although in most cases the original data is lost. And UDMA_CRC_Error_Count, which is in most cases the SATA cable, so exchanging or re-plugging it can help to stop the counter growing. You can't reset it.
Anyway, your latest error is the same one as last time, error 55 since new, and it was logged at 31491 power-on hours, while 4 days ago you were already at hour 31575. So in the last 7 days no new disk errors have occurred, while the disk has been running constantly for the last 4 days. I wouldn't worry about that. Just keep an eye on the disks to see if the values aren't changing fast.
That doesn't tell why the resyncing starts over and over again. Maybe the kernel log can be helpful, if you catch it shortly after the resync restarted.
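To keep that eye on the attributes and on the self-test you started, the usual smartctl invocations are (the device name is an example, use the disk in question):
smartctl -A /dev/sda            # the attribute table (Current_Pending_Sector, UDMA_CRC_Error_Count, ...)
smartctl -l error /dev/sda      # the SMART error log, with the power-on hour of each error
smartctl -l selftest /dev/sda   # results of the short/long self-tests
smartctl -t long /dev/sda       # start an extended (long) self-test in the background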
0 -
Ok, I'll keep an eye on it, thank you.
That doesn't tell why the resyncing starts over and over again. Maybe the kernel log can be helpful, if you catch it shortly after the resync restarted.
I've looked around, but I cannot find a kernel log on the NAS; I assume it first has to be generated somehow (/sbin/kernelcheck, maybe?). Running "find / -name kern*" doesn't seem to show anything log-like.
0