NAS540 reboots when RAID resync would finish


All Replies

  • Mijzelf  Posts: 2,560  Guru Member
    restart after resync completes:

    Resync completed? That's in the logs? And after reboot it started to resync again?

    A filesystem error has basically nothing to do with the RAID array below it. The RAID manager doesn't care about the content of the array, so it shouldn't resync because of those filesystem errors.

    Normally the RAID headers are updated regularly during a resync, so that in case of a reboot it can continue where it was interrupted. And when done, that is also written to all 4 headers, so after a reboot it just finds a healthy array. The only reason I can think of why it starts over is that writing that header failed. But if it failed, it failed silently, as the RAID manager would stop immediately if an I/O error occurred during the interim writes.

    Those log lines are about smbd trying to access a faulty directory. Didn't those lines show up during the resync?
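    For reference, the position a running resync has reached is reported in /proc/mdstat. A minimal sketch of pulling the percentage out; the mdstat text below is a mocked snapshot for illustration, not output from this NAS (on the device you would read the real file, e.g. mdstat=$(cat /proc/mdstat)):

```shell
# Mocked /proc/mdstat snapshot standing in for the real file.
mdstat='md2 : active raid5 sda3[0] sdd3[4] sdc3[2] sdb3[1]
      [=>...................]  recovery =  7.5% (146944000/1949383680) finish=300min'
# Extract the recovery percentage from the status line.
progress=$(printf '%s\n' "$mdstat" | grep -o 'recovery = *[0-9.]*%' | grep -o '[0-9.]*%')
echo "$progress"
```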

  • Since the resync has restarted, I started collecting the logs again with PuTTY configured properly. I might catch something this time.

    Resync completed? That's in the logs? And after reboot it started to resync again?
    Yes, aside from dmesg output, I added /proc/mdstat to the output logfile as well. That showed me this:
    Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid5 sda3[0] sdd3[4] sdc3[2] sdb3[1]
          5848151040 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

    md1 : active raid1 sda2[0] sdd2[4] sdc2[2] sdb2[1]
          1998784 blocks super 1.2 [4/4] [UUUU]

    md0 : active raid1 sda1[0] sdd1[4] sdc1[2] sdb1[1]
          1997760 blocks super 1.2 [4/4] [UUUU]
    This was the last thing logged, NAS restarted afterwards.


  • As I was writing my previous post, the NAS restarted, and the resync was nowhere near completion.
    Attached is the log of this, although it contains nothing more than the logs before: only filesystem errors.
    Needless to say, the resync has started again from 0%.
  • Mijzelf
    Long shot. Maybe somehow the RAID header on the new disk contains some garbage, which isn't removed when the header is updated, but causes the rebuild to restart every time.

    Shut down the NAS, remove the 3 'good' disks, and power it on. Then execute at a command prompt

    su
    dd if=/dev/zero of=/dev/sda bs=1M count=8k

    That will overwrite the first 8GB of the remaining disk with zeros, which covers the partition table, the 2 system partitions (2GB each), and the beginning of the data partition, including the RAID header.

    Then run dmesg to see if it caused any I/O errors. If not, power down the NAS, reinsert the disks, power it on, and re-initiate a sync.
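    To double-check that the dd actually landed, the wiped region can be read back and compared against /dev/zero. A sketch, demonstrated on a throwaway file so it's safe to try anywhere; on the NAS the target would be /dev/sda (and count=8k):

```shell
# Demo target is a temp file; substitute /dev/sda (and count=8k) on the NAS.
target=$(mktemp)
dd if=/dev/zero of="$target" bs=1M count=8 2>/dev/null
# Read back the first MB and confirm it is all zeros.
if cmp -s -n $((1024*1024)) "$target" /dev/zero; then
    echo "first MB reads back as zeros"
fi
rm -f "$target"
```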
  • Thanks for the advice. Fortunately, to my surprise, the resync completed without rebooting. I rebooted the NAS myself to check whether the result is permanent, and the volume is not resyncing now. I am not sure what happened. The resync ran from 2023-01-10 22:56:49 to 2023-01-11 21:19:21, and I caught the log of completion:
    [81183.348897] md: md2: resync done.
    [81183.429394] RAID conf printout:
    [81183.429407]  --- level:5 rd:4 wd:4
    [81183.429414]  disk 0, o:1, dev:sda3
    [81183.429420]  disk 1, o:1, dev:sdb3
    [81183.429425]  disk 2, o:1, dev:sdc3
    [81183.429431]  disk 3, o:1, dev:sdd3
    The NAS was running after that until 2023-01-12 10:33:53, when I rebooted it myself. Now it looks okay.

    Can the filesystem errors be fixed?

    Also, I just noticed that some of the files - not all of them - seem to be damaged:

    I know this does not look good... Can this be fixed somehow? I think the data is still there. Is this related to the filesystem errors?

  • Mijzelf
    Answer ✓
    So the resync took almost 24 hours? That means it wrote around 23MB/sec to the new disk. A bit low.
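    That figure checks out: the new disk receives one member's share of the 5848151040 KB array (split over 3 data members) in the 80552 seconds between 22:56:49 and 21:19:21 the next day. A quick shell sanity check of the arithmetic:

```shell
# 5848151040 KB array / 3 data members = data written to the new disk;
# 80552 s = elapsed resync time; / 1024 converts KB/s to MB/s (integer math).
echo $(( 5848151040 / 3 / 80552 / 1024 ))   # prints 23
```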
    Can the filesystem errors be fixed?
    Possibly. One of the problems is that you can't repair a mounted filesystem. So it has to be unmounted first, and that is not easy. There is a work-around, by intercepting the shutdown. I wrote about that here. And sorry, this forum b0rkes up everything, so you'll have to remove the HTML tags yourself. You can omit the resize2fs; the goal is the e2fsck.
    But it's possible that repairing cannot be done, or not completely, depending on the damage.
    Is this related to the filesystem errors?
    Could be. Filesystem errors can cause all kinds of corruptions.
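    The repair step itself boils down to running e2fsck against the unmounted device (here that would be /dev/md2, reachable only via the shutdown intercept mentioned above). A minimal demonstration on a throwaway image file, assuming e2fsprogs is available:

```shell
# Stand-in ext2 image; on the NAS the target would be the unmounted /dev/md2.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=4 2>/dev/null
mkfs.ext2 -q -F "$img"                    # create a small filesystem for the demo
# -f forces a full check even if marked clean; -y auto-answers repair prompts.
e2fsck -f -y "$img" >/dev/null 2>&1 && echo "check passed"
rm -f "$img"
```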

  • Thank you for the guide Mijzelf! Now the filesystem errors are gone, the file browser no longer gives an HTTP 500 error, and the media server is also working fine.

    Unfortunately there was data loss. Is there a way to tell which files have been lost? 
  • Mijzelf
    There are 2 kinds of lost: corrupted and gone. Unless you have a list of checksums of the files, I wouldn't know how to find the corrupted files other than manually.
    Lost files are files (or directories) whose parent directory got lost; e2fsck has put them in lost+found in the root of the filesystem. So in your case that would be /i-data/sysvol/lost+found/. That directory is only accessible by root.
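    On the "list of checksums" point: if such a list existed from before the corruption (it usually doesn't), finding the damaged files would be mechanical. A throwaway demonstration with md5sum, using temp files rather than real data:

```shell
# Simulate: checksum two files, corrupt one, then verify against the list.
d=$(mktemp -d)
cd "$d"
echo data1 > a
echo data2 > b
md5sum a b > checksums.md5        # the hypothetical pre-crash checksum list
echo corrupted > b                # simulate corruption of one file
md5sum -c checksums.md5 2>/dev/null | grep FAILED
cd / && rm -rf "$d"
```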
  • I have 95113 items in lost+found. There are files and directories. How can I restore them? Guides online mention the 'file' command, but that is missing from the NAS.
  • Mijzelf
    Define 'restore'. The paths of those files/directories are lost, so it's impossible to put them back where they belong. I think all metadata (including names) of most files in lost+found is gone, but files in a subdirectory still have their metadata.
    Restoring in the meaning of making them accessible over samba is a simple move:

    mkdir /i-data/sysvol/admin/restored
    cd /i-data/sysvol/lost+found/
    mv *  /i-data/sysvol/admin/restored/

    followed by a chown

    chown admin /i-data/sysvol/admin/restored/*

    Maybe you'll get a command-line overflow error ('argument list too long') from mv. In that case you can move smaller chunks.

    mv a*  /i-data/sysvol/admin/restored/
    mv b*  /i-data/sysvol/admin/restored/

    Depending on the actual contents of lost+found.
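    An alternative to alphabetical chunks is to let find and xargs batch the moves, which sidesteps the argument-length limit entirely. A sketch on temp directories; on the NAS, src would be /i-data/sysvol/lost+found and dst /i-data/sysvol/admin/restored, and note that 'mv -t' is GNU coreutils, which busybox mv may not support:

```shell
# Temp dirs stand in for lost+found and the restored share.
src=$(mktemp -d)
dst=$(mktemp -d)
touch "$src/#12345" "$src/#12346"   # lost+found entries are typically named #<inode>
# -print0/-0 keeps odd filenames safe; xargs batches below the arg-length limit.
find "$src" -mindepth 1 -maxdepth 1 -print0 | xargs -0 -r mv -t "$dst"
ls "$dst" | wc -l                   # prints 2
rm -rf "$src" "$dst"
```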

    I suppose the guides using 'file' are trying to restore the extensions of files with generic names?
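    Since 'file' is missing on the NAS, a crude substitute is to check a file's leading magic bytes yourself. This sketch only knows three signatures and is an illustration, not a replacement for 'file':

```shell
# Identify a file type from its first bytes (magic number).
identify() {
    magic=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        ffd8ff*)           echo jpg ;;
        89504e470d0a1a0a*) echo png ;;
        25504446*)         echo pdf ;;
        *)                 echo unknown ;;
    esac
}
sample=$(mktemp)
printf '\211PNG\r\n\032\n' > "$sample"   # PNG signature (octal escapes)
identify "$sample"                       # prints png
rm -f "$sample"
```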
