How do I recover a volume after the repair process silently failed?
All Replies
-
Maybe. Did you reboot already? You can check if the new headers describe the same array type, offset, and blocksize as before:

mdadm --examine /dev/sd[abcd]3

Further, don't trust the firmware. Have a look at the kernel's view of the array:

cat /proc/mdstat
-
Oops! My apologies, Mijzelf. Following the reboot, here are the mdadm outputs. I'm partially back up with the degraded volume.
~ # mdadm --examine /dev/sda3
/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
           Name : NAS540:2  (local to host NAS540)
  Creation Time : Tue Dec  1 11:31:20 2020
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
  Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 5b44e4bd:e37142b2:23d26a4d:cb281462
    Update Time : Wed Dec  2 13:18:45 2020
       Checksum : 61fc7575 - correct
         Events : 122
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 0
    Array State : A.AA ('A' == active, '.' == missing)
~ # mdadm --examine /dev/sdb3
/dev/sdb3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : ad82b6f7:6aacc5f3:c7a86a8b:25240df4
           Name : NAS540:2  (local to host NAS540)
  Creation Time : Thu Jul 27 14:12:32 2017
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
  Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 0a5c35b6:3bd8a182:5030b8be:51bbe238
    Update Time : Thu Oct 22 22:21:39 2020
       Checksum : 77ae1fd - correct
         Events : 47
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 1
    Array State : AAAA ('A' == active, '.' == missing)
~ # mdadm --examine /dev/sdc3
/dev/sdc3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
           Name : NAS540:2  (local to host NAS540)
  Creation Time : Tue Dec  1 11:31:20 2020
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
  Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : ebcb668f:f6c07008:9efd8d2f:ad7314ad
    Update Time : Wed Dec  2 13:18:45 2020
       Checksum : b6d0c276 - correct
         Events : 122
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 2
    Array State : A.AA ('A' == active, '.' == missing)
~ # mdadm --examine /dev/sdd3
/dev/sdd3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : fa2bac0d:b9adfa1a:a4dcc64b:fc7a555b
           Name : NAS540:2  (local to host NAS540)
  Creation Time : Tue Dec  1 11:31:20 2020
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 7805773824 (3722.08 GiB 3996.56 GB)
     Array Size : 11708660160 (11166.25 GiB 11989.67 GB)
  Used Dev Size : 7805773440 (3722.08 GiB 3996.56 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 460883bd:f423662f:25ca9304:9fe9a52e
    Update Time : Wed Dec  2 13:18:50 2020
       Checksum : 6279a450 - correct
         Events : 124
         Layout : left-symmetric
     Chunk Size : 64K
    Device Role : Active device 3
    Array State : A.AA ('A' == active, '.' == missing)
and

~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : inactive sdb3[1](S)
      3902886912 blocks super 1.2
md2 : active raid5 sda3[0] sdd3[3] sdc3[2]
      11708660160 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
md1 : active raid1 sda2[0] sdd2[3] sdc2[2] sdb2[4]
      1998784 blocks super 1.2 [4/4] [UUUU]
md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[4]
      1997760 blocks super 1.2 [4/4] [UUUU]
unused devices: <none>
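A quick way to see from the listings above why sdb3 ended up as an inactive spare is to compare the Events counters: sdb3 stopped at 47 (and its Update Time and Array UUID belong to the old array from October), while the other three members are at 122-124. A minimal sketch over the values as posted:

```shell
# Events counters copied from the --examine output above;
# a member far behind the others is stale and gets kicked out
printf '%s\n' 'sda3 122' 'sdb3 47' 'sdc3 122' 'sdd3 124' |
awk '{ print $1, ($2 >= 122 ? "current" : "stale") }'
```

This prints `sdb3 stale` while the other three members show as current, matching the `[U_UU]` state in /proc/mdstat.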
It looks like the missing drive is the one I'm still hoping to recover; it shows `Lost` (the amelia-1 share).
Anything left to try?
I really appreciate your effort.
-
The headers look good and the array is up, and seeing that 'Lost' changed to 'Disabled', I think you only have to enter the Shares menu to enable them. As far as the firmware knows, you put in 3 disks containing a new volume, so the 'old' shares are no longer available.
-
I've been able to get a decent amount off the drive once I enabled them. (It's very, very slow and still obviously beeping.) After this completes, are there any diagnostics I can run to fully assess each disk and make sure all are in working order before resetting it?
-
It's very very slow and still obviously beeping.
About beeping
buzzerc -s && mv /sbin/buzzerc /sbin/buzzerc.old
will stop the buzzer and remove the possibility for the firmware to start it again, until the next reboot.

About slow: it shouldn't be much slower than before. The array is degraded, which means one out of three blocks has to be recalculated from the parity, but the NAS can do that at 1 GB/sec, so it's hardly noticeable. The box should do 75-100 MB/sec for big files. (Which, of course, is still 37 hours for 10 TB.)
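The 37-hour figure follows directly from the quoted lower bound; a quick sanity check of the arithmetic, assuming 10 TB is roughly 10,000,000 MB:

```shell
# Time to move 10 TB at the degraded array's ~75 MB/s lower bound
secs=$(( 10 * 1000 * 1000 / 75 ))   # 10,000,000 MB / 75 MB/s = ~133,333 s
echo "about $(( secs / 3600 )) hours"
```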
After this completes, are there any diagnostics I can run to fully assess each disk

That's complicated. The SMART values of the disks can tell if the disks themselves 'feel healthy'.
It is possible that SMART says the disks are completely healthy, while a disk will still be dropped if you try to add a 4th disk. The reason is aging of the data. A modern hard disk has a very high data density. A single bit is a few square nm, and so only a few dozen magnetic atoms. Ideally those atoms are all oriented the same way, so a clear 0 or 1 can be read. But due to thermal noise, over time some atoms can lose their orientation, blurring the signal. At some moment it's no longer possible to tell if it's a 0 or a 1. Because of this the sector has some extra bits, to be able to restore a few unreadable bits, but sometimes that is not enough, and the sector is unreadable. The disk will try several times to read the sector, because the positioning of the head is not 100% reproducible, so a new read will pull in some other, maybe readable, atoms. Finally the disk will report an I/O error: the sector is not readable. The raid manager will drop the disk. But this disk is perfectly healthy. One sector is not readable, but you can simply write new data to it. If only you knew what to write.
The solution is to 'resilver' the disk, which means reading each sector and writing it back. This way all atoms are oriented again, and ready for years. (It is possible that the sector which caused your problem hasn't been written to since the factory. If you succeed in copying all data, you have proved the problem sector is not in use by the filesystem.) Modern filesystems like ZFS and Btrfs have built-in functions for this, but unfortunately the software raid used here hasn't, AFAIK.

For really unusable sectors the disk has a number of spare sectors, which can replace them. In the SMART values there is an entry for that: "Reallocated Sectors Count". The raw value is the number of replaced sectors; the percentage is the amount of spare sectors left.

To find out if a disk is still trustworthy, you should make a note of the raw value and percentage, overwrite the complete disk, and look if the values didn't change much. If not, all sectors are readable again (you just re-oriented all atoms) and there was not a significant number of hard-failing sectors. Unfortunately this will kill your data. I'm not aware of any way to resilver the data on the disk reliably. A naive way is

dd if=/dev/md2 of=/dev/md2 bs=16M

This will copy all data from md2 to md2 in blocks of 16M. That should resilver the whole surface, but unfortunately it will stop at the first read error. And it's dangerous to do while the filesystem is mounted, as you could overwrite pending changes with older data.

dd if=/dev/zero of=/dev/md2 bs=16M

will write zeros to md2. It will also overwrite pending changes, but as the filesystem is destroyed anyway, that doesn't matter. But make sure to read the SMART data before and after.
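For the before/after bookkeeping, the attribute appears in `smartctl -A` output as one row per attribute, where column 4 is the normalized (percentage-style) value and the last column is the raw count. A minimal parsing sketch over a made-up attribute line (on the NAS you would read the real one with e.g. `smartctl -A /dev/sda`):

```shell
# Hypothetical Reallocated_Sector_Ct row in the smartctl -A attribute table;
# the field layout matches smartctl, the values here are examples only
line='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0'
echo "$line" | awk '{print "normalized=" $4, "raw=" $NF}'
```

A raw count that stays near its pre-overwrite value (and a normalized value well above the threshold) after the full-disk write is the "didn't change much" signal described above.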