NAS 542 - Firmware V5.21(ABAG.7) - There is a RAID Degraded

Hello,

my NAS 542 (Firmware V5.21(ABAG.7)) reported a hardware issue with the disk installed in slot 4, with the error message "There is a RAID Degraded."
Disk 4 was not visible, but it reappeared after restarting the NAS. The RAID was still degraded, however, and had to be rebuilt.

I also observed that disk 4 (all four disks are the same Toshiba 12 TB HDD model) always runs at a slightly higher temperature (54 vs. 44 °C).

The RAID is now being repaired.
The HDD in slot 4 seems to be OK: its SMART status is green, and I cannot identify any hardware issue with the disk itself.

My questions:
- Based on your experience, do you think the issue is related to the HDD, or to the NAS/OS/firmware itself?
- Do you recommend replacing the HDD in slot 4?
- Which SMART attributes should I monitor to check the health of the HDD in slot 4?
- Regarding temperature: is it acceptable that disk 4 is always warmer (about 10 °C) than the disks in slots 1-3?
- What do you think was the root cause of the RAID failure, and how can I prevent it in the future?

Any other advice is very welcome!
Thanks, -A

All Replies

  • Mijzelf
    Mijzelf Posts: 2,001  Guru Member
    A dropped array member is alas not strange, and doesn't necessarily mean the disk is bad. Actually, it's the RAID that is bad. A 'Current_Pending_Sector' error on a 'normal' filesystem is passed to userspace as an I/O error, and life goes on. When the disk is a member of a RAID array, the disk is dropped instead, even though the error could easily be repaired by rewriting that sector from the redundant data.
    Now the whole disk is overwritten, just because the RAID implementation failed to rewrite that single sector. (And worse, if another 'Current_Pending_Sector' pops up while repairing, the array will drop another member, and your array is down, with no easy way to get access to your data again.)
    Having said that, it's a bit worrying that 'the disk 4 was not visible but appeared back after the restart of the NAS'. A simple read error doesn't remove the disk from visibility, but I must admit that I don't know how the web interface collects its data. So it could be nothing.

    - Regarding temperature: is it acceptable that disk 4 is always warmer (about 10 °C) than the disks in slots 1-3?
    I would want to know if it's the disk, or its physical position. If it is the disk, there is something wrong with it. If it is the physical position, it is a bad design of the cabinet. Not really something to worry about: Toshiba specifies 60 °C as the maximum. In my NAS I reversed the fan to make it quieter, so it now blows into the box instead of sucking the air out. And I think it also cools better.
    If you want to know whether it's the disk, you can exchange disks 1 and 4 and see if the problem follows the disk. (Of course, wait until the repair is finished, and switch the box off first.) No worries about your array: the physical sequence is not important. Each member has a header which describes its role in the array.

    Make sure you have a backup. Don't think a RAID array is safe. It isn't; it will fail some day.

  • Hi Mijzelf,

    thanks a lot for coming back to me.

    The repair process went OK, and my RAID works again.

    Can you please elaborate on how to check for the 'Current_Pending_Sector' error, and what the root cause of it is?

    Regarding the temperature: I assume it is the location. Disk 4 is the closest to the CPU, so the temperature increases by 1-2 °C going from disk 1 to disk 4.

    It might be that my NAS542 experienced overheating (or that disk 4 went above 60 °C), as I recently changed the physical location of the NAS. This is only speculation, but I do not see any other reasonable explanation.

    I have now moved the NAS to a place with better airflow, hoping it will no longer overheat.

    By the way, what is the normal recommended fan RPM range? Is 1000 RPM OK?

    If you have any other advice, or any other parameter to check at the OS/NAS level, please let me know. If there is a way to read historical temperature values at the OS/root level, I am keen to learn that as well.

    Thanks, and very best regards, -A




  • Mijzelf
    Mijzelf Posts: 2,001  Guru Member
    Can you please elaborate on how to check for the 'Current_Pending_Sector' error, and what the root cause of it is?

    You can check it with SMART. Log in on the NAS over ssh (as root), and execute

    smartctl -a /dev/sda

    One of the attributes is Current_Pending_Sector, and its 'raw_value' is the number of pending sectors. (The other 3 disks are /dev/sdb, /dev/sdc and /dev/sdd.)

    The root cause is the enormous number of bits on a modern disk. Your disk is 12 TB, which is around 100 000 000 000 000 bits on 2 or 3 platters. That is a few square nanometers per bit, and so only a few molecules. Their magnetic orientation is volatile, and sometimes a molecule will lose its orientation. If enough molecules in a single bit lose their orientation, the bit can no longer be read. It's no longer a '1' or a '0', but something in between. The disk will try to re-read a few times (on the next read the head will not have exactly the same position, so that makes sense), and if that fails, an I/O error is sent downstream and the sector is marked 'Current_Pending_Sector'. There is nothing wrong with that sector, only the data is lost. On the next write to that sector, the 'Current_Pending_Sector' flag is cleared.
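    As a small sketch (the helper name is mine; it assumes root ssh access and that the disks really are /dev/sda through /dev/sdd, as above), the raw value can be pulled out of the smartctl output like this:

```shell
# Extract the Current_Pending_Sector raw value (last column) from
# `smartctl -a` output; a non-zero value means currently unreadable sectors.
pending_sectors() {
    awk '/Current_Pending_Sector/ { print $NF }'
}

# On the NAS itself (needs root and real disks, so commented out here):
# for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
#     echo "$dev: $(smartctl -a "$dev" | pending_sectors) pending"
# done
```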

    A common way to prevent this is periodic 'scrubbing': all sectors are read while that is still possible, and written again, re-orienting all the molecules. But AFAIK the NAS5xx doesn't support that.
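    As a hedged sketch, not NAS542 firmware functionality: the NAS5xx firmware is built on Linux software RAID (md), which does allow a manual scrub via sysfs. The array name `md2` is an assumption; check /proc/mdstat for the real one. Needs root over ssh.

```shell
# Trigger a manual scrub of an md array. 'check' reads every sector of every
# member; sectors that fail to read are rewritten from the redundant data,
# clearing pending sectors before they can drop a member during a rebuild.
scrub() {
    action="/sys/block/$1/md/sync_action"
    if [ -w "$action" ]; then
        echo check > "$action"
        cat /proc/mdstat    # shows scrub progress
    else
        echo "no writable $action (not root, or no such array)"
        return 1
    fi
}

# On the NAS (as root), e.g.:
# scrub md2
```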

    By the way, what is the normal recommended fan RPM range? Is 1000 RPM OK?

    1000 rpm is quite high. The range is ~300 to ~1400 rpm, so the regulation thinks your box is hot. I suppose 1000 rpm is not exactly quiet. Is the box always busy? BTW, some time ago I wrote a wiki page about the fan regulation.

    If there is a way to read historical temperature values at the OS/root level, I am keen to learn that as well.
    AFAIK the OS doesn't log the CPU temperature. But maybe the disk itself keeps a historical log.
    smartctl -x /dev/sda
    might tell. On a random disk here it gives (among a lot of other data) the all-time lowest and highest temperature, the lowest and highest temperature since the last power cycle, and the temperature over the last 128 minutes. Oh, and the current temperature. You might want to execute
    smartctl -x /dev/sda | less
    to be able to scroll up and down.
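    As a small convenience sketch (the helper name is mine, and exactly which temperature lines appear depends on the drive), you can also filter the long `-x` output down to just the temperature data:

```shell
# Keep only the temperature-related lines of the extended SMART output.
temps_only() {
    grep -iE 'temperature'
}

# On the NAS (needs root and a real disk):
# smartctl -x /dev/sda | temps_only
```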
  • aldapooh
    aldapooh Posts: 17
    edited November 2021
    Wow! Thank you a million times, Mijzelf, for the insights.

    I was able to capture the output of smartctl -a /dev/sda etc. for all disks 1-4:

    ID# ATTRIBUTE_NAME            SDA                 SDB                 SDC                 SDD
      1 Raw_Read_Error_Rate       0                   0                   0                   0
      2 Throughput_Performance    0                   0                   0                   0
      3 Spin_Up_Time              7103                7136                7055                6961
      4 Start_Stop_Count          532                 516                 480                 481
      5 Reallocated_Sector_Ct     0                   0                   0                   0
      7 Seek_Error_Rate           0                   0                   0                   0
      8 Seek_Time_Performance     0                   0                   0                   0
      9 Power_On_Hours            1429                1429                1429                1422
     10 Spin_Retry_Count          0                   0                   0                   0
     12 Power_Cycle_Count         74                  74                  74                  74
     23 Unknown_Attribute         0                   0                   0                   0
     24 Unknown_Attribute         0                   0                   0                   0
    191 G-Sense_Error_Rate        0                   0                   0                   1
    192 Power-Off_Retract_Count   9                   9                   9                   9
    193 Load_Cycle_Count          1056                1051                1029                1007
    194 Temperature_Celsius       25 (Min/Max 20/53)  25 (Min/Max 20/54)  26 (Min/Max 20/54)  27 (Min/Max 20/57)
    196 Reallocated_Event_Count   0                   0                   0                   0
    197 Current_Pending_Sector    0                   0                   0                   0
    198 Offline_Uncorrectable     0                   0                   0                   0
    199 UDMA_CRC_Error_Count      0                   0                   0                   0
    220 Disk_Shift                524288              524288              1835008             1835008
    222 Loaded_Hours              601                 590                 604                 591
    223 Load_Retry_Count          0                   0                   0                   0
    224 Load_Friction             0                   0                   0                   0
    226 Load-in_Time              592                 595                 590                 594
    240 Head_Flying_Hours         0                   0                   0                   0
    



    Is there anything I should be particularly worried about?

    Do you agree with my assumption that the failed RAID was probably due to overheating? And that if I can keep the temperature under better control, I should not worry much?
    Or do you recommend something else?

    PS: Sure, backups should be in place, but they are costly. :)

    Thanks, and best regards, -A
  • Mijzelf
    Mijzelf Posts: 2,001  Guru Member
    Your SMART values look fine. The detective in me asks why disk 4 has fewer power-on hours, but that is not significant.
    Do you agree with my assumption that the failed RAID was probably due to overheating? And if I can better control the temperature I should not worry much?
    Maybe my lack of feeling for the English language is playing tricks on me, but I think 'probably' is too strong; 'possibly' would be better. There is no evidence that the disk ever reached the maximum specified operating temperature, and even if it did, that would not immediately cause an error. It's just a limit within which the vendor guarantees the disk will perform in spec. Exceeding it will not immediately kill the disk, but it should be avoided.
    But of course a hot disk has a shorter life than a cool one, so heat can be a factor.

  • Mijzelf said:
    The detective in me asks why disk 4 has fewer power-on hours, but that is not significant.


    Thanks again for checking the SMART values.

    I assume the power_on_hours for disk 4 are lower because the disk was affected by something (possibly overheating) and failed, or was powered off, for several hours (about 7). My NAS starts automatically at 16:00, and the error appeared at around 22:57, when I logged in via the web interface:
    Nov  3 22:57:23  Bondarchuk  local1  alert  msg="There is a RAID Degraded." note="" cat="Storage"
    Nov  3 22:57:45  Bondarchuk  local1  info   msg="User admin has logged in from Web!" note="User: admin" cat="Login"
    

    That is almost 7 hours. When I got back home at 22:57, I heard my NAS beeping (constant beeping due to some error), and I logged in to the web admin panel. It seems that triggered the syslog event.

    As I wrote before, a plausible explanation is overheating, as I had moved my NAS into a covered area (not very clever of me) just a few hours/days before. That's how my internal detective explains it to me. :)

    Thank you for a very interesting analysis. It is very much appreciated.
