NAS540 broken after power outage

Fever
Fever Posts: 7
edited April 2022 in Personal Cloud Storage
I had a power outage, and after this incident my NAS540 kept beeping.
I then tried to fix the NAS using Mijzelf's RescueSticks (thank you at this point for all the amazing work you are doing here). After that it didn't boot at all.
Debugging over a serial connection showed a "Bad header checksum".
I was able to fix that with TFTP and get the bootloader right again.
Now when I try to boot the NAS without HDDs, it seems to be working.
But if I boot with the drives in, it won't start fully.
The logs seem to indicate broken hard drives ("failed command: READ FPDMA QUEUED"), but it doesn't make sense to me that all 4 drives would fail at once and the system wouldn't even be able to start.
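
A quick way to tell whether a drive throwing "failed command: READ FPDMA QUEUED" is really dying, rather than suffering from a cable or backplane problem, is to read its SMART data. This is only a sketch and assumes the disk can be attached to a Linux machine with smartmontools installed (the stock NAS firmware does not necessarily ship smartctl); /dev/sdX stands in for whatever name the disk gets there:

~ # smartctl -H /dev/sdX   # overall health self-assessment
~ # smartctl -A /dev/sdX   # attributes: watch Reallocated_Sector_Ct, Current_Pending_Sector and UDMA_CRC_Error_Count

A rising UDMA_CRC_Error_Count with otherwise clean attributes points more towards cabling or the backplane than towards the disk itself.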




All Replies

  • Fever
    Fever Posts: 7
    More logs:
    This one is a normal boot without hard drives:
    https://pastebin.com/KDwccPvt

    This one with the rescue stick, without hard drives:
    https://pastebin.com/WQyKMjYp

    This one with the rescue stick, with hard drives:
    https://pastebin.com/wmwYfnSL


  • Mijzelf
    Mijzelf Posts: 2,607  Guru Member
    I agree that the odds of all disks dying together are low. In your log 'only' ata2 and ata4 show this problem. When you exchange disks 1 & 2 and disks 3 & 4, does the problem follow the disks? In that case two disks are indeed bad.
    If it doesn't follow the disks, does the box boot when you remove disks 2 & 4?
  • CaspersonC
    This works really well for us, thank you! Facing same issue here. Help is appreciated.

  • Fever
    Fever Posts: 7
    edited October 2022
    Mijzelf said:
    I agree that the odds of all disks dying together are low. In your log 'only' ata2 and ata4 show this problem. When you exchange disks 1 & 2 and disks 3 & 4, does the problem follow the disks? In that case two disks are indeed bad.
    If it doesn't follow the disks, does the box boot when you remove disks 2 & 4?

    Mijzelf, I am so sorry for not replying to your answer sooner. I let the summer get away from me and left the NAS sitting in the corner.
    Your observation seems to be on point. After swapping disks 1 & 2 and 3 & 4, I get the same errors, but now on ata1 and ata3. This is no good for my data, but I assume it means two of my disks broke at once. Very unlucky for me as I really wanted to have RAID5 to avoid this exact scenario.
    I attached the log in case you want to check again.

  • Mijzelf
    Mijzelf Posts: 2,607  Guru Member
    Answer ✓
    Very unlucky for me as I really wanted to have RAID5 to avoid this exact scenario.

    Does that mean you don't have a backup? The real purpose of raid5 is availability, but as you left the box for 6 months I suppose that was not the goal.

    Don't know what is wrong with those disks, but if you manage to make a bitwise copy of only one of them, using ddrescue or something like that, you might be able to save your data.

    I wrote more about ddrescue in this thread.
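
    For reference, a minimal ddrescue sketch for such a bitwise copy, assuming the failing disk shows up as /dev/sdb, a new disk of at least the same size as /dev/sde, and a map file stored somewhere that survives a reboot; all of these names are placeholders and must be checked against your own system first:

    ~ # ddrescue -f -n /dev/sdb /dev/sde rescue.map    # first pass: copy the easy areas, skip the problematic ones
    ~ # ddrescue -f -r3 /dev/sdb /dev/sde rescue.map   # second pass: retry the remaining bad areas up to 3 times

    The map file lets ddrescue resume and refine later runs; the source disk is only read, never written. Afterwards the copy can take the failing disk's place.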

  • Fever
    Fever Posts: 7
    Mijzelf said:
    Very unlucky for me as I really wanted to have RAID5 to avoid this exact scenario.

    Does that mean you don't have a backup? The real purpose of raid5 is availability, but as you left the box for 6 months I suppose that was not the goal.

    Don't know what is wrong with those disks, but if you manage to make a bitwise copy of only one of them, using ddrescue or something like that, you might be able to save your data.

    I wrote more about ddrescue in this thread.


    Yes, I thought that RAID5 would be backup enough for my use case. I did use the NAS a lot when it was working, but did not bother when it was not running.
    I would surely use it more if the firmware were better, for which you offer a lot of solutions.
    I will try to read more into the logs I got.
    The confusing part here for me is that the web interface will not load if I have HDD 1 in slot 1.
    If I put, for example, the order 2134 into the NAS, it boots but shows no drives etc., like in the post with ddsecure.

    And I ordered a new drive and will check if ddsecure can help me somehow.
    If you have any other tips or tests I can do, I would be so happy :-)
  • Mijzelf
    Mijzelf Posts: 2,607  Guru Member
    The confusing part here for me is that the web interface will not load if I have HDD 1 in slot 1. If I put, for example, the order 2134 into the NAS, it boots but shows no drives etc., like in the post with ddsecure.
    I think it has something to do with timing. The web interface runs from hard disk. At install time a firmware partition is created, onto which a compressed filesystem blob stored in flash is extracted. When no hard disk is available, a ramdisk is created instead and the blob is extracted onto that.
    When the broken disk is in slot 1, the firmware does not see quickly enough that there is a problem, and so doesn't create that ramdisk. When the broken disk is in slot 2, the RAID header on the disk in slot 1 already shows there is a problem, and so there is time for the ramdisk.
    Or something like that. The firmware is not really bulletproof when it comes to errors.
    BTW, it's ddrescue, and not ddsecure.
  • Fever
    Fever Posts: 7
    Mijzelf said:
    The confusing part here for me is that the web interface will not load if I have HDD 1 in slot 1. If I put, for example, the order 2134 into the NAS, it boots but shows no drives etc., like in the post with ddsecure.
    I think it has something to do with timing. The web interface runs from hard disk. At install time a firmware partition is created, onto which a compressed filesystem blob stored in flash is extracted. When no hard disk is available, a ramdisk is created instead and the blob is extracted onto that.
    When the broken disk is in slot 1, the firmware does not see quickly enough that there is a problem, and so doesn't create that ramdisk. When the broken disk is in slot 2, the RAID header on the disk in slot 1 already shows there is a problem, and so there is time for the ramdisk.
    Or something like that. The firmware is not really bulletproof when it comes to errors.
    BTW, it's ddrescue, and not ddsecure.
    If I understand that right, I need to ask you for the compiled ddrescue file?
    Can you PM me the link you mention in the other threads?
  • Fever
    Fever Posts: 7
    edited October 2022
    After reading a bit more, I also checked with mdadm:
    If I understand that correctly, my 4th drive seems to be broken but the rest is OK. I could try to recreate the RAID with /dev/sdd3 missing, right?


    ~ # mdadm --examine /dev/sd[abcd]3
    /dev/sda3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 6c9f75bc:413133d9:db34176b:a5f25aae
               Name : NAS540:2  (local to host NAS540)
      Creation Time : Sat Mar 12 15:10:28 2016
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 11712782336 (5585.09 GiB 5996.94 GB)
         Array Size : 17569172928 (16755.27 GiB 17990.83 GB)
      Used Dev Size : 11712781952 (5585.09 GiB 5996.94 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 9b3cc732:bf5ce9e1:985834fb:039acca0

        Update Time : Fri Oct 21 00:18:39 2022
           Checksum : 9ec3787e - correct
             Events : 31189

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 0
       Array State : A.A. ('A' == active, '.' == missing)
    /dev/sdb3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 6c9f75bc:413133d9:db34176b:a5f25aae
               Name : NAS540:2  (local to host NAS540)
      Creation Time : Sat Mar 12 15:10:28 2016
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 11712782336 (5585.09 GiB 5996.94 GB)
         Array Size : 17569172928 (16755.27 GiB 17990.83 GB)
      Used Dev Size : 11712781952 (5585.09 GiB 5996.94 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : 1a92a04e:9edf8c1b:5ca9cebe:9bb47bcb

        Update Time : Fri Oct 21 00:17:57 2022
           Checksum : e28cd5e4 - correct
             Events : 31188

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 1
       Array State : AAA. ('A' == active, '.' == missing)
    /dev/sdc3:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x0
         Array UUID : 6c9f75bc:413133d9:db34176b:a5f25aae
               Name : NAS540:2  (local to host NAS540)
      Creation Time : Sat Mar 12 15:10:28 2016
         Raid Level : raid5
       Raid Devices : 4

     Avail Dev Size : 11712782336 (5585.09 GiB 5996.94 GB)
         Array Size : 17569172928 (16755.27 GiB 17990.83 GB)
      Used Dev Size : 11712781952 (5585.09 GiB 5996.94 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
              State : clean
        Device UUID : b1c807f7:c9aa530f:644f2e15:f7e1fec0

        Update Time : Fri Oct 21 00:18:39 2022
           Checksum : ca9a915f - correct
             Events : 31189

             Layout : left-symmetric
         Chunk Size : 64K

       Device Role : Active device 2
       Array State : A.A. ('A' == active, '.' == missing)
    mdadm: No md superblock detected on /dev/sdd3.


    I tried to recreate the array but got:

    ~ # cat /proc/mdstat
    Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid5 sda3[0] sdc3[2] sdb3[1](F)
          17569172928 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/2] [U_U_]

    md1 : active raid1 sda2[0] sdc2[2] sdb2[1]
          1998784 blocks super 1.2 [4/3] [UUU_]

    md0 : active raid1 sda1[4] sdc1[2]
          1997760 blocks super 1.2 [4/2] [U_U_]

    unused devices: <none>
    ~ # mdadm --stop /dev/md2
    mdadm: Cannot get exclusive access to /dev/md2:Perhaps a running process, mounted filesystem or active volume group?
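
    About the "Cannot get exclusive access" error: something still holds md2 open. A plausible sequence before stopping the array, assuming (as the error message itself suggests) a mounted filesystem and/or an LVM volume group sits on top of it, would be the following; the mount point is made up and has to be taken from your own 'mount' output:

    ~ # mount                        # see what is mounted from md2 or from an LVM volume on top of it
    ~ # umount /mnt/myvolume         # hypothetical mount point, use whatever 'mount' shows
    ~ # vgchange -an                 # deactivate any LVM volume groups on top of md2
    ~ # mdadm --stop /dev/md2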


  • Mijzelf
    Mijzelf Posts: 2,607  Guru Member
    I PM'd you a link to a compatible ddrescue. About the re-creation of the array: I suppose you mean re-assembling? Your headers show a creation time in 2016, which would have changed if you had re-created them.
    As you can see, sdb3 shows an array state of AAA., while both sda3 and sdc3 show A.A. . So sda and sdc say that sdb was dropped. Sdb doesn't 'know' about that, because it wasn't updated after being dropped. Mdadm won't put sdb back in the array, because sda and sdc disagree.
    Maybe there is a way to force sdb back into the existing array, but I'm not aware of one. The only way I know is to create a new array around the existing payload.
    Having said that, sdb3 was updated last night, so it looks like it *was* back in the array at Fri Oct 21 00:17:57 2022 and was dropped again at Fri Oct 21 00:18:39 2022.
    How did you manage that?
    Anyway, if it was dropped within a minute, that disk has serious problems, and so it's not a good candidate for a healthy array. A copy is needed.
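
    If it comes to creating a new array around the existing payload (once the failing disk has been copied to a healthy one and the copy has taken sdb's place), a minimal sketch could look like the lines below. It assumes the parameters reported by --examine above (metadata 1.2, RAID5, 4 devices, 64K chunk, left-symmetric layout, data offset 262144 sectors, device order per the 'Device Role' fields) are reproduced exactly, and that the mdadm build accepts --data-offset (given here in sectors via the 's' suffix); --assume-clean keeps mdadm from starting a resync over the payload, and 'missing' stands in for the absent 4th member (the old sdd3):

    ~ # mdadm --stop /dev/md2
    ~ # mdadm --create /dev/md2 --assume-clean --metadata=1.2 --level=5 --raid-devices=4 \
          --chunk=64 --layout=left-symmetric --data-offset=262144s \
          /dev/sda3 /dev/sdb3 /dev/sdc3 missing

    Afterwards, check everything read-only (compare a fresh mdadm --examine against the old values, mount with -o ro) before anything is allowed to write to the array.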

