NAS326: smbd RAM usage nearly 100% during copy, NAS becomes unresponsive

Daimonion666 Posts: 15  Freshman Member

Hello

Once a week I copy a large amount of data (15-30 GB) onto my NAS326, which has two 6 TB WD Red drives configured as JBOD.

Since the drives became nearly full (2 TB of free space left at the moment), I have noticed that copying via the Windows share takes ages.

The copy process starts normally, but after the first files have been transferred (each file is several GB in size) it gets stuck. I have to abort it and restart it several hours later, because the share is not reachable through Windows Explorer in the meantime.

I have now tried to dig into this problem and saw that the smbd process starts to allocate all the available RAM as soon as the copy begins. If the copy does not finish before smbd has used up all the RAM, the system becomes unresponsive. When this happens I have to wait hours before smbd releases its RAM and I can continue copying the data.

Is there anything I can do to optimize smbd RAM usage when copying big files?

Installed firmware version: V5.21(AAZF.12)


All Replies

  • Mijzelf
    Mijzelf Posts: 2,642  Guru Member

    I think you have a hardware problem somewhere. Looking at your top screenshot, I see the 'Load average' is sky-high: 45.8. This is roughly the mean number of threads that were waiting for CPU time (on Linux, also those blocked in uninterruptible sleep) over the last minute. When that number exceeds the number of CPUs, the system is in trouble. Further, I see that a lot of processes have the 'D' STAT. This is a sleep state in which the process is waiting for something, usually a disk action. And last, the CPU is spending 92.7% of its time waiting on I/O. I'm not sure, but I think that should be near zero, as most I/O tasks are offloaded to dedicated hardware.

    So I think there is a problem getting data from memory to disk. Have a look at the SMART status of your disks.
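
The rule of thumb above (load average compared against the CPU count) can be checked from an SSH session. This is only a minimal sketch: `load_status` is a hypothetical helper, and the CPU count of 1 in the example call is an assumed value, not a confirmed specification of the NAS326.

```shell
#!/bin/sh
# Hypothetical helper: compare a 1-minute load average with the CPU count.
# Usage: load_status LOAD NCPU
load_status() {
    # awk does the floating-point comparison that plain sh cannot do
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (load + 0 > ncpu + 0) print "overloaded"; else print "ok"
    }'
}

# Example with the value from the screenshot and an ASSUMED single CPU:
load_status 45.8 1

# On a live Linux system the inputs could be read like this:
#   load=$(cut -d' ' -f1 /proc/loadavg)
#   ncpu=$(grep -c '^processor' /proc/cpuinfo)
#   load_status "$load" "$ncpu"
# Processes currently in the D state can be counted from /proc:
#   grep -l '^State:[[:space:]]*D' /proc/[0-9]*/status 2>/dev/null | wc -l
```

The commented commands rely on the usual Linux /proc layout; whether the stock firmware exposes all of them in exactly this form is an assumption.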

  • Daimonion666
    Daimonion666 Posts: 15  Freshman Member

    I took this screenshot while the data was being copied. The SMART values seem to be OK.

  • Daimonion666
    Daimonion666 Posts: 15  Freshman Member

    Hello

    This is a screenshot from top while the system is idle.

    Even there I see smbd -D. The question is: why?

  • Mijzelf
    Mijzelf Posts: 2,642  Guru Member

    That is normal. Smbd is responsible for serving 'Windows shares', and it's sitting there, waiting for someone to connect. You can compare that to a Windows service, which is also always active.

    Maybe dmesg (the kernel log buffer) can give a clue about what is happening. Can you log in over ssh and capture the output of the command 'dmesg'? That has to be done after the problem has occurred at least once since the last reboot.
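
A long dmesg dump is easier to scan once it is filtered for storage-related messages. The sketch below shows one way to do that over SSH; the function name `storage_msgs` and its keyword list are assumptions chosen for illustration, not something taken from the firmware.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that mention the
# filesystem, flash, or disk layer. The keyword list is an assumption
# and may need extending for other kinds of faults.
storage_msgs() {
    grep -n -i -E 'ext4|nand|bad block|i/o error|ata[0-9]'
}

# Usage on the NAS (prints matching lines prefixed with their line number):
#   dmesg | storage_msgs
# To keep the complete log for later reference as well:
#   dmesg > /tmp/dmesg.txt
```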

  • Daimonion666
    Daimonion666 Posts: 15  Freshman Member
    edited April 2023

    Thanks for your answer. I rebooted this morning and copied some data. Afterwards I ran dmesg.

    Because I didn't want to cut anything out, I pasted the full output here: https://pastebin.com/Jxr5efhw

    I found the following lines interesting:

    Line 152:
    Bad block table found at page 131008, version 0x01
    Bad block table found at page 130944, version 0x01
    nand_read_bbt: bad block at 0x00000acc0000
    nand_read_bbt: bad block at 0x00000ea60000

    This could be a flash problem…
    What I find more interesting is the end of the output, from line 491 onward:

    EXT4-fs (dm-1): warning: mounting fs with errors, running e2fsck is recommended
    EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: usrquota,data=ordered,barrier=1
    EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
    Rounding down aligned max_sectors from 4294967295 to 4294967288
    …….
    EXT4-fs (dm-1): error count: 10
    EXT4-fs (dm-1): initial error at 1626038041: ext4_mb_generate_buddy:755
    EXT4-fs (dm-1): last error at 1676710927: ext4_mb_generate_buddy:755

    Should I run e2fsck? If so, which options should I use?

    Thanks for your help @Mijzelf

    Regards Daimonion
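
The numbers after `initial error at` and `last error at` in the log excerpt above are Unix timestamps (seconds since 1970-01-01 UTC). A small sketch for converting them follows; `epoch_to_utc` is a hypothetical helper name, and it assumes a `date` that accepts `-d @SECONDS`, as GNU coreutils and BusyBox do.

```shell
#!/bin/sh
# Convert a Unix epoch value (seconds) to a readable UTC timestamp.
# Assumes a date implementation that understands "-d @SECONDS".
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

epoch_to_utc 1626038041   # "initial error at" value from the log
epoch_to_utc 1676710927   # "last error at" value from the log
```

These two calls print 2021-07-11 21:14:01 UTC and 2023-02-18 09:02:07 UTC, so the recorded filesystem errors span roughly 19 months.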

  • Mijzelf
    Mijzelf Posts: 2,642  Guru Member

    As there are filesystem errors and no disk errors (according to SMART), it's a good idea to run e2fsck. Unfortunately that's not easy. E2fsck should only be run on an unmounted volume, and it's not easy to unmount it. But there is a workaround: injecting code into the shutdown script. I wrote about it here. Alas, you'll have to read around the HTML code, as the forum software has meanwhile borked up the code.

    The resize2fs step is not needed, in your case.

    You can use the -p argument (automatic repair, no questions) or -y (assume "yes" to all questions), followed by the device node. I think the device node is /dev/dm-1, but I could be wrong. You can look at the output of 'df -h' to see the device nodes of the mounted volumes. Of course you have to do that before unmounting.
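
The lookup described above can be sketched as follows. `device_for_mount` is a hypothetical helper, and it parses `df -P` rather than `df -h` because the POSIX format keeps each filesystem on a single line; the mount point in the usage comments is a placeholder, not the real path on the NAS.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print the device
# node mounted at the given mount point (mount points containing spaces
# are not handled).
# Usage: df -P | device_for_mount MOUNTPOINT
device_for_mount() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { print $1 }'
}

# Before unmounting (the mount point is a placeholder):
#   dev=$(df -P | device_for_mount /path/to/volume)
# After unmounting, one of:
#   e2fsck -n "$dev"   # read-only check, answers "no" to every question
#   e2fsck -p "$dev"   # automatic repair of problems that are safe to fix
#   e2fsck -y "$dev"   # answers "yes" to every question
```

Starting with the read-only `-n` run is a cautious choice, since it shows what e2fsck would complain about without changing anything on disk.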

  • Daimonion666
    Daimonion666 Posts: 15  Freshman Member

    Thanks.

    I will try this this afternoon and report back with the result.

    Regards Thomas

  • Daimonion666
    Daimonion666 Posts: 15  Freshman Member

    Short interim report

    e2fsck /dev/mapper/vg_cae2ed9e-lv_b3f36ee7

    has been running for several hours now, after e2fsck -p reported:

    "/dev/mapper/vg_cae2ed9e-lv_b3f36ee7: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. (i.e., without -a or -p options)"

    This check found several multiply-claimed blocks, and pass 1D ran for hours. "top" showed nearly 100% CPU usage during these hours but dropped to idle later on.

    Since there are only 3 files with multiply-claimed blocks, I'm thinking of deleting these files. Would that be a good idea?

    https://pastebin.com/nq23p9jw
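
Besides its on-screen messages, e2fsck reports its result through the exit status, which is a sum of the bit values listed in the e2fsck man page. The sketch below decodes it; `explain_fsck_status` is a hypothetical helper name. A `-p` run that stops with the "UNEXPECTED INCONSISTENCY" message above would normally be expected to set the "errors left uncorrected" bit.

```shell
#!/bin/sh
# Decode an e2fsck exit status. The status is a sum of the bit values
# documented in the e2fsck man page, so each bit is tested separately.
explain_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((status & 1)) -ne 0 ]   && echo "errors corrected"
    [ $((status & 2)) -ne 0 ]   && echo "errors corrected, reboot recommended"
    [ $((status & 4)) -ne 0 ]   && echo "errors left uncorrected"
    [ $((status & 8)) -ne 0 ]   && echo "operational error"
    [ $((status & 16)) -ne 0 ]  && echo "usage or syntax error"
    [ $((status & 32)) -ne 0 ]  && echo "canceled by user request"
    [ $((status & 128)) -ne 0 ] && echo "shared library error"
    return 0
}

# Usage right after a check ("$dev" stands for the unmounted device node):
#   e2fsck -p "$dev"; explain_fsck_status $?
```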

  • Mijzelf
    Mijzelf Posts: 2,642  Guru Member

    Since there are only 3 files with multiply-claimed blocks, I'm thinking of deleting these files. Would that be a good idea?

    That is no longer the case. According to your log the 'multiply-claimed blocks' have been cloned, so each file now has its own blocks. Deleting does no harm, but according to the cluster enumeration in your log 2 files might be OK. It won't hurt to see if they are still usable.

    Are you sure pass 1D took that time, and not 1C? Unless the names are poorly chosen, I'd expect pass 1D to consist mainly of asking the owner whether the shared blocks should be cloned or the affected files deleted, as all relevant data was already gathered in pass 1C.

  • Daimonion666
    Daimonion666 Posts: 15  Freshman Member
    edited April 2023

    Okay, after running e2fsck -y three times, I decided to run e2fsck without -y and let it delete the 3 files (as these are backup files, I can copy them again). So I pressed n at the step where it asked whether it should clone the files, and after that e2fsck was able to repair the filesystem completely. When I ran e2fsck again afterwards, it reported a clean filesystem.

    So, thank you for that.

    Back to the main problem.
    After rebooting the NAS, I tried to copy some files, and the problem still exists.
    smbd's RAM usage climbs to 100%, and then the load average rises as more and more processes pile up. Finally the whole system gets stuck, and it takes some time until it becomes responsive again.

    Even when the copy task finishes before RAM usage gets that high, it takes a long time for the RAM to be released again (see screenshots).
    The SMART values are okay so far. Maybe the drives are slow because they are 81% full (8.8 TB out of 10.8 TB), but why is the RAM not freed after the copy task has finished?

    I could live with slow copies and keep an eye on RAM usage, but after copying 10 GB of data I don't want to wait hours before I can copy the next 10 GB.

    Regards Daimonion

    Edit:
    Hours later, with the NAS idle, smbd still takes up a huge amount of RAM:
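
The memory figures discussed in this post can also be captured as text over SSH, which makes it easier to compare what smbd itself holds with what the kernel reports as cached or waiting to be written. This is a rough sketch: `meminfo_kb` and `rss_kb` are hypothetical helpers, and the usage lines assume that `pidof` and the standard Linux /proc files are available on the NAS.

```shell
#!/bin/sh
# Hypothetical helper: print the value in kB of one /proc/meminfo field.
# Usage: meminfo_kb FIELD < /proc/meminfo
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# Hypothetical helper: print the resident set size (VmRSS) in kB from a
# /proc/PID/status file.
# Usage: rss_kb < /proc/PID/status
rss_kb() {
    awk '$1 == "VmRSS:" { print $2 }'
}

# On the NAS, during and after a copy:
#   for f in MemFree Cached Dirty Writeback; do
#       echo "$f: $(meminfo_kb "$f" < /proc/meminfo) kB"
#   done
#   for pid in $(pidof smbd); do
#       echo "smbd $pid: $(rss_kb < /proc/$pid/status) kB resident"
#   done
```

Logging these values every few seconds during a copy would show whether the growth sits in smbd's resident memory or in the `Dirty` and `Writeback` counters, which track data still waiting to reach the disks.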
