Potential memory leak

Open Dinth opened this issue 2 years ago • 58 comments

Describe the bug: Hi. I might have found a memory leak, but apologies, I am not really able to provide more details. I have been dealing with an IT disaster at home, and my only access to the docker machine is via a Java client connected to iDRAC. Last night my pfSense died, together with the DHCP server.

I can only provide some screenshots, as my docker machine lost its DHCP lease and is offline. Screenshot from 2023-12-10 12-06-57

Neolink logs - hundreds of screens of this: image

Versions

NVR software: Frigate

Neolink software: (apologies, my docker machine is currently down, but it's the latest docker image as of 9/12/2023, maintained by watchtower)

Dinth avatar Dec 10 '23 12:12 Dinth

I'm running mine on Windows, and resorted to just using Task Scheduler to kill and restart the neolink service every 3 hours. Not sure the memory issues have been worked out, and unfortunately I haven't seen the maintainer around here for a few weeks.

bwthor avatar Dec 11 '23 14:12 bwthor

I'm running in an LXC container on a Proxmox server. I still have a steady memory leak that ends up crashing the container (4GB).

zeluisg avatar Jan 25 '24 18:01 zeluisg

I'm aware. I just don't have the time to deal with this at the moment. It will be a while before my other commitments clear up and I can get back to coding on this one.

I suspect this is happening on the gstreamer side of things, and I want to look into using a gstreamer buffer pool instead of constantly creating new buffers for each frame. But I can't get to it yet.

QuantumEntangledAndy avatar Jan 26 '24 02:01 QuantumEntangledAndy

I've posted this on the other issue too, but thought I'd post here in case you aren't subscribed there. I set up valgrind in some docker containers to test things:

Camera and Client Setup

  • E1
  • Substream
  • Connected over RTSP with ffmpeg for 27 minutes

Debian Bookworm gstreamer1.0-x (1.22.0-3+deb12u1)

Screenshot 2024-04-08 at 15 16 17

Debian Sid gstreamer1.0-x (1.24.1-1 and others)

Screenshot 2024-04-08 at 15 16 31

It seems to be stable at 4MB. Are there any more setup details that could help me find out what is causing this? How are you connecting, etc.?

QuantumEntangledAndy avatar Apr 08 '24 08:04 QuantumEntangledAndy

Hey. I have just restarted the neolink container and noticed that it's not working again and takes over 1GB of RAM. After the restart it was using 60MB of RAM but had already grown to 100MB after only a couple of minutes.

I've got one Argus 2 camera and I'm hooking up neolink to Frigate NVR.

image

Dinth avatar Apr 08 '24 08:04 Dinth

After around an hour: image

Dinth avatar Apr 08 '24 09:04 Dinth

Perhaps this is frigate creating many client connections. There was a similar issue elsewhere with the connections from frigate not closing fully and frigate just opening more and more connections. Can't remember what we did to fix that.

I might spin up a frigate docker to test this against.

QuantumEntangledAndy avatar Apr 08 '24 10:04 QuantumEntangledAndy

As another thing to try, I could write a modified Docker image that dumps the valgrind info. Maybe you could run it?

QuantumEntangledAndy avatar Apr 08 '24 10:04 QuantumEntangledAndy

I am happy to help with testing, but I would greatly appreciate it if you could upload it as a new branch (so I can just replace the image source in Portainer), as I'm on a course this week.

Dinth avatar Apr 08 '24 10:04 Dinth

image after 10h

Dinth avatar Apr 08 '24 18:04 Dinth

I can also easily recreate this problem. I currently have a 3G memory limit on the container, and the container gets killed roughly every 2-3 hours.

If you need any help to collect more information on this, I'm more than happy to help.

fleaz avatar Apr 08 '24 22:04 fleaz

@Dinth What is the architecture of your Portainer machine? Can I build just the x86_64 or am I going to need to go the extra mile and build arm as well?

QuantumEntangledAndy avatar Apr 09 '24 04:04 QuantumEntangledAndy

I'm on x86_64, many thanks!

Dinth avatar Apr 09 '24 05:04 Dinth

Ok so the docker will be here (in half an hour)

docker pull quantumentangledandy/neolink:test-valgrind

The bindings for the config and ports are the same as usual.

BUT there is an extra volume you should mount, /valgrind/; the valgrind output goes in there (specifically /valgrind/massif.out, but binding the whole dir is fine too).

Valgrind output is only created when the app exits normally, not when it is killed by docker stop (or other forceful stopping methods). To help with that, I have added a timeout of 30 mins, after which it will stop itself and write the file I need, which will be massif.out. If you really need to stop it early, send a SIGINT (not sure how Portainer can do that).

P.S. The docker image is still building here https://github.com/QuantumEntangledAndy/neolink/actions/runs/8611052761, please wait half an hour or so until it is ready before you pull it.
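For anyone following along from a plain docker CLI rather than Portainer, the volume binding and early stop described above might look roughly like the sketch below. The container name, the host paths and the 8554 port mapping are assumptions for illustration, not taken from this thread, and whether a SIGINT sent to the container actually reaches neolink depends on how the image's entrypoint forwards signals. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. Assumed (not from the thread): container name "neolink-valgrind",
# host paths under $PWD, and the 8554 RTSP port mapping.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Start the test image with the extra /valgrind volume bound to a host directory,
# so massif.out ends up in ./valgrind on the host.
start_neolink_valgrind() {
    run docker run -d --name neolink-valgrind \
        -p 8554:8554 \
        -v "$PWD/neolink.toml:/etc/neolink.toml" \
        -v "$PWD/valgrind:/valgrind" \
        quantumentangledandy/neolink:test-valgrind
}

# Ask the container to stop early with SIGINT instead of "docker stop".
stop_neolink_valgrind() {
    run docker kill --signal=SIGINT neolink-valgrind
}

start_neolink_valgrind
```

Calling stop_neolink_valgrind later sends the SIGINT; otherwise the built-in 30 minute timeout ends the run.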

QuantumEntangledAndy avatar Apr 09 '24 06:04 QuantumEntangledAndy

P.S. The docker build succeeded. Please run it and post the massif.out when you can.

QuantumEntangledAndy avatar Apr 09 '24 08:04 QuantumEntangledAndy

Here's the generated file: massif.out.zip

Dinth avatar Apr 09 '24 11:04 Dinth

Screenshot 2024-04-09 at 19 17 39

This is the memory profile of that massif.out
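If you want a quick number out of a massif.out without a viewer, a small sketch like the one below can pull the peak heap size from the file. It assumes the usual massif text format, where each snapshot carries a mem_heap_B=<bytes> line; the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: print the largest mem_heap_B value (in bytes) found in a massif output file.
# Assumes each snapshot has a line of the form "mem_heap_B=<bytes>".
massif_peak_heap_bytes() {
    awk -F= '$1 == "mem_heap_B" && $2 + 0 > max { max = $2 + 0 }
             END { printf "%.0f\n", max }' "$1"
}

# Example usage (path as used by the test image):
#   massif_peak_heap_bytes /valgrind/massif.out
```

For the full graph and allocation tree, ms_print massif.out (shipped with valgrind) renders the same file as text.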

QuantumEntangledAndy avatar Apr 09 '24 12:04 QuantumEntangledAndy

Hey. It's a really weird thing, but it seems that since I moved to the test-valgrind branch, my neolink RAM usage has stopped growing uncontrollably. I have restarted the container twice since, and it stops growing at 400MB (as shown by Portainer). I think I still need some more time to test that (since the RAM usage was not growing immediately but after some time).

Dinth avatar Apr 09 '24 13:04 Dinth

Ahh, that's because it sigterms:

2024-04-09T12:58:42.005180861Z ==9== 

2024-04-09T12:58:42.005248333Z ==9== Process terminating with default action of signal 15 (SIGTERM)

2024-04-09T12:58:42.005593091Z ==9==    at 0x558D719: syscall (syscall.S:38)

2024-04-09T12:58:42.005701720Z ==9==    by 0x1D40836: parking_lot_core::thread_parker::imp::ThreadParker::futex_wait (linux.rs:112)

2024-04-09T12:58:42.005734066Z ==9==    by 0x1D405E3: <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park (linux.rs:66)

2024-04-09T12:58:42.005793619Z ==9==    by 0x1D49C9D: parking_lot_core::parking_lot::park::{{closure}} (parking_lot.rs:635)

2024-04-09T12:58:42.005846198Z ==9==    by 0x1D4882E: with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#0}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#1}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#2}>> (parking_lot.rs:207)

2024-04-09T12:58:42.005869718Z ==9==    by 0x1D4882E: parking_lot_core::parking_lot::park (parking_lot.rs:600)

2024-04-09T12:58:42.005889943Z ==9==    by 0x1D4EFD8: parking_lot::condvar::Condvar::wait_until_internal (condvar.rs:333)

2024-04-09T12:58:42.005942304Z ==9==    by 0x1C3484D: parking_lot::condvar::Condvar::wait (condvar.rs:256)

2024-04-09T12:58:42.006016975Z ==9==    by 0x1CED5BB: tokio::loom::std::parking_lot::Condvar::wait (parking_lot.rs:149)

2024-04-09T12:58:42.006041703Z ==9==    by 0x1C636F4: tokio::runtime::park::Inner::park (park.rs:116)

2024-04-09T12:58:42.006061069Z ==9==    by 0x1C642D6: tokio::runtime::park::CachedParkThread::park::{{closure}} (park.rs:254)

2024-04-09T12:58:42.006077276Z ==9==    by 0x1C64475: tokio::runtime::park::CachedParkThread::with_current::{{closure}} (park.rs:268)

2024-04-09T12:58:42.006175963Z ==9==    by 0x1C21327: std::thread::local::LocalKey<T>::try_with (local.rs:286)

2024-04-09T12:58:42.006196923Z ==9==    by 0x1C64415: tokio::runtime::park::CachedParkThread::with_current (park.rs:268)

2024-04-09T12:58:42.006212884Z ==9==    by 0x1C6425D: tokio::runtime::park::CachedParkThread::park (park.rs:254)

2024-04-09T12:58:42.006268175Z ==9==    by 0x976218: tokio::runtime::park::CachedParkThread::block_on (park.rs:285)

2024-04-09T12:58:42.006436287Z ==9==    by 0xA9C604: tokio::runtime::context::blocking::BlockingRegionGuard::block_on (blocking.rs:66)

2024-04-09T12:58:42.006534954Z ==9==    by 0x397195: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}} (mod.rs:87)

2024-04-09T12:58:42.006628761Z ==9==    by 0xBD5663: tokio::runtime::context::runtime::enter_runtime (runtime.rs:65)

2024-04-09T12:58:42.006654879Z ==9==    by 0x397150: tokio::runtime::scheduler::multi_thread::MultiThread::block_on (mod.rs:86)

2024-04-09T12:58:42.006696927Z ==9==    by 0xA9FE94: tokio::runtime::runtime::Runtime::block_on (runtime.rs:351)

2024-04-09T12:58:42.006766288Z ==9==    by 0x909AAF: neolink::main (main.rs:85)

2024-04-09T12:58:42.006823569Z ==9==    by 0x98384A: core::ops::function::FnOnce::call_once (function.rs:250)

2024-04-09T12:58:42.006968811Z ==9==    by 0x64D28D: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:155)

2024-04-09T12:58:42.007054233Z ==9==    by 0x449190: std::rt::lang_start::{{closure}} (rt.rs:166)

2024-04-09T12:58:42.007226028Z ==9==    by 0x24E21F0: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284)

2024-04-09T12:58:42.007281819Z ==9==    by 0x24E21F0: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:554)

2024-04-09T12:58:42.007343138Z ==9==    by 0x24E21F0: try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:518)

2024-04-09T12:58:42.007402426Z ==9==    by 0x24E21F0: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:142)

2024-04-09T12:58:42.007460927Z ==9==    by 0x24E21F0: {closure#2} (rt.rs:148)

2024-04-09T12:58:42.007525706Z ==9==    by 0x24E21F0: do_call<std::rt::lang_start_internal::{closure_env#2}, isize> (panicking.rs:554)

2024-04-09T12:58:42.007574276Z ==9==    by 0x24E21F0: try<isize, std::rt::lang_start_internal::{closure_env#2}> (panicking.rs:518)

2024-04-09T12:58:42.007616162Z ==9==    by 0x24E21F0: catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> (panic.rs:142)

2024-04-09T12:58:42.007658013Z ==9==    by 0x24E21F0: std::rt::lang_start_internal (rt.rs:148)

2024-04-09T12:58:42.007672256Z ==9==    by 0x449169: std::rt::lang_start (rt.rs:165)

2024-04-09T12:58:42.007696468Z ==9==    by 0x909B5D: main (in /usr/local/bin/neolink)

2024-04-09T12:58:42.009966184Z ==9== 

2024-04-09T12:59:03.310544775Z ==9== Massif, a heap profiler

2024-04-09T12:59:03.310618694Z ==9== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote

2024-04-09T12:59:03.310633571Z ==9== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info

2024-04-09T12:59:03.310646373Z ==9== Command: /usr/local/bin/neolink rtsp --config /etc/neolink.toml

2024-04-09T12:59:03.310658339Z ==9== 

Dinth avatar Apr 09 '24 13:04 Dinth

That sigterm is expected. It's the way the timeout command signals the app to stop. Nothing wrong with it at all
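The behaviour is easy to see in isolation. The sketch below uses GNU coreutils timeout on a plain sleep: when the limit is hit, timeout sends SIGTERM to the command and itself exits with status 124. The commented wrapper line is only a guess at what the image's entrypoint could look like (binary path and arguments copied from the log above), not the actual script.

```shell
#!/bin/sh
# Sketch: GNU timeout sends SIGTERM by default once the limit is reached,
# and reports that case with exit status 124.
timeout 1 sleep 5
status=$?
echo "timeout exit status: $status"

# Hypothetical shape of the test image's wrapper (an assumption, not the real entrypoint):
#   timeout 30m valgrind --tool=massif --massif-out-file=/valgrind/massif.out \
#       /usr/local/bin/neolink rtsp --config /etc/neolink.toml
```

A longer run would only need a different duration argument, for example 2h instead of 30m.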

QuantumEntangledAndy avatar Apr 09 '24 14:04 QuantumEntangledAndy

Can you try pulling latest master branch of the docker? Perhaps something very recent fixed it.

QuantumEntangledAndy avatar Apr 09 '24 14:04 QuantumEntangledAndy

On the other hand, I can see that you're measuring the memory usage of the neolink process, but is it possible that:

  • Multiple neolink processes are being spawned?
  • Something else in the container is clogging up memory?

I will try to check it, but I have already updated to Docker 26 and something is broken with the exec command. Will get back to you on this.

Regarding the master branch, I have been on :latest with Watchtower automatically updating neolink.
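One way to check the "multiple processes" question from inside the container, once a shell is available, is to count processes by name via /proc. This is a minimal sketch that assumes a Linux /proc filesystem and a POSIX sh; the function name is invented for the example. Note that under valgrind the process name will be valgrind's tool binary rather than neolink.

```shell
#!/bin/sh
# Sketch: count running processes whose command name (from /proc/<pid>/comm)
# matches the given name exactly. Assumes a Linux /proc filesystem.
count_procs() {
    name="$1"
    count=0
    for f in /proc/[0-9]*/comm; do
        [ -r "$f" ] || continue
        # A process can exit between the glob and the read, so ignore read errors.
        if [ "$(cat "$f" 2>/dev/null)" = "$name" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

count_procs neolink
```

A result greater than 1 would suggest several neolink processes are alive at once.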

Dinth avatar Apr 09 '24 14:04 Dinth

Nope, it will be a single process. Even if children were spawned, valgrind would track them too.

There are two main changes here

  • ulimit -n 1024
  • using the debug build not the release

It's possible the release build breaks something with its code optimisations.

QuantumEntangledAndy avatar Apr 09 '24 15:04 QuantumEntangledAndy

I could try to run valgrind on the release build. It will just make figuring out what is wrong much harder without the debug symbols

QuantumEntangledAndy avatar Apr 09 '24 15:04 QuantumEntangledAndy

Anyways I'm off to sleep now so I'll do that tomorrow

QuantumEntangledAndy avatar Apr 09 '24 15:04 QuantumEntangledAndy

I've just been running the :latest release for 40 minutes and top shows: image

Looks like it's actually the neolink process using that memory.

smaps dump: https://paste.ubuntu.com/p/jd7W6rGDw2/
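To turn an smaps dump like this into a single figure, the Rss lines can be summed. The sketch below takes a file path, so it works both on a saved dump and on a live /proc/<pid>/smaps; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: sum every "Rss:" entry (reported in kB) in an smaps-format file
# and print the total resident size.
smaps_rss_kib() {
    awk '$1 == "Rss:" { sum += $2 } END { printf "%.0f\n", sum }' "$1"
}

# Example usage on a live process (replace <pid> with the neolink PID):
#   smaps_rss_kib /proc/<pid>/smaps
```

Comparing this total across a few readings taken some minutes apart gives a rough growth rate to set against what top reports.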

Dinth avatar Apr 09 '24 16:04 Dinth

I'm building the release version of the valgrind docker here https://github.com/QuantumEntangledAndy/neolink/actions/runs/8624926076, should be ready soon.

QuantumEntangledAndy avatar Apr 10 '24 02:04 QuantumEntangledAndy

Alright the valgrind docker is ready, I will test it too

QuantumEntangledAndy avatar Apr 10 '24 02:04 QuantumEntangledAndy

This is my memory profile on the release build: Screenshot 2024-04-10 at 10 24 49

Shame I can't replicate this

QuantumEntangledAndy avatar Apr 10 '24 03:04 QuantumEntangledAndy

Could you modify the Valgrind build to run for, let's say, 2 hours before sigterming?

Dinth avatar Apr 10 '24 05:04 Dinth