nscp icon indicating copy to clipboard operation
nscp copied to clipboard

Memory Leak on Versions >= NSClient++ 0.5.2.35 2018-01-28

Open infraweavers opened this issue 7 years ago • 11 comments
trafficstars

Issue and Steps to Reproduce

Upgrade from NSClient++ 0.5.1.31 2017-06-26 to NSClient++ 0.5.2.35 2018-01-28

Still exists in 0.5.2.41 2018-04-26

Expected Behavior

Memory usage should remain static

Actual Behavior

Memory usage grows over time roughly in correlation with number of checks Unresolved by upgrade to NSClient++ 0.5.2.39 2018-02-04

Details

  • OS and Version: Windows Server 2008, 2008r and 2012 r2
  • Checking from: Open Monitoring Distribution running naemon core and gearman
  • Checking with: check_nrpe, check_nt, check_nsc_web

Additional Details

NSClient++ log: Log is empty

Graphs

Memory leak is demonstrated in graphs here: https://imgur.com/a/APAYxAR

In the next week or so, I'll set-up NSClient++ on an additonal box and test different types of checks as to try and narrow down the scope of the issue

infraweavers avatar May 09 '18 11:05 infraweavers

Since posting, I realise we are using check_nrpe for a very small number of checks and not ubiquitously across all Windows hosts, so that's ruled out.

I've now got three test boxes now running to try and narrow down the scope of the problem.

  1. Agressively running 3 nsc web checks + check nt for nsclient version and memory usage on 1 minute interval
  2. Agressively running 3 check nt checks + check nt for nsclient version and memory usage on 1 minute interval
  3. Control solely running check nt for nsclient version and memory usage on 1 minute interval

I will share findings in due course.

infraweavers avatar May 09 '18 13:05 infraweavers

The results are in. It's fairly conclusively an issue in check_nsc_web

https://imgur.com/a/nc2vFHp

I'll see what I can do to help narrow down the problem further still.

infraweavers avatar May 10 '18 08:05 infraweavers

When you say an issue in check_nsc_web do you mean a problem when check_nsc_web is used? I.e. there's a memory leak in NSClient's REST API?

mintsoft avatar May 12 '18 15:05 mintsoft

@mintsoft yes, a problem when check_nsc_web with NSCClients's REST API.

Over the weekend we left two tests running, one invoking check_ok and the other invoking check_critical and the other invoking check_ok via ceck_nsc_web (and thus using NSClient's REST API)

Both resulted in NSClient leaking memory in the same manner as other checks.

infraweavers avatar May 14 '18 13:05 infraweavers

To confirm the nslient.ini when being tested was the stock configuration:

; alias_mem - Alias for alias_mem. To configure this item add a section called: /settings/external scripts/alias/alias_mem
alias_mem = checkMem MaxWarn=80% MaxCrit=90% ShowAll=long type=physical type=virtual type=paged type=page

the commands being used were:

Leaks memory: /omd/sites/default/lib/nagios/plugins/check_nsc_web -p password -u https://servername:8443 -k alias_mem

Doesn't leak memory: /omd/sites/default/lib/nagios/plugins/check_nt -H servername -p 12489 -v MEMUSE -w 80 -c 90

The memory leak in check_nsc_web is very slow, however is definitely present at a rate of about 360 bytes/minute on our production servers.

infraweavers avatar Aug 24 '18 09:08 infraweavers

Just to keep updated, we've rolled out 0.5.2.41 company wide to fix https://github.com/mickem/nscp/issues/550 we've had it on a few servers for nearly a year and you can see the background memory leak:

image

Our workflow is when NSClient hits 50MB we restart the service to workaround the problem; we only use check_nsc_web and check_nt against these servers and there's nothing in the nsclient.log

infraweavers avatar Oct 30 '18 08:10 infraweavers

I've started looking at this as I'm seeing the same behaviour as we convert our nrpe checks to use the REST api under 0.5.3.4

A quick run with valgrind --tool=memcheck --log-file="valgrind.log" --leak-check=full --show-leak-kinds=all -v ./nscp test

then running 50 times: ./check_nsc_web -k -p "pass" -u "https://localhost:8443" check_version

Shows:

==26571== 50 errors in context 1 of 31:
==26571== Thread 24:
==26571== Syscall param socketcall.sendto(msg) points to uninitialised byte(s)
==26571==    at 0x763495B: send (send.c:31)
==26571==    by 0xCA5F567: mg_broadcast (mongoose.c:3054)
==26571==    by 0xCA4F1B8: Mongoose::ServerImpl::request_reply_async(unsigned long, std::string) (ServerImpl.cpp:136)
==26571==    by 0xCA4F48F: Mongoose::request_job::run() (ServerImpl.cpp:291)
==26571==    by 0xCA50BB4: Mongoose::ServerImpl::request_thread_proc(int) (ServerImpl.cpp:90)
==26571==    by 0x4E2E56: operator() (function_template.hpp:767)
==26571==    by 0x4E2E56: void has_threads::runThread<boost::function<void ()> >(boost::function<void ()>, boost::thread*) (has-threads.hpp:122)
==26571==    by 0x4E2C46: operator() (mem_fn_template.hpp:280)
==26571==    by 0x4E2C46: operator()<boost::_mfi::mf2<void, has_threads, boost::function<void()>, boost::thread*>, boost::_bi::list0> (bind.hpp:392)
==26571==    by 0x4E2C46: operator() (bind_template.hpp:20)
==26571==    by 0x4E2C46: boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf2<void, has_threads, boost::function<void ()>, boost::thread*>, boost::_bi::list3<boost::_bi::value<has_threads*>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<boost::thread*> > > >::run() (thread.hpp:117)
==26571==    by 0x52C6A49: ??? (in /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0)
==26571==    by 0x762D183: start_thread (pthread_create.c:312)
==26571==    by 0x794103C: clone (clone.S:111)
==26571==  Address 0x184938e4 is on thread 24's stack
==26571==  in frame #1, created by mg_broadcast (mongoose.c:3038)

It would appear that each invocation into the REST API results in an error in that stacktrace

mintsoft avatar Nov 04 '18 19:11 mintsoft

+1 for this problem.

mokrates avatar Jul 06 '19 15:07 mokrates

Another +1 for this problem. With 79 Windows servers and I have to restart NSClient++ on 2 or 3 of them each week.

biscuitNinja avatar Oct 02 '19 10:10 biscuitNinja

@mickem is there any chance of getting this resolved? Seems pretty widespread!

mintsoft avatar Oct 05 '19 21:10 mintsoft

are there any news about this problem?

joschi99 avatar Mar 13 '22 15:03 joschi99