
Fixing tinyThread memory leak

Open dipterix opened this issue 3 years ago • 4 comments

Fixing https://github.com/RcppCore/RcppParallel/issues/185

Avoid setting mNotAThread to true when using pthreads, so that threads can still be join()ed or detach()ed. This allows their resources to be reclaimed when the threads terminate.

The leak was found on

  • OSX M2, R-4.2.1
  • Ubuntu 20, x64, R-4.2.0 (under container hosted by mac x86)
  • OSX x64, R-4.2.1 (Very small amount at a time)

The script has been tested on

  • Windows x64, R-4.2.1
  • OSX M2, R-4.2.1
  • OSX x64, R-4.2.1
  • Ubuntu 20, x64, R-4.2.0 (under container hosted by mac x86)

Valgrind was run with the latest rocker/r-devel-ubsan-clang under a Docker container (host: OSX x86):

R -d "valgrind --tool=memcheck --leak-check=full" tmp.R

tmp.R has been given in the issue https://github.com/RcppCore/RcppParallel/issues/185

The memory check result:

==24919== HEAP SUMMARY:
==24919==     in use at exit: 52,156,933 bytes in 9,980 blocks
==24919==   total heap usage: 29,002 allocs, 19,022 frees, 88,950,783 bytes allocated
==24919== 
==24919== LEAK SUMMARY:
==24919==    definitely lost: 0 bytes in 0 blocks
==24919==    indirectly lost: 0 bytes in 0 blocks
==24919==      possibly lost: 0 bytes in 0 blocks
==24919==    still reachable: 52,156,933 bytes in 9,980 blocks
==24919==                       of which reachable via heuristic:
==24919==                         newarray           : 4,264 bytes in 1 blocks
==24919==         suppressed: 0 bytes in 0 blocks
==24919== Reachable blocks (those to which a pointer was found) are not shown.
==24919== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==24919== 
==24919== For lists of detected and suppressed errors, rerun with: -s
==24919== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The still reachable blocks exist even if I do nothing at all, so I think it's fixed? Neither indirectly lost nor possibly lost was reported.

dipterix avatar Oct 06 '22 19:10 dipterix

Thanks -- I'll have to reserve some time to take a closer look.

We use tinythread in a couple other packages (e.g. reticulate) so if we take this change I'd like to propagate those changes to other packages as well.

kevinushey avatar Oct 07 '22 16:10 kevinushey

I have not had time to dig either but I would caution that we would probably want to see this in valgrind first.

(Only saying this because I helped an Rcpp user recently and, while I would have to go back and check the details, the issue (casually observed in simulation runs returning large result matrices repeatedly) turned out to be the creation of lots of string objects, which R (as far as I understand it) keeps in a different memory pool, so the usual copy-on-write recycling did not hold. Ensuring numeric-only results helped. So sometimes the 'leak' can be due to something else.)

eddelbuettel avatar Oct 07 '22 16:10 eddelbuettel

@eddelbuettel I understand your concern: RcppParallel has existed for a long time and lots of packages depend on it. It's better to be conservative when making fundamental changes.

dipterix avatar Oct 07 '22 17:10 dipterix

TL;DR:

For those who tracked this far: the memory leak in RcppParallel happens only on certain OSX and Linux systems. Windows users are not going to observe it. If you are using TBB, this memory leak may not happen either (I don't have TBB installed).

The leak happens even if the worker's operator() member function does nothing. The amount of memory leaked depends on the size of the thread-local space and on how many threads finish before they can be joined, which is small in most cases. However, the performance of the function degrades over time. For example, parallelMatrixSqrt from the RcppParallel examples becomes 10x slower after around 2000 runs.

However, as the package creators noted, the above observation may have just been a false positive due to potential bugs in my script. Please do your own research before using this PR.

dipterix avatar Oct 07 '22 17:10 dipterix

Sorry @dipterix; did you intend to close this?

kevinushey avatar Oct 17 '22 16:10 kevinushey

Oh sorry @kevinushey, I thought we were not going to fix it, so I closed it in case people got confused. Please feel free to reopen if necessary.

dipterix avatar Oct 17 '22 21:10 dipterix

Thanks! Sorry, I still need to carve out time to properly review this but I do intend to...

kevinushey avatar Oct 18 '22 18:10 kevinushey

Thanks for the insightful explanation.

If I understand correctly, the real issue is that the threads we're creating with tinythread are never being joined on close; is that correct?

This is kind of correct. The tinythread threads are always joined before closing. However, in some situations the inner pthreads are never joined nor detached on close, hence resources used inside the pthreads are not released. Please let me explain.

In the following code, mNotAThread could be set to 1 after L123 and before L126. To prove it, you can simply add a line between L126 and L127 to print out this flag. (If you cannot reproduce it, try a smaller dataset so that the operator() method is fast.)

https://github.com/RcppCore/RcppParallel/blob/1a2178117d8867b4040c9bfd788b393d602becc2/inst/include/RcppParallel/TinyThread.h#L120-L129

This is because when the parallel worker is very fast (faster than the pthread ramp-up overhead), some threads will have mNotAThread set to 1 before all threads have been scheduled; then

https://github.com/RcppCore/RcppParallel/blob/1a2178117d8867b4040c9bfd788b393d602becc2/inst/include/tthread/tinythread.h#L874-L876

will flip the mNotAThread flag whenever a thread finishes its job.

Because mNotAThread is set to 1, you can't simply call join() and assume the thread has been joined: in tinythread.h, mNotAThread=1 makes joinable() return 0, so join() does nothing at all.

https://github.com/RcppCore/RcppParallel/blob/1a2178117d8867b4040c9bfd788b393d602becc2/inst/include/tthread/tinythread.h#L919-L930

(Some comments)

Interestingly, at tinythread.h L876 I saw the comment // The thread is responsible for freeing the startup information. I guess the tinythread authors wanted the memory to be freed inside the pthreads themselves. Maybe I misunderstood, but that could be a bad idea, as memory resources created within the pthreads are never released (since they are neither join()ed nor detached).

In this article, I found the reason why the memory leak was ~10MB on my server: https://developer.ibm.com/tutorials/l-memory-leaks/

If you create a joinable thread but forget to join it, its resources or private memory are always kept in the process space and never reclaimed. Always join the joinable threads; by not joining them, you risk serious memory leaks.

For example, a thread on Red Hat Enterprise Linux 4 (RHEL4) needs a 10MB stack, which means at least 10MB is leaked if you haven't joined it.

dipterix avatar Oct 27 '22 01:10 dipterix

For reference, here's what I see on my Ubuntu 22.04 VM...

> # install.packages("microbenchmark")
> # remotes::install_github('dipterix/ravetools@2a46f7')
> 
> a = matrix(1:10,2)
> for (i in 1:10)
+ 	ravetools:::columnMedian(a, TRUE)
> 
> 
==14726== 
==14726== HEAP SUMMARY:
==14726==     in use at exit: 53,384,540 bytes in 11,128 blocks
==14726==   total heap usage: 30,448 allocs, 19,320 frees, 92,903,085 bytes allocated
==14726== 
==14726== 21,168 bytes in 63 blocks are possibly lost in loss record 1,078 of 1,555
==14726==    at 0x4869F34: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==14726==    by 0x4010F63: calloc (rtld-malloc.h:44)
==14726==    by 0x4010F63: allocate_dtv (dl-tls.c:375)
==14726==    by 0x4011963: _dl_allocate_tls (dl-tls.c:634)
==14726==    by 0x4DBE087: allocate_stack (allocatestack.c:430)
==14726==    by 0x4DBE087: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==14726==    by 0xDB4E9B7: thread (tinythread.h:901)
==14726==    by 0xDB4E9B7: TinyParallel::ttParallelFor(unsigned long, unsigned long, TinyParallel::Worker&, unsigned long) (TinyThread.h:124)
==14726==    by 0xDB50E07: parallelFor (TinyParallel.h:21)
==14726==    by 0xDB50E07: columnQuantile(SEXPREC*&, double const&, bool const&) (columnQuantile.cpp:158)
==14726==    by 0xDB511F3: columnMedian(SEXPREC*&, bool const&) (columnQuantile.cpp:193)
==14726==    by 0xDB440A7: _ravetools_columnMedian_try(SEXPREC*, SEXPREC*) (RcppExports.cpp:151)
==14726==    by 0xDB442AF: _ravetools_columnMedian (RcppExports.cpp:159)
==14726==    by 0x4984AD3: R_doDotCall (dotcode.c:601)
==14726==    by 0x49C495F: bcEval (eval.c:7673)
==14726==    by 0x49D6353: Rf_eval (eval.c:729)
==14726== 
==14726== LEAK SUMMARY:
==14726==    definitely lost: 0 bytes in 0 blocks
==14726==    indirectly lost: 0 bytes in 0 blocks
==14726==      possibly lost: 21,168 bytes in 63 blocks
==14726==    still reachable: 53,363,372 bytes in 11,065 blocks
==14726==                       of which reachable via heuristic:
==14726==                         newarray           : 4,264 bytes in 1 blocks
==14726==         suppressed: 0 bytes in 0 blocks
==14726== Reachable blocks (those to which a pointer was found) are not shown.
==14726== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==14726== 
==14726== For lists of detected and suppressed errors, rerun with: -s
==14726== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

I think I understand what's going on now. Basically, if you try to call join() on a thread that's already finished execution, we never get a chance to call pthread_join() (which would be responsible for cleaning up thread-local memory, I guess?) and so we run into this issue.

I think this implies we want to allow joining a thread even after it has finished execution -- that is, we should allow calls to pthread_join() even if the thread has finished running its associated code.

kevinushey avatar Oct 27 '22 05:10 kevinushey

An alternate fix was made in https://github.com/RcppCore/RcppParallel/pull/187. Thanks for pushing things forward @dipterix!

kevinushey avatar Nov 15 '22 19:11 kevinushey