
Memory leak

dipterix opened this issue 2 years ago · 6 comments

Found previous closed issues (#81 #175).

I think tinythread is leaking as well. I don't have TBB installed, but running parallelFor leaks ~10MB per 800 threads (16KB/thread). Here is some information that might be useful:

1. Print out when threads are joined/detached

In function ttParallelFor, before threads[i]->join();, I added the following line to print the thread handle together with !joinable() (which mirrors the private mNotAThread flag):

std::cout << "mNotAThread: " << threads[i]->native_handle() << " " << !threads[i]->joinable() << "\n";

In tinythread.h, I added a print statement to the thread constructor:

  // Create the thread
#if defined(_TTHREAD_WIN32_)
  mHandle = (HANDLE) _beginthreadex(0, 0, wrapper_function, (void *) ti, 0, &mWin32ThreadID);
#elif defined(_TTHREAD_POSIX_)
  if(pthread_create(&mHandle, NULL, wrapper_function, (void *) ti) != 0) {
    mHandle = 0;
  } else {
    std::cout << "Threading " << mHandle << "\n";
  }

#endif

Here's the result:

Threading 0x16bf6f000
Threading 0x16bffb000
Threading 0x16c087000
Threading 0x16c113000
Threading 0x16c19f000
Threading 0x16c22b000
Threading 0x16c2b7000
Threading 0x16c343000
mNotAThread: 0x16bf6f000 1
mNotAThread: 0x16bffb000 1
mNotAThread: 0x16c087000 1
mNotAThread: 0x16c113000 1
mNotAThread: 0x16c19f000 1
mNotAThread: 0x16c22b000 1
mNotAThread: 0x16c2b7000 1
mNotAThread: 0x16c343000 1
[1]  205.41996 -302.72300  -30.53115 -408.31099 -227.63973  292.03780
[7]   48.60666  108.68428

Since these print statements execute before join(), and I could not find join() called anywhere else, I initially guessed the threads were never joined at all.

In fact join() is called explicitly, but mNotAThread is already set to true before join() runs. As a result joinable() returns false, so the thread is neither joined nor detached, hence the memory leak.

2. When is mNotAThread set to true?

There are 3 places:

  • When pthread_create errored out
  • When the thread is marked as detached
  • In thread::wrapper_function, after calling ti->mFunction(ti->mArg);

The first two are legitimate: if the thread does not exist, there is nothing to join or detach; if the thread is detached, its resources are reclaimed automatically when it ends. Neither causes a memory leak.

Therefore I think the culprit is the third one: as soon as mFunction returns inside wrapper_function, mNotAThread is set to true, before any join() gets a chance to run.

3. Trying to fix

Simply removing the two lines below from thread::wrapper_function does not work; the program terminates immediately:

  lock_guard<mutex> guard(ti->mThread->mDataMutex);
  ti->mThread->mNotAThread = true;

This is because, with mNotAThread left false, inline thread::~thread() sees a joinable thread and std::terminate() shuts the program down. The code had appeared fine before only because std::terminate() never had a chance to run (joinable() was always false).

To fix this issue, I replaced destructor with

inline thread::~thread()
{
  join();
  detach();
}

Since join() does not change the flag, I chose to detach the thread as well (not sure if this is OK; I have never programmed pthreads before...).

I tried calling only join() instead of both; surprisingly, the memory leak persists. Not sure if I caused that myself...

4. Potential issues


  • The destructor will no longer call std::terminate(). So if a thread hangs, I don't know what the behavior would be. However, the current code never executes std::terminate() either, so I guess it's fine?
  • I'm using macOS and haven't tested on Linux or Windows yet. Maybe the behavior differs on Windows (e.g. issues joining threads, or a thread detaching/terminating before the join).

I'm worried about stray threads... What should I do...

5. Other minor changes

To avoid spending a long time acquiring the lock, I changed joinable() and detach() to:

inline bool thread::joinable() const
{
  // mDataMutex.lock();
  bool result = !mNotAThread;
  // mDataMutex.unlock();
  return result;
}

inline void thread::detach()
{
  bool isDetachable = false;
  mDataMutex.lock();
  if(!mNotAThread)
  {
    mNotAThread = true;
    isDetachable = true;
  }
  mDataMutex.unlock();

  if(isDetachable)
  {
#if defined(_TTHREAD_WIN32_)
    CloseHandle(mHandle);
#elif defined(_TTHREAD_POSIX_)
    pthread_detach(mHandle);
#endif
  }
}

Not sure if this introduces side effects, or whether the following valgrind report makes sense... Please help : )

==24919== HEAP SUMMARY:
==24919==     in use at exit: 52,156,933 bytes in 9,980 blocks
==24919==   total heap usage: 29,002 allocs, 19,022 frees, 88,950,783 bytes allocated
==24919== 
==24919== LEAK SUMMARY:
==24919==    definitely lost: 0 bytes in 0 blocks
==24919==    indirectly lost: 0 bytes in 0 blocks
==24919==      possibly lost: 0 bytes in 0 blocks
==24919==    still reachable: 52,156,933 bytes in 9,980 blocks
==24919==                       of which reachable via heuristic:
==24919==                         newarray           : 4,264 bytes in 1 blocks
==24919==         suppressed: 0 bytes in 0 blocks
==24919== Reachable blocks (those to which a pointer was found) are not shown.
==24919== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==24919== 
==24919== For lists of detected and suppressed errors, rerun with: -s
==24919== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

dipterix avatar Oct 06 '22 13:10 dipterix

Minor update:

I can find memory leak with:

  • Mac M1/M2, R-ARM 4.2.1
  • Intel Ubuntu 20.04.4 LTS, R 4.2.0 (docker container, host is intel Mac pro cheese shredder V1)

I cannot detect memory leak on

  • Intel Mac, R-x64_86 4.2.1

The proposed version does not show any memory leak on any of the three platforms above (tinyThread only).

The way I tested:

  1. open RStudio,
  2. run my code (which calls a blank parallelFor internally) 100 times with microbenchmark,
  3. call gc(),
  4. check the memory size using RStudio's built-in memory report (see below),
  5. repeat steps 2-4 and check whether:
     • runtime is increasing
     • memory usage is increasing
     • gc() output remains stable (to make sure the leak is in C++, not in R)

(screenshot: RStudio memory usage report)

After several runs:

(screenshot: RStudio memory usage after several runs)

dipterix avatar Oct 06 '22 15:10 dipterix

Do you have a readily reproducible example? (Something I can run with sourceCpp() directly)

Using the example at https://gallery.rcpp.org/articles/parallel-matrix-transform/, I can also see memory usage slowly but steadily rise, but strangely the macOS Instruments leak detector doesn't report any leaks...

kevinushey avatar Oct 06 '22 16:10 kevinushey

FWIW on my Ubuntu 22.04 VM (running on M1 macOS) Valgrind doesn't report that kind of memory leakage; just a small amount of thread-local stack space.

==5426== 
==5426== HEAP SUMMARY:
==5426==     in use at exit: 63,869,118 bytes in 16,862 blocks
==5426==   total heap usage: 41,914 allocs, 25,052 frees, 107,103,228 bytes allocated
==5426== 
==5426== 672 bytes in 2 blocks are possibly lost in loss record 151 of 1,534
==5426==    at 0x4869F34: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==5426==    by 0x4010F63: calloc (rtld-malloc.h:44)
==5426==    by 0x4010F63: allocate_dtv (dl-tls.c:375)
==5426==    by 0x4011963: _dl_allocate_tls (dl-tls.c:634)
==5426==    by 0x4DBE087: allocate_stack (allocatestack.c:430)
==5426==    by 0x4DBE087: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==5426==    by 0xD772F2B: launch (thread_monitor.h:218)
==5426==    by 0xD772F2B: wake_or_launch (private_server.cpp:297)
==5426==    by 0xD772F2B: tbb::internal::rml::private_server::wake_some(int) (private_server.cpp:395)
==5426==    by 0x118188B3: spawn (task.h:1046)
==5426==    by 0x118188B3: offer_work (parallel_for.h:118)
==5426==    by 0x118188B3: execute<tbb::interface9::internal::start_for<tbb::blocked_range<long unsigned int>, RcppParallel::(anonymous namespace)::TBBWorker, const tbb::auto_partitioner>, tbb::blocked_range<long unsigned int> > (partitioner.h:249)
==5426==    by 0x118188B3: tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, RcppParallel::(anonymous namespace)::TBBWorker, tbb::auto_partitioner const>::execute() (parallel_for.h:139)
==5426==    by 0xD77E823: tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) (custom_scheduler.h:517)
==5426==    by 0xD77C64F: tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task*, tbb::task*&) (scheduler.cpp:699)
==5426==    by 0x1181855B: spawn_root_and_wait (task.h:779)
==5426==    by 0x1181855B: run (parallel_for.h:92)
==5426==    by 0x1181855B: parallel_for<tbb::blocked_range<long unsigned int>, RcppParallel::(anonymous namespace)::TBBWorker> (parallel_for.h:198)
==5426==    by 0x1181855B: operator() (TBB.h:77)
==5426==    by 0x1181855B: internal_run_and_wait<const RcppParallel::(anonymous namespace)::TBBParallelForExecutor> (task_group.h:104)
==5426==    by 0x1181855B: run_and_wait<RcppParallel::(anonymous namespace)::TBBParallelForExecutor> (task_group.h:209)
==5426==    by 0x1181855B: operator() (TBB.h:142)
==5426==    by 0x1181855B: tbb::interface7::internal::delegated_function<RcppParallel::(anonymous namespace)::TBBArenaParallelForExecutor, void>::operator()() const (task_arena.h:90)
==5426==    by 0xD77A22F: tbb::interface7::internal::task_arena_base::internal_execute(tbb::interface7::internal::delegate_base&) const (arena.cpp:943)
==5426==    by 0x1181B0AF: execute_impl<void, RcppParallel::(anonymous namespace)::TBBArenaParallelForExecutor> (task_arena.h:213)
==5426==    by 0x1181B0AF: execute<RcppParallel::(anonymous namespace)::TBBArenaParallelForExecutor> (task_arena.h:344)
==5426==    by 0x1181B0AF: RcppParallel::tbbParallelFor(unsigned long, unsigned long, RcppParallel::Worker&, unsigned long, int) (TBB.h:239)
==5426==    by 0x11819127: parallelFor (RcppParallel.h:42)
==5426==    by 0x11819127: parallelMatrixSqrt(Rcpp::Matrix<14, Rcpp::PreserveStorage>) (test.cpp:38)
==5426== 
==5426== 672 bytes in 2 blocks are possibly lost in loss record 152 of 1,534
==5426==    at 0x4869F34: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so)
==5426==    by 0x4010F63: calloc (rtld-malloc.h:44)
==5426==    by 0x4010F63: allocate_dtv (dl-tls.c:375)
==5426==    by 0x4011963: _dl_allocate_tls (dl-tls.c:634)
==5426==    by 0x4DBE087: allocate_stack (allocatestack.c:430)
==5426==    by 0x4DBE087: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==5426==    by 0xD772F2B: launch (thread_monitor.h:218)
==5426==    by 0xD772F2B: wake_or_launch (private_server.cpp:297)
==5426==    by 0xD772F2B: tbb::internal::rml::private_server::wake_some(int) (private_server.cpp:395)
==5426==    by 0xD7732CB: propagate_chain_reaction (private_server.cpp:157)
==5426==    by 0xD7732CB: tbb::internal::rml::private_worker::run() (private_server.cpp:257)
==5426==    by 0xD77336B: tbb::internal::rml::private_worker::thread_routine(void*) (private_server.cpp:219)
==5426==    by 0x4DBD5C7: start_thread (pthread_create.c:442)
==5426==    by 0x4E25D1B: thread_start (clone.S:79)
==5426== 
==5426== LEAK SUMMARY:
==5426==    definitely lost: 0 bytes in 0 blocks
==5426==    indirectly lost: 0 bytes in 0 blocks
==5426==      possibly lost: 1,344 bytes in 4 blocks
==5426==    still reachable: 63,867,774 bytes in 16,858 blocks
==5426==                       of which reachable via heuristic:
==5426==                         newarray           : 3,100,264 bytes in 3,001 blocks
==5426==         suppressed: 0 bytes in 0 blocks
==5426== Reachable blocks (those to which a pointer was found) are not shown.
==5426== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==5426== 
==5426== For lists of detected and suppressed errors, rerun with: -s
==5426== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

kevinushey avatar Oct 06 '22 16:10 kevinushey

My valgrind report is similar to yours (the pthread_create@... frames): basically, some pthreads are neither joined nor detached.

Just in case of confusion: the code uses the ravetools package on CRAN. This package does not use (nor need) the full RcppParallel, because linking to TBB always has weird issues; also, I (originally) thought the memory leak was caused by TBB, so I migrated the tinyThread part into my package.

The function to test is:

# install.packages("microbenchmark")
# remotes::install_github('dipterix/ravetools@2a46f7')
a = matrix(1:10,2)
microbenchmark::microbenchmark(ravetools:::columnMedian(a, TRUE))
  • This commit uses "as-is" tinythread from your package (https://github.com/dipterix/ravetools/tree/2a46f7de2915d1fdec5504061af6144ad7e0dec8)

  • The latest github version uses modified tinythread so you won't see memory leaks. If you want to reproduce the memory leak, please use the commit 2a46f7

  • Also, the newest modified version does not work well on Windows: occasionally R hangs (0.05% of the time, I guess?), so I think my version still has some flaws. The current RcppParallel, however, works well on Windows.

  • Installing ravetools requires fftw3; please try brew install fftw or brew install fftw3 before compiling from source.

Save the above code to tmp.R and run

R -d "valgrind --tool=memcheck --leak-check=full" -f tmp.R

dipterix avatar Oct 06 '22 17:10 dipterix

Using the example at gallery.rcpp.org/articles/parallel-matrix-transform, I can also see memory usage slowly but steadily rise, but strangely the macOS Instruments leak detector doesn't report any leaks...

This is also the case for me: the memory usage slowly increases. I only see a 3-10MB increase every 100-1000 runs.

I guess the reason hides behind those "small" memory leaks: pthread_create will not release a thread's resources if it is neither joined nor detached. Maybe the leak detector doesn't consider those resources "leaked", since you still have a chance to join even after the thread terminates (I guess). So when the memory leaks, the tool only reports the pthread_create allocation, not the thread resources.

(This is just my guess; I'm a newbie and might have said something stupid : )

dipterix avatar Oct 06 '22 18:10 dipterix

Alright, I rewound the changes on Windows, so now the patch should no longer leak memory with tinyThread. (TBB may still leak; I haven't touched that code yet.)

The pull request has been opened.

dipterix avatar Oct 06 '22 19:10 dipterix