
Thread leak with FFCV x tqdm

Open ed1d1a8d opened this issue 2 years ago • 1 comments

I used FFCV with tqdm for a long-running job and noticed that it crashed from using too many threads (~4000 threads had built up linearly over 30 hours, and the OS eventually stepped in and forbade my process from creating any new threads).


Upon further investigation, I found that when FFCV is used with tqdm, there seems to be a thread-leak (i.e. new threads are created that never get deleted).

Here's a reproduction of the issue: https://gist.github.com/ed1d1a8d/424e5bc83325c93037cfe2de9e457a68

I'm curious if this is an issue with FFCV or an issue with tqdm, and if it is a known problem.

TL;DR It seems like the following ways of using FFCV with tqdm are broken:

# This has a thread leak
for _ in tqdm(loader):
    pass

# This also has a thread leak
with tqdm(loader) as pbar:
    for _ in pbar:
        pass

but the following methods are OK:

# Without tqdm, there is no thread leak!
for _ in loader:
    pass

# Manual tqdm is also okay!
with tqdm(total=len(loader)) as pbar:
    for _ in loader:
        pbar.update(1)
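
To compare the four patterns above, one option is to record the live thread count after each full pass over the loader. The following is only a minimal sketch using the standard library: `thread_counts` and `make_iterable` are hypothetical names, and `range(100)` is a stand-in for something like `lambda: tqdm(loader)` or `lambda: loader`.

```python
import threading


def thread_counts(make_iterable, epochs=5):
    """Fully iterate `make_iterable()` `epochs` times.

    Returns the number of live threads observed after each pass.
    A count that keeps growing from pass to pass suggests a leak.
    """
    counts = []
    for _ in range(epochs):
        for _ in make_iterable():
            pass
        counts.append(threading.active_count())
    return counts


# Stand-in iterable; swap in e.g. `lambda: tqdm(loader)` to test a real setup.
print(thread_counts(lambda: range(100)))
```

If the counts stay flat for the plain loop and grow for the `tqdm(loader)` variants, that matches the behaviour described above.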

ed1d1a8d avatar Jul 14 '22 21:07 ed1d1a8d

The only explanation I can find is that tqdm somehow keeps a reference to the iterator, which in FFCV also extends Thread. With manual tqdm you never hand the iterator to tqdm, so it cannot hold on to it, and the problem does not appear. It must therefore be a problem with tqdm. What happens if you call .join() on the iterator after the iteration: does it block forever? If it doesn't block, the thread has completed and tqdm is just keeping a reference to it for some odd reason. You can also inspect the garbage collector to see what is actually holding a reference to the object and blocking its collection.
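
Both checks can be sketched with the standard library alone. This is an illustration, not FFCV code: `StandInIterator`, `diagnose`, and `holder` are hypothetical names, and the stand-in thread exits immediately, whereas a real loader iterator would be obtained with `iter(loader)` and consumed first.

```python
import gc
import threading


class StandInIterator(threading.Thread):
    """Hypothetical stand-in for a loader iterator that subclasses Thread."""

    def __init__(self):
        super().__init__(daemon=True)

    def run(self):
        # A real loader thread would produce batches; this one exits at once.
        pass


def diagnose(thread, timeout=5.0):
    """Return (finished, referrers) for a thread that has been started."""
    thread.join(timeout=timeout)      # waits at most `timeout` seconds
    finished = not thread.is_alive()  # False means the join timed out
    gc.collect()                      # clear anything that is only cyclic garbage
    return finished, gc.get_referrers(thread)


it = StandInIterator()
it.start()
holder = {"iterator": it}  # simulates some object keeping the iterator alive
finished, referrers = diagnose(it)
print("finished:", finished)
for ref in referrers:
    print(type(ref).__name__)
```

Using a timeout on `join` keeps the check from hanging if the thread really is stuck. The referrer list also includes the namespace that holds the variable itself, so the interesting entries are the ones you did not create.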

GuillaumeLeclerc avatar Jul 16 '22 03:07 GuillaumeLeclerc