Thread leak with FFCV x tqdm
I used FFCV with tqdm for a long-running job and noticed that it crashed from using too many threads (~4000 threads had built up linearly over 30 hours, and the OS eventually stepped in and forbade my process from creating any new threads).
Upon further investigation, I found that when FFCV is used with tqdm, there seems to be a thread-leak (i.e. new threads are created that never get deleted).
Here's a reproduction of the issue: https://gist.github.com/ed1d1a8d/424e5bc83325c93037cfe2de9e457a68
I'm curious if this is an issue with FFCV or an issue with tqdm, and if it is a known problem.
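For anyone who wants to check for this kind of leak on their own setup, here is a minimal sketch that records the live thread count after each full pass over a loader. The helper name `count_threads_per_epoch` and its `wrap` parameter are made up for illustration and are not part of FFCV or tqdm; only `threading.active_count()` from the standard library is used.

```python
import threading


def count_threads_per_epoch(loader, epochs=3, wrap=None):
    """Return the number of live threads seen after each pass over `loader`.

    `wrap` is an optional callable (for example tqdm) that is applied to the
    loader at the start of every epoch. Hypothetical helper, for diagnosis only.
    """
    counts = []
    for _ in range(epochs):
        iterable = wrap(loader) if wrap is not None else loader
        for _ in iterable:
            pass
        # Threads that are still alive after the epoch has finished
        counts.append(threading.active_count())
    return counts
```

If the counts grow by a fixed amount every epoch when called with `wrap=tqdm` but stay flat with `wrap=None`, that would match the behaviour described above.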
TL;DR It seems like the following ways of using FFCV with tqdm are broken:
# This has a thread leak
for _ in tqdm(loader):
    pass
# This also has a thread leak
with tqdm(loader) as pbar:
    for _ in pbar:
        pass
but the following methods are OK:
# Without tqdm, there is no thread leak!
for _ in loader:
    pass
# Manual tqdm is also okay!
with tqdm(total=len(loader)) as pbar:
    for _ in loader:
        pbar.update(1)
The only explanation I could find is that tqdm somehow keeps a reference to the iterator, which in FFCV also extends Thread. With manual tqdm you never hand the iterator to tqdm, so it cannot hold on to it and the problem does not appear. It must therefore be a problem with tqdm. What happens if you call .join() on the iterator after the iteration: does it block forever? If it doesn't block, the thread has completed and tqdm is just keeping a reference to it for some odd reason. You can also inspect the garbage collector to see who is actually holding a reference to the object and blocking its collection.
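A rough sketch of those two checks, using only the standard library. `EpochIterator` here is a hypothetical stand-in for a Thread-based iterator, not FFCV's actual class, and `diagnose` is a made-up helper name; with the real loader you would pass in the iterator object you get back from `iter(loader)`.

```python
import gc
import threading


class EpochIterator(threading.Thread):
    """Hypothetical stand-in for an iterator that extends Thread."""

    def __init__(self):
        super().__init__(daemon=True)

    def run(self):
        pass  # a real iterator would load batches here


def diagnose(it, timeout=5.0):
    """Join `it` with a timeout, then list the types of objects referring to it.

    Returns (finished, referrer_type_names). `it` must already have been
    started, since joining an unstarted thread raises RuntimeError.
    """
    it.join(timeout=timeout)  # the timeout keeps this from blocking forever
    finished = not it.is_alive()
    # gc.get_referrers only reports referrers that the collector tracks
    # (dicts, lists, frames, instances, ...), so the list may be incomplete.
    referrers = gc.get_referrers(it)
    return finished, [type(r).__name__ for r in referrers]
```

If `finished` is True but the thread objects keep piling up, the referrer list should point at whatever is still holding them.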