
Slowdown with latest release

Open justusschock opened this issue 6 years ago • 16 comments

Hi,

I just wanted to let you know that I am having some issues with the latest release.

The main issue is a massive slowdown in our CI/CD (from about 45-50 minutes for all jobs in the matrix up to 14 hours or more).

Here is a build with the latest release (0.19.4, installed from PyPI) and here is the same build with release 0.19.3 (installed from PyPI). These builds are completely identical apart from the batchgenerators version.

Unfortunately I did not have time to pinpoint the error (yet).

Best, Justus

justusschock avatar Sep 09 '19 12:09 justusschock

Hi, thanks for letting me know! I did not observe any kind of reduction in speed. It would be great if you could put together a minimal example with which I can reproduce this behavior. Best, Fabian

FabianIsensee avatar Sep 09 '19 12:09 FabianIsensee

I'll try, but I don't think this will be that easy, since I could not reproduce it with the exact same tests on my local machine. In our CI/CD, however, this behavior was consistent across multiple runs and branches.

justusschock avatar Sep 09 '19 13:09 justusschock

Hi, I have some time today to work on issues such as this one. Unfortunately I don't know what the problem is, because everything works just fine in all my experiments. Still, I am a performance guy and I want this code to perform well for everybody :-) So: have you had the opportunity to create a code snippet that reproduces the problem? That would help a lot. Best, Fabian

FabianIsensee avatar Sep 13 '19 08:09 FabianIsensee

Hi, unfortunately I was not able to create a simple snippet for this, since we use batchgenerators in the midst of our framework. I will see how much I can simplify things, but in general we just use the multithreaded augmenter with a subclass of the DataLoader and some additional queues for interprocess communication. Everything works fine with batchgenerators 0.19.3 but not with the latest release.
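
For illustration, a stripped-down sketch of that kind of setup (hypothetical data and names, assuming the SlimDataLoaderBase and MultiThreadedAugmenter APIs of these releases; this is not our actual framework code):

```python
import numpy as np
from batchgenerators.dataloading.data_loader import SlimDataLoaderBase
from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter

class RandomLoader(SlimDataLoaderBase):
    # dummy loader; real code would load and preprocess actual samples
    def generate_train_batch(self):
        return {'data': np.random.random((self.batch_size, 1, 32, 32)).astype(np.float32)}

loader = RandomLoader(data=None, batch_size=4)
# transform=None passes batches through unchanged; our real pipeline plugs
# in transforms plus the extra interprocess-communication queues here
augmenter = MultiThreadedAugmenter(loader, transform=None, num_processes=2)
for _ in range(10):
    batch = next(augmenter)
augmenter._finish()  # shut down the worker processes
```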

justusschock avatar Sep 13 '19 12:09 justusschock

Can you be a little more specific? Is the CPU usage high but nothing happens? Is the CPU not properly utilized? Where does it seem to hang? Best, Fabian

FabianIsensee avatar Sep 13 '19 12:09 FabianIsensee

I can't tell you anything about CPU usage and the like (sorry!), since this issue only occurs in our CI/CD and not on my local machine, which makes it hard to reproduce. I'll try my best to reproduce it on my local machine with a minimal snippet.

justusschock avatar Sep 13 '19 12:09 justusschock

Hi there, any news on this issue?

FabianIsensee avatar Jan 31 '20 10:01 FabianIsensee

Hi Fabian, thanks for getting back to this. Unfortunately not. At some point we did our own reimplementation of the multiprocessing part to better fit our pipeline, so I stopped pursuing this. Sorry!

justusschock avatar Jan 31 '20 10:01 justusschock

OK then. Do you have any idea what could have caused this?

FabianIsensee avatar Jan 31 '20 10:01 FabianIsensee

Unfortunately I don't. Maybe it was just some issue with our integration, since others aren't experiencing the same thing.

justusschock avatar Jan 31 '20 10:01 justusschock

One thing our implementation does not handle well is re-instantiating the multithreaded augmenter all the time: it can take a while to shut down and can therefore cause delays. Does this sound familiar to you?

FabianIsensee avatar Jan 31 '20 10:01 FabianIsensee

We re-instantiated it twice per epoch, but we had a look at that and it seems all the processes were terminated. To my understanding this should have been fine.
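
Roughly this pattern, simplified (the loaders, the transform and the train_one_epoch/validate callables stand in for our framework code):

```python
from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter

def run_training(num_epochs, train_loader, val_loader, transform,
                 train_one_epoch, validate):
    for epoch in range(num_epochs):
        # fresh augmenters each epoch: one for training, one for validation
        train_gen = MultiThreadedAugmenter(train_loader, transform, num_processes=8)
        val_gen = MultiThreadedAugmenter(val_loader, transform, num_processes=8)
        train_one_epoch(train_gen)
        validate(val_gen)
        # explicit shutdown; if this blocks (e.g. waiting on the pin_memory
        # thread), the delay is paid twice per epoch
        train_gen._finish()
        val_gen._finish()
```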

justusschock avatar Jan 31 '20 10:01 justusschock

The processes are not the issue. The problem lies in the pin_memory_loop, which, for some reason I don't understand, does not terminate :-/
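
To illustrate the shape of the problem (a simplified stand-in, not the actual batchgenerators code): a loop like this only re-checks the abort event between items, so a get() that never returns keeps the thread alive forever:

```python
import threading
from queue import Queue

def pin_memory_loop(in_queue: Queue, out_queue: Queue, abort_event: threading.Event):
    while not abort_event.is_set():
        # blocks indefinitely once the workers are gone and nothing arrives,
        # so the abort_event set during shutdown is never checked again
        item = in_queue.get()
        out_queue.put(item)
```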

FabianIsensee avatar Jan 31 '20 11:01 FabianIsensee

Ah okay. This is just a theory, but have you tried joining the thread as they do in PyTorch? Besides that, everything seems to be the same when it comes to the pin_memory part.

justusschock avatar Jan 31 '20 11:01 justusschock

I just tried it and unfortunately it does not work. The thread is not joining. I believe this may be caused by some objects not being freed. Maybe the workers did not release some file handles at their end of the pipe or something similar, causing the queues not to be closed.

FabianIsensee avatar Jan 31 '20 11:01 FabianIsensee

This may be the case. Maybe it's just what they found here: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L926, namely that they have to send one last item through the queue just so the loop wakes up and checks the event? But this is just guessing based on a comparison of your code and theirs.
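
Something along these lines, i.e. make sure the blocking get() can return so the event actually gets checked (a sketch of the idea, not PyTorch's actual code):

```python
import threading
from queue import Empty, Queue

SENTINEL = "end"  # final dummy item pushed into the queue during shutdown

def pin_memory_loop(in_queue: Queue, out_queue: Queue, abort_event: threading.Event):
    while not abort_event.is_set():
        try:
            # the timeout guarantees the event is re-checked periodically
            item = in_queue.get(timeout=1.0)
        except Empty:
            continue
        if item == SENTINEL:  # the "one last thing" that unblocks the loop
            break
        out_queue.put(item)
```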

justusschock avatar Jan 31 '20 11:01 justusschock