
MultiProcDataset + Postprocessing = CPU overcommit?

Open · NeoLegends opened this issue 2 months ago · 6 comments

@Icemole reports a case where he uses a PostprocessingDataset inside a MultiProcDataset. He finds that each MultiProcDataset worker uses more than one thread for its computation, resulting in a CPU overcommit (because the number of assigned CPUs matches the number of data processes), adversely affecting performance.

I wonder: is this a setup/config issue or a systematic issue of our current data loading pipeline? I can hardly believe that this hasn't come up anywhere before. E.g. speed perturbation seems like one use case where this would happen naturally (due to the convolutions involved)? We use librosa for that; does it not use more than one core? In his setup there is torchaudio.add_noise.

ReturnnTrainingJob sets OMP_NUM_THREADS=self.rqmt["cpu"], so I can see where the overcommit comes from: the child processes inherit that variable and think they have that many threads available. So, should we perhaps set OMP_NUM_THREADS=1 for the dataset worker processes (unless specified otherwise)? I don't think it's wise to set OMP_NUM_THREADS=1 for the main proc. I can have this wired up quickly if you think this is sensible, @albertz.
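Roughly what I have in mind, as a sketch (the worker entry point here is made up, i.e. not the actual MultiProcDataset code; the point is just that the child must overwrite the env var before numpy/torch initialize their thread pools):

```python
import multiprocessing as mp
import os


def _dataset_worker_main(worker_index: int) -> None:
    # Hypothetical worker entry point (not the real MultiProcDataset code):
    # overwrite the OMP_NUM_THREADS inherited from ReturnnTrainingJob so that
    # each dataset worker accounts for roughly one core of load.
    os.environ["OMP_NUM_THREADS"] = "1"
    import numpy  # noqa: F401  # imported only after the env var is in place
    print(f"worker {worker_index}: OMP_NUM_THREADS={os.environ['OMP_NUM_THREADS']}")
    # ... the normal worker loop (dataset init, feeding the queue) would follow here


if __name__ == "__main__":
    os.environ["OMP_NUM_THREADS"] = "12"  # what the parent training job env looks like
    ctx = mp.get_context("spawn")  # spawn: children import numpy/torch fresh
    procs = [ctx.Process(target=_dataset_worker_main, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```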

htop screenshot: (image attached)

NeoLegends · Oct 15 '25, 08:10

resulting in a CPU overcommit (because the number of assigned CPUs matches the number of data processes), adversely affecting performance.

How do you know this is really negatively affecting performance? I think you cannot really say that in general. In general, or actually in most cases, I would not assume that there is a problem. I would also assume that mostly only the main thread is active, and the other threads are mostly idle. E.g. if this is some thread from Numpy or so, most of the actual Python logic still happens in the main thread, and the other threads only become active when there is some actual compute happening, which is much less. If the CPU uses hyperthreading, and/or the data is local, I would also assume that having multiple threads there still gives better overall performance.

What kind of computation do you (or @Icemole) do in the PostprocessingDataset?

Why do you want to limit it to a single thread? I don't think this is a good idea in general.

I wonder: is this a setup/config issue or a systematic issue of our current data loading pipeline?

What do you mean by issue? You did not expect this, and don't understand/know where the thread is coming from?

Why is this related to the current data loading pipeline?

albertz · Oct 15 '25, 09:10

I would also assume that mostly only the main thread is active, and the other threads are mostly idle. E.g. if this is some thread from Numpy or so, most of the actual Python logic still happens in the main thread, and the other threads only become active when there is some actual compute happening, which is much less.

Look at the load values. The job is assigned 48 cores by SLURM, but seems to produce a ~190 15min load average. This means that it's not just the main threads that are active, and that there is a lot of context switching going on. I mean sure, I have not measured this, but it looks pathological to me. I think this is a 4-GPU job with a 12-core rqmt, so SLURM assigns 48 cores. Together with the logic from ReturnnTrainingJob, this means each worker process sees OMP_NUM_THREADS=12. I think MultiProcDataset was configured here to use four worker procs per GPU, so we get 4 GPUs * 4 workers * 12 threads = 192 threads that want to be active (hence the load of ~188).
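As a quick sanity check on those numbers (the four workers per GPU are my assumption from the config):

```python
num_gpus = 4
cores_per_gpu = 12         # rqmt["cpu"] per trainer, i.e. OMP_NUM_THREADS=12
workers_per_gpu = 4        # MultiProcDataset num_workers (assumed)

assigned_cores = num_gpus * cores_per_gpu                    # 48 cores from SLURM
active_threads = num_gpus * workers_per_gpu * cores_per_gpu  # 192 threads
print(assigned_cores, active_threads)  # 48 192 -> matches the ~190 load average
```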

What kind of computation do you (or @Icemole) do in the PostprocessingDataset?

I think this postprocessor adds noise data with some specific SNR to the training audio data. But it doesn't actually matter what it does, it should not cause this erratic behavior.

What do you mean by issue? You did not expect this, and don't understand/know where the thread is coming from?

Kind of. I know where the threads are coming from (OMP_NUM_THREADS=12 in this case; set by ReturnnTrainingJob), but I would expect each dataset worker to represent one core of load, so that I can budget correctly / set my Sisyphus rqmt accordingly.

Why is this related to the current data loading pipeline?

I think this behavior is erroneous and we should do something about it. I think the user expects that when they use MultiProcDataset(num_workers=4), they will generate load on four additional processor cores, and not on 4*OMP_NUM_THREADS.

If OMP_NUM_THREADS factors into how much load your data workers generate, you are always going to overload/overcommit your job allocation, and in the current setup the user has no way to fix this / reduce the overcommit. In this situation you cannot set the job rqmt correctly anymore. Obviously, increasing the job's rqmt to more cores is not a solution, because then the worker processes simply end up spawning more threads. To set it correctly, you would want to take into account the number of worker processes, plus the main proc, plus some overhead / headroom.
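I.e., with workers that each represent one core of load, the budgeting per trainer proc would be roughly something like:

```python
num_workers = 4  # MultiProcDataset workers per trainer proc
main_proc = 1    # the trainer / main dataset proc itself
headroom = 1     # some slack for I/O, checkpointing, etc.
cpu_rqmt = num_workers + main_proc + headroom  # -> rqmt["cpu"] = 6 per GPU
print(cpu_rqmt)
```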

If we force OMP_NUM_THREADS=1 on the workers, the pipeline suddenly becomes predictable again in how much load it generates / how many processor threads it is going to use.

NeoLegends · Oct 15 '25, 10:10

But it doesn't actually matter what it does, it should not cause this erratic behavior.

I mean, of course, unless you're manually spawning threads. But that is not what the code does; it merely uses torchaudio.functional.add_noise plus some numpy functions, which through their respective BLAS plumbing end up honoring OMP_NUM_THREADS.
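For reference, the postprocessing is roughly of this shape (a sketch; the exact parameters and where the noise clip comes from don't matter, the point is that nothing here spawns threads explicitly):

```python
import numpy as np
import torch
import torchaudio.functional as F


def add_noise_to_audio(audio: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    # Mix a noise clip into the training audio at a given SNR (in dB).
    # Nothing spawns threads explicitly, but numpy and torch/torchaudio dispatch
    # to BLAS/OpenMP pools that size themselves according to OMP_NUM_THREADS.
    # Assumes the noise clip is at least as long as the audio.
    audio_t = torch.from_numpy(audio.astype(np.float32)).unsqueeze(0)                     # (1, L)
    noise_t = torch.from_numpy(noise[: audio.shape[-1]].astype(np.float32)).unsqueeze(0)  # (1, L)
    snr = torch.tensor([snr_db], dtype=torch.float32)
    return F.add_noise(audio_t, noise_t, snr).squeeze(0).numpy()
```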

NeoLegends · Oct 15 '25, 10:10

Look at the load values. The job is assigned 48 cores by SLURM, but seems to produce a ~190 15min load average.

But why is this bad? What I mean is: is the overall speed of data loading, and maybe also the overall training speed, actually negatively affected? Or is it maybe even faster with this configuration?

I know where the threads are coming from (OMP_NUM_THREADS=12 in this case; set by ReturnnTrainingJob)

I mean more specifically: you assume this is via some Numpy computation? Did you actually verify this? Just look at it with py-spy, or with GDB (to also see the native threads). What are those threads?

I think this behavior is erroneous and we should do something about it. I think the user expects that when they use MultiProcDataset(num_workers=4), they will generate load on four additional processor cores, and not on 4*OMP_NUM_THREADS.

I'm not sure. I actually would expect, and also want, 4*OMP_NUM_THREADS to be used.

Unless you can really show me that this is bad. Then I would still expect this behavior, but maybe we should do something about it. We could have some option like custom_env_overwrites for the MultiProcDataset, and there you can overwrite OMP_NUM_THREADS to whatever you like. We can also maybe discuss better defaults.
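E.g. something like this in the dataset config (purely hypothetical: neither the option name nor its semantics exist yet, this is just the proposal spelled out):

```python
def add_noise_postprocessor(tensor_dict, **kwargs):
    # Placeholder for the actual postprocessing function (e.g. some add-noise logic).
    return tensor_dict


train = {
    "class": "MultiProcDataset",
    "dataset": {
        "class": "PostprocessingDataset",
        "dataset": {"class": "HDFDataset", "files": ["train.hdf"]},  # just some wrapped dataset
        "map_seq": add_noise_postprocessor,
    },
    "num_workers": 4,
    "buffer_size": 10,
    # Proposed knob, does not exist yet:
    "custom_env_overwrites": {"OMP_NUM_THREADS": "1"},
}
```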

albertz · Oct 15 '25, 10:10

I'm not sure. I actually would expect, and also want, 4*OMP_NUM_THREADS to be used.

Just for completeness: why would you want this/why do you think this behavior is desirable? This seems not very controllable to me, and I'd rather expect num_workers=N to be the knob I want to set. I can overcommit the CPU just as well (and get efficient use of hyperthreading etc) by spawning more worker processes than I have cores allocated to the job, with the advantage that I have better control over the numbers.

NeoLegends · Oct 15 '25, 13:10

I explained this already: when considering hyperthreading and/or data locality, I can imagine that multiple threads per worker can be beneficial anyway (independent of how many other workers there are, and of whether you are overcommitting the CPU or not). I.e. that's why I think it can make sense to just always have the same number of threads per worker, no matter how many workers you use, and then the total number of threads can be much larger than the available cores. And overcommitting the CPU is maybe a good thing.

But again: this is my assumption, which is now the opposite of your assumption. That's why I asked to actually measure this. We should anyway have some way to make this explicit, e.g. the custom_env_overwrites that I proposed. And then I would be very much interested: for some given number of workers (let's say 4), what is the optimal number of threads per worker that gives you the fastest dataset loading? And what is the optimal number of threads per worker that gives you the fastest training? (The second question might have a different optimum, because there, CPU overcommitment can maybe matter more, as it might also influence the main thread in the main proc.)
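A simple way to get such numbers would be roughly (a sketch; benchmark_dataset.py is an assumed helper script that just builds the MultiProcDataset pipeline and iterates a fixed number of seqs, nothing more):

```python
import os
import subprocess
import sys
import time

# Sweep the per-worker thread count and time pure data loading.
for omp_threads in ("1", "2", "4", "12"):
    env = dict(os.environ, OMP_NUM_THREADS=omp_threads)
    start = time.monotonic()
    subprocess.run([sys.executable, "benchmark_dataset.py"], env=env, check=True)
    print(f"OMP_NUM_THREADS={omp_threads}: {time.monotonic() - start:.1f} s")
```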

albertz · Oct 15 '25, 18:10