Albert Zeyer
From [the documentation on AMP on Working with Multiple GPUs](https://pytorch.org/docs/stable/notes/amp_examples.html#id9), it sounds like it should already be fine if we use [DistributedDataParallel, one GPU per process](https://pytorch.org/docs/stable/notes/amp_examples.html#id11).
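For reference, a minimal sketch of what the linked docs describe for the per-process case (names like `model`, `optimizer`, `loss_fn` are placeholders, not from the actual setup):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; `model` is already wrapped in DDP and moved to its device.
scaler = torch.cuda.amp.GradScaler()  # one scaler per process

def train_step(model: DDP, optimizer, data, targets, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # Autocast only wraps the forward pass and the loss computation.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(data)
        loss = loss_fn(output, targets)
    # The scaler works per process; backward triggers the usual DDP gradient all-reduce.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```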
> so you didn't observe a decrease in memory consumption?

Compared to what? Of course, enabling AMP (I usually use bfloat16 without a grad scaler) reduces GPU memory. But that is...
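To be concrete about what I mean (a sketch, not the actual config): with bfloat16 there is no `GradScaler` involved at all, only the autocast context.

```python
import torch

def train_step_bf16(model, optimizer, data, targets, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 has the same exponent range as float32, so no loss scaling is needed.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(data), targets)
    loss.backward()
    optimizer.step()
```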
As written on Slack, this looks like a classic memory leak in the main proc (see the memory logs in the first log). The memory usage increases very linearly....
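To make that easier to check over a longer run, something like this (a sketch, assuming `psutil` is installed; `log_rss` would be called once per step) logs the RSS of the main proc so you can see whether it really grows linearly with the number of steps:

```python
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(step: int):
    # Resident set size of the main process, in MB.
    rss_mb = _proc.memory_info().rss / (1024 ** 2)
    print(f"step {step}: RSS {rss_mb:.1f} MB")
```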
As far as I can see, the only slightly uncommon thing here is the `fast_bw` loss, which uses Sprint (RASR) in a subprocess to compute the FSA. I wonder if that...
Can you run this in a memory profiler and report the observations?
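For example (just a sketch with the standard-library `tracemalloc`; any other memory profiler is fine too, though note this only sees Python-level allocations, not native TF memory):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
_prev = tracemalloc.take_snapshot()

def report_growth():
    # Call this e.g. every few hundred steps and post the output.
    global _prev
    snap = tracemalloc.take_snapshot()
    for stat in snap.compare_to(_prev, "lineno")[:10]:
        print(stat)  # biggest allocation growth since the last snapshot
    _prev = snap
```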
Also, can you add some more details:

- Since when does this happen?
- What did you change before this started happening? TensorFlow version? RETURNN version? Something else?
> An identical setup with `feature_args = {"class": "LogMelNetwork", "wave_norm": True, "frame_size": 200, "frame_shift": 80, "fft_size": 256}` works fine.

Are you sure? Can you try again and post also...
As I wrote on Slack: it's very strange that the only change is the feature extraction, which is part of the network. So only the TF computation graph changes a bit,...
How often do you get hangs vs. OOM? How long does it take until that happens? Next time it hangs, it would be very interesting if you could log in...
(Is this really where it hangs? Or is this still during normal training?) What is the native stacktrace (via GDB)? You should see it hang somewhere inside the native TF...
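Independent of GDB (which is what you need for the native TF frames), you could also register `faulthandler` up front, so that next time it hangs you at least get the Python-level stacks of all threads without attaching a debugger (a sketch):

```python
import faulthandler
import signal

# Dump the Python stack of all threads to stderr when the process receives SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Then, on the node where it hangs:  kill -USR1 <pid>
```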