Albert Zeyer
From [the documentation on AMP on Working with Multiple GPUs](https://pytorch.org/docs/stable/notes/amp_examples.html#id9), it sounds like it should already be fine if we use [DistributedDataParallel, one GPU per process](https://pytorch.org/docs/stable/notes/amp_examples.html#id11).
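For reference, a minimal sketch of what the linked docs describe for the per-process case (names like `model`, `optimizer`, `loss_fn` are placeholders, not from the actual setup):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; `model` is already wrapped in DDP and moved to its device.
scaler = torch.cuda.amp.GradScaler()  # one scaler per process

def train_step(model: DDP, optimizer, data, targets, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # Autocast only wraps the forward pass and the loss computation.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(data)
        loss = loss_fn(output, targets)
    # The scaler works per process; backward triggers the usual DDP gradient all-reduce.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```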
> so you didn't observe a decrease in memory consumption?

Compared to what? Of course, enabling AMP (I usually use bfloat16 without a grad scaler) reduces GPU memory. But that is...
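To be concrete about what I mean (a sketch, not the actual config): with bfloat16 there is no `GradScaler` involved at all, only the autocast context.

```python
import torch

def train_step_bf16(model, optimizer, data, targets, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # bfloat16 has the same exponent range as float32, so no loss scaling is needed.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(data), targets)
    loss.backward()
    optimizer.step()
```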
As written on Slack, this looks like a classic memory leak in the main proc (see the memory logs in the first log). The memory usage increases very linearly....
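To make that easier to check over a longer run, something like this (a sketch, assuming `psutil` is installed; `log_rss` would be called once per step) logs the RSS of the main proc so you can see whether it really grows linearly with the number of steps:

```python
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(step: int):
    # Resident set size of the main process, in MB.
    rss_mb = _proc.memory_info().rss / (1024 ** 2)
    print(f"step {step}: RSS {rss_mb:.1f} MB")
```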
As far as I can see, the only slightly uncommon thing here is the `fast_bw` loss, which uses Sprint (RASR) in a subprocess to compute the FSA. I wonder if that...
Can you run this in a memory profiler and report the observations?
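For example (just a sketch with the standard-library `tracemalloc`; any other memory profiler is fine too, though note this only sees Python-level allocations, not native TF memory):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
_prev = tracemalloc.take_snapshot()

def report_growth():
    # Call this e.g. every few hundred steps and post the output.
    global _prev
    snap = tracemalloc.take_snapshot()
    for stat in snap.compare_to(_prev, "lineno")[:10]:
        print(stat)  # biggest allocation growth since the last snapshot
    _prev = snap
```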
Also, can you add some more details:

- Since when does this happen?
- What did you change before this started happening? TensorFlow version? RETURNN version? Something else?
> An identical setup with `feature_args = {"class": "LogMelNetwork", "wave_norm": True, "frame_size": 200, "frame_shift": 80, "fft_size": 256}` works fine.

Are you sure? Can you try again and post also...
As I wrote on Slack: it's very strange that the only change is the feature extraction, which is part of the network. So only the TF computation graph changes a bit,...
How often do you get hangs vs. OOM? How long does it take until that happens? Next time it hangs, it would be very interesting if you could log in...
(Is this really where it hangs? Or is this still during normal training?) What is the native stacktrace (via GDB)? You should see it hang somewhere inside the native TF...
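Independent of GDB (which is what you need for the native TF frames), you could also register `faulthandler` up front, so that next time it hangs you at least get the Python-level stacks of all threads without attaching a debugger (a sketch):

```python
import faulthandler
import signal

# Dump the Python stack of all threads to stderr when the process receives SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Then, on the node where it hangs:  kill -USR1 <pid>
```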