Albert Zeyer
> What is the native stacktrace (via GDB)? You should see it hang somewhere inside the native TF lib. Some current observations on that: Ignoring any waiting threads (`__GI___poll`, `__pthread_cond_timedwait`,...
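To get such a native stacktrace, one way (a command sketch; `<PID>` stands for the hanging process ID) is to attach GDB non-interactively and dump the backtraces of all threads:

```shell
# Attach to the running process, print native backtraces of all threads, detach.
gdb -p <PID> -batch -ex "thread apply all bt"
```

Threads sitting in `__GI___poll` or `__pthread_cond_timedwait` are just waiting and can usually be ignored; the interesting thread is the one actually stuck inside the native TF lib.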
Ah, very interesting. Initially you said you could not reproduce it? So what change was relevant now? Having the larger batch size?
Can you report that in the TensorFlow GitHub issues and link it here?
If you take the same script but replace TF by PyTorch, how does that behave?
If you play around with some other things, e.g. `n_feat` or `n_hidden`, does the mem leak still occur?
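To quantify whether the leak still occurs for different settings, a coarse way is to track the process RSS across train steps. A minimal sketch (the names `rss_kb`, `check_leak` and the `step_fn` placeholder are hypothetical, not from the repro script; `ru_maxrss` is a high-water mark, so this only detects growth in peak memory):

```python
import resource


def rss_kb() -> int:
    # Peak resident set size of this process (KiB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def check_leak(step_fn, n_steps: int = 5) -> int:
    """Run step_fn repeatedly and return peak-RSS growth in KiB.

    step_fn stands in for one train step of the repro script,
    e.g. with varied n_feat / n_hidden.
    """
    before = rss_kb()
    for _ in range(n_steps):
        step_fn()
    return rss_kb() - before
```

If the growth per step stays roughly constant while `n_feat` or `n_hidden` change, that would hint the leak is independent of those dims.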
The PR for padding the time dim (https://github.com/rwth-i6/returnn/pull/1468) is merged now. To continue the discussion here: > I compared padding the time dim for the first two layers on the...
One question is also why it hangs at exit. **Edit** Moved that as a separate issue to #1497.
> RuntimeError: CUDA error: unspecified launch failure Maybe related: https://github.com/pytorch/pytorch/issues/74235
I'm closing for now, assuming a hardware issue. Reopen if there is any indication that there is some other problem, or sth we can do about it.
I'm getting this quite frequently now. In most cases in a multi-GPU training setup on Nvidia 1080 GPUs (but probably that's just because that is currently my main setup, and...