returnn
The RWTH extensible training framework for universal recurrent neural networks
Training was running fine for 29 subepochs but then crashed with a CPU OOM. While I sometimes see CPU OOMs in my setup, that is usually only after longer trainings. So it...
```
...
ep 1 train, step 294, ctc_4 4.553, ctc_8 4.531, ctc 4.510, num_seqs 11, max_size:time 201384, max_size:out-spatial 149, mem_usage:cuda:0 5.9GB, 0.411 sec/step
ep 1 train, step 294, ctc_4 4.516,...
```
Closes #1575. This probably didn't need a PR, but I'm unsure whether there was a good reason for the v4 verbosity or not. Feel free to merge immediately if there wasn't.
I find log verbosity 3 a reasonable level for "daily" work/trainings, but I'm missing the model structure in the log, because it is only printed at the v4 level. Is...
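For context, a minimal sketch of how verbosity is usually set in a RETURNN config (a Python file); the exact behavior per level is as described in the snippet above, and the comments here are illustrative:

```python
# RETURNN config fragment (configs are plain Python).
# log_verbosity controls how much is printed during training:
#   3: reasonable for daily trainings
#   4: additionally prints the model structure, among other things
log_verbosity = 4
```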
Yesterday I started a training with DistributeFilesDataset and file caching. It crashed today, and it consistently crashes after restarting, with what I think is `OSError: AF_UNIX path too long` in the...
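For background, Linux limits AF_UNIX socket paths (`sun_path`) to about 107 bytes, so deeply nested cache or temp directories can push a socket path over the limit. A standalone sketch (unrelated to the dataset code itself) that reproduces the same error:

```python
import os
import socket
import tempfile

# Build a path well over the ~107-byte sun_path limit.
long_dir = tempfile.mkdtemp(prefix="x" * 50)
sock_path = os.path.join(long_dir, "y" * 100 + ".sock")

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    sock.bind(sock_path)  # fails: path exceeds the sun_path limit
except OSError as exc:
    print(exc)  # "AF_UNIX path too long"
finally:
    sock.close()
    os.rmdir(long_dir)
```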
In my current language model training I sometimes get "nan" gradients, which break the training. Surprisingly, just restarting the training from the last checkpoint is often enough to resume...
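One common mitigation for this class of problem (not necessarily what RETURNN does internally) is to check the gradients for finiteness and skip the update step for that batch, which has a similar effect to restarting from a checkpoint. A minimal sketch, assuming a plain PyTorch training loop:

```python
import torch

def step_with_nan_guard(model: torch.nn.Module,
                        optimizer: torch.optim.Optimizer,
                        loss: torch.Tensor):
    """Backprop, but skip the optimizer update if any gradient is non-finite."""
    optimizer.zero_grad()
    loss.backward()
    grads_finite = all(
        torch.isfinite(p.grad).all()
        for p in model.parameters()
        if p.grad is not None
    )
    if grads_finite:
        optimizer.step()
    else:
        # Drop this batch's update; the nan often does not reappear
        # on the next batch.
        print("non-finite gradients, skipping optimizer step")
```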
For large datasets, a blocklist can be much more compact than an allowlist in some cases, e.g. if you want to exclude 1k segments out of 1M.
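A minimal sketch of the idea (the file name and helper are illustrative, not the actual RETURNN API):

```python
# With 1k exclusions out of 1M segments, listing the excluded segments
# is far more compact than enumerating the ~999k allowed ones.
with open("excluded_segments.txt") as f:
    blocklist = {line.strip() for line in f}

def keep_segment(seg_name: str) -> bool:
    return seg_name not in blocklist
```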
I had this bug:
```python
log_prob = ...  # [B,T+1,D]
targets = ...   # [B,T] -> D
loss = rf.cross_entropy(target=targets, estimated=log_prob, ...)
loss.mark_as_loss(...)
```
What you get here is *no*...
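A cheap guard against this class of bug, sketched in plain PyTorch rather than the dim-tagged RETURNN-frontend tensors above (those would need the analogous check on their time dims): assert the non-class dims of the log-probabilities match the target shape before computing the loss, so a [B,T+1,D] vs. [B,T] mismatch fails loudly instead of slipping through.

```python
import torch
import torch.nn.functional as F

def checked_cross_entropy(log_prob: torch.Tensor,
                          targets: torch.Tensor) -> torch.Tensor:
    """Cross entropy over the last dim of log_prob, with an explicit
    shape check on the remaining (batch/time) dims."""
    assert log_prob.shape[:-1] == targets.shape, (
        f"batch/time dims mismatch: "
        f"{tuple(log_prob.shape[:-1])} vs {tuple(targets.shape)}"
    )
    # nll_loss expects log-probabilities; flatten batch/time for the call.
    return F.nll_loss(
        log_prob.flatten(0, -2), targets.flatten(), reduction="none"
    ).reshape(targets.shape)
```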
In some Python file, we still have this, which basically survived for many years:
```
__author__ = "Patrick Doetsch"
__copyright__ = "Copyright 2015"
__credits__ = ["Patrick Doetsch", "Paul Voigtlaender"]
__license__...
```