returnn
The RWTH extensible training framework for universal recurrent neural networks
``` ... ep 42 train, step 2112, ctc_4 1.526, ctc_8 1.055, ctc 0.874, consistency 0.461, aed_ce 0.307, aed_fer 0.050, grad_norm:p2 2.938, num_seqs 45, max_size:time 263912, max_size:out-spatial 105, mem_usage:cuda 64.7GB, 0.676...
I have gotten this twice now, at the end of an otherwise successful training run: ``` ... Uname: uname_result(system='Linux', node='w23g0002.hpc.itc.rwth-aachen.de', release='4.18.0-553.22.1.el8_10.x86_64', version='#1 SMP Wed Sep 25 09:20:43 UTC 2024', machine='x86_64') Load: (0.17, 0.24,...
Currently, in multiple places (where exactly?), we assume for some tensor `x: Tensor` that `max(x.dims[i].dyn_size_ext.raw_tensor) == x.raw_tensor.shape[i]`. We want to support the case where `max(x.dims[i].dyn_size_ext.raw_tensor) < x.raw_tensor.shape[i]`. Specifically,...
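To make the invariant concrete, here is a minimal sketch using plain Python (hypothetical shapes and sizes, not RETURNN's actual `Tensor`/`Dim` API): the padded axis of the raw tensor is currently assumed to be exactly as long as the longest sequence, while the proposed generalization would also allow over-padding.

```python
# Current assumption: for a dynamic axis i, the padded raw shape equals
# the maximum of the per-sequence dynamic sizes.
x_shape = (3, 7)           # batch=3, padded time axis of length 7
dyn_sizes = [7, 5, 2]      # actual sequence lengths along that axis

assert max(dyn_sizes) == x_shape[1]  # holds today

# Proposed generalization: the raw tensor may be padded beyond the
# longest sequence, so only "<=" is guaranteed.
x_over_shape = (3, 10)     # same sequences, over-padded time axis
assert max(dyn_sizes) < x_over_shape[1]
assert max(dyn_sizes) <= x_over_shape[1]  # the relaxed invariant
```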
When you enable `calculate_exp_loss`, it does not calculate the exp loss for all losses, but currently only for those with `as_error=False`. This decision was somewhat arbitrary. The exp loss often...
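As a hedged sketch of what "exp loss" typically means here (my reading, not RETURNN's actual implementation): the exponential of the per-frame average loss, which for a cross-entropy loss in nats is the perplexity.

```python
import math

def exp_loss(total_loss: float, num_frames: int) -> float:
    """Exp of the average loss; for a CE loss in nats this is the perplexity.
    (Hypothetical helper for illustration only.)"""
    return math.exp(total_loss / num_frames)

# e.g. a per-frame CE of 0.307 nats (cf. aed_ce in the log above)
# corresponds to a perplexity of roughly exp(0.307) ~ 1.36
```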
#1621 describes an issue where there would be a Gloo timeout in the worker processes when the master process takes longer than 30min for the eval step. This was fixed...
I just want to raise this here. A lot of the documentation, including the README, is still TF-specific, and much of it does not even mention that. PyTorch-relevant documentation...
Again a crash. It ultimately failed with this (after retrying a few times): ``` OSError: [Errno 28] No space left on device ``` Log: ``` FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.F xPUVJtw1EeN/output/dataset/train/data-00405-of-00848.arrow...
CI run log [tf-tests (3.8, 2.10.0, TEST=TFNetworkLayer)](https://github.com/rwth-i6/returnn/actions/runs/18276719792/job/52030410528?pr=1774#logs). ``` Python env: python is /opt/hostedtoolcache/Python/3.8.18/x64/bin/python Python 3.8.18 NumPy: 1.24.4 TensorFlow: v2.10.0-rc3-6-g359c3cdfc5f 2.10.0 /home/runner/.local/lib/python3.8/site-packages/tensorflow/__init__.py ``` Relevant log: ``` ___________________________ test_ConvLayer_empty_out ___________________________ Traceback (most...
@Icemole reports a case where he uses a PostprocessingDataset inside a MultiProcDataset. He finds that each MultiProc worker uses more than one thread for its computation, resulting in CPU overcommit...
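One common mitigation for this kind of overcommit is to cap the compute threads each worker process may spawn. A minimal sketch (the environment variables are the standard OpenMP/MKL/OpenBLAS knobs; whether a given worker honors them depends on which libraries it actually uses, and the helper name is hypothetical):

```python
import os

def limit_worker_threads(num_threads: int = 1) -> None:
    """Cap per-process compute threads; call early in each worker process."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(num_threads)
    try:
        import torch
        torch.set_num_threads(num_threads)  # limit PyTorch intra-op threads
    except ImportError:
        pass  # torch not available in this process; env vars still apply
```

Note that the environment variables must be set before the numeric libraries initialize their thread pools, which is why this belongs at worker startup rather than mid-run.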