returnn
The RWTH extensible training framework for universal recurrent neural networks
``` ... ep 42 train, step 2112, ctc_4 1.526, ctc_8 1.055, ctc 0.874, consistency 0.461, aed_ce 0.307, aed_fer 0.050, grad_norm:p2 2.938, num_seqs 45, max_size:time 263912, max_size:out-spatial 105, mem_usage:cuda 64.7GB, 0.676...
I have gotten this twice now, at the end of an otherwise successful training run: ``` ... Uname: uname_result(system='Linux', node='w23g0002.hpc.itc.rwth-aachen.de', release='4.18.0-553.22.1.el8_10.x86_64', version='#1 SMP Wed Sep 25 09:20:43 UTC 2024', machine='x86_64') Load: (0.17, 0.24,...
Currently, in multiple places (where exactly?), we assume for some tensor `x: Tensor` that `max(x.dims[i].dyn_size_ext.raw_tensor) == x.raw_tensor.shape[i]`. We want to support the case where `max(x.dims[i].dyn_size_ext.raw_tensor) < x.raw_tensor.shape[i]`. Specifically,...
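To make the invariant concrete, here is a minimal sketch using plain Python (hypothetical shapes and sizes, not RETURNN's actual `Tensor`/`Dim` API): the padded axis of the raw tensor is currently assumed to be exactly as long as the longest sequence, while the proposed generalization would also allow over-padding.

```python
# Current assumption: for a dynamic axis i, the padded raw shape equals
# the maximum of the per-sequence dynamic sizes.
x_shape = (3, 7)           # batch=3, padded time axis of length 7
dyn_sizes = [7, 5, 2]      # actual sequence lengths along that axis

assert max(dyn_sizes) == x_shape[1]  # holds today

# Proposed generalization: the raw tensor may be padded beyond the
# longest sequence, so only "<=" is guaranteed.
x_over_shape = (3, 10)     # same sequences, over-padded time axis
assert max(dyn_sizes) < x_over_shape[1]
assert max(dyn_sizes) <= x_over_shape[1]  # the relaxed invariant
```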
When you enable `calculate_exp_loss`, it does not calculate the exp loss for all losses, but currently only for those with `as_error=False`. This decision was somewhat arbitrary. The exp loss often...
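As a hedged sketch of what "exp loss" typically means here (my reading, not RETURNN's actual implementation): the exponential of the per-frame average loss, which for a cross-entropy loss in nats is the perplexity.

```python
import math

def exp_loss(total_loss: float, num_frames: int) -> float:
    """Exp of the average loss; for a CE loss in nats this is the perplexity.
    (Hypothetical helper for illustration only.)"""
    return math.exp(total_loss / num_frames)

# e.g. a per-frame CE of 0.307 nats (cf. aed_ce in the log above)
# corresponds to a perplexity of roughly exp(0.307) ~ 1.36
```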
#1621 describes an issue where there would be a Gloo timeout in the worker processes when the master process takes longer than 30min for the eval step. This was fixed...
I just want to raise this here. A lot of the documentation, including the README, is still TF-specific, and much of it does not even mention that. PyTorch-relevant documentation...
Again a crash. It ultimately failed with this (after retrying a few times): ``` OSError: [Errno 28] No space left on device ``` Log: ``` FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.F xPUVJtw1EeN/output/dataset/train/data-00405-of-00848.arrow...
CI run log [tf-tests (3.8, 2.10.0, TEST=TFNetworkLayer)](https://github.com/rwth-i6/returnn/actions/runs/18276719792/job/52030410528?pr=1774#logs). ``` Python env: python is /opt/hostedtoolcache/Python/3.8.18/x64/bin/python Python 3.8.18 NumPy: 1.24.4 TensorFlow: v2.10.0-rc3-6-g359c3cdfc5f 2.10.0 /home/runner/.local/lib/python3.8/site-packages/tensorflow/__init__.py ``` Relevant log: ``` ___________________________ test_ConvLayer_empty_out ___________________________ Traceback (most...
@Icemole reports a case where he uses a PostprocessingDataset inside a MultiProcDataset. He finds that each MultiProc worker uses more than one thread for its computation, resulting in CPU overcommit...
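One common mitigation for this kind of overcommit is to cap the compute threads each worker process may spawn. A minimal sketch (the environment variables are the standard OpenMP/MKL/OpenBLAS knobs; whether a given worker honors them depends on which libraries it actually uses, and the helper name is hypothetical):

```python
import os

def limit_worker_threads(num_threads: int = 1) -> None:
    """Cap per-process compute threads; call early in each worker process."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(num_threads)
    try:
        import torch
        torch.set_num_threads(num_threads)  # limit PyTorch intra-op threads
    except ImportError:
        pass  # torch not available in this process; env vars still apply
```

Note that the environment variables must be set before the numeric libraries initialize their thread pools, which is why this belongs at worker startup rather than mid-run.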