returnn issues

PyTorch pretraining, `train_step_callback`, staged training

7

(As initially discussed in #1120.) How to handle pretraining? The currently suggested APIs (`get_model` and co) might need to be changed, because we do not want to call `get_model` every...

albertz

PyTorch ONNX export

98

This issue is to track any aspects and issues on PyTorch (#1120) ONNX export. * [x] Working script for conversion (`export_to_onnx.py`) * [x] Fix issue with convolution * [x] Rename...

albertz

PyTorch

Export to ONNX

5

Hi, is there a way to convert pretrained `returnn` networks to `ONNX`, or at least save the network in `tensorflow's saved model` format? Best, Musharraf

mush42

TensorFlow

PyTorch automatic inf/nan detection, collecting statistics

1

`autograd.detect_anomaly` detects inf/nan in the backward pass. I want to have the same in the forward pass. With the possibility to whitelist a few special operations, modules or code blocks,...

albertz
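A minimal sketch of what the forward-pass check described in this issue could look like, using standard PyTorch forward hooks. This is not RETURNN's implementation: `install_forward_finite_checks`, `NonFiniteForwardError`, and the name-based `whitelist` are hypothetical names chosen for illustration, and only whitelisting by module name is shown (not by operation or code block).

```python
import torch
from torch import nn


class NonFiniteForwardError(RuntimeError):
    """Hypothetical error type: a module output contained inf or nan."""


def install_forward_finite_checks(model: nn.Module, whitelist=()):
    """Register forward hooks that raise when a module outputs inf/nan.

    `whitelist` holds module names (as given by model.named_modules())
    that should be skipped. Returns the hook handles, so the checks can
    be removed again via handle.remove().
    """
    handles = []
    for name, module in model.named_modules():
        if name in whitelist:
            continue

        def hook(mod, inputs, output, _name=name):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outputs:
                if not torch.is_tensor(out) or not out.is_floating_point():
                    continue
                if not torch.isfinite(out).all():
                    raise NonFiniteForwardError(
                        f"non-finite output in module {_name!r} ({type(mod).__name__})"
                    )

        handles.append(module.register_forward_hook(hook))
    return handles
```

The `torch.isfinite(...).all()` check forces a device synchronization on every hooked module, so a real setup would probably enable this only for debugging or only every N steps.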

PyTorch collect model statistics

We should collect some statistics (maybe optionally, configurable) (maybe only every N steps if too costly otherwise). Of: * weights * activations * gradients of weights * gradients of activations...

albertz
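As a rough illustration of the weight and weight-gradient part of this list, the following sketch gathers a few per-parameter statistics after a backward pass. The function name `collect_param_stats` and the chosen statistics are assumptions for illustration only. Activations and their gradients would need forward/backward hooks and are not covered here.

```python
import torch
from torch import nn


def collect_param_stats(model: nn.Module) -> dict:
    """Return simple per-parameter statistics (hypothetical helper).

    For every named parameter: mean, max absolute value and L2 norm of the
    weights, plus the L2 norm of the gradient if one is available.
    """
    stats = {}
    for name, param in model.named_parameters():
        data = param.detach()
        entry = {
            "weight_mean": data.mean().item(),
            "weight_abs_max": data.abs().max().item(),
            "weight_norm": data.norm().item(),
        }
        if param.grad is not None:
            entry["grad_norm"] = param.grad.norm().item()
        stats[name] = entry
    return stats
```

Each `.item()` call copies a value to the host, so calling this only every N steps, as the issue suggests, keeps the overhead bounded.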

PyTorch CUDA OOM in distributed training

11

``` RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-13-21-05 (UTC+0000), pid 2003528, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.lmbYlKeoU6kT/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.lmbYlKeoU6kT/output/returnn.config'] ... Torch: Hostname cn-236, pid 2003531, using GPU 3....

albertz

PyTorch distributed training, could not unlink the shared memory file

``` [2023-12-31 11:33:54,580] INFO: Start Job: Job Task: run ... RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-11-34-07 (UTC+0000), pid 1868636, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/output/returnn.config'] Hostname: cn-237...

albertz

PT potential CUDA mem leak?

2

From log (`/work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.XPpeLPG9camH/log.run.1`), filtered the CUDA mem usage reports: ``` Memory usage (cuda): alloc cur 427.8MB alloc peak 427.8MB reserved cur 446.0MB reserved peak 446.0MB Memory usage (cuda): alloc cur...

albertz
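The filtering step mentioned in this issue can be sketched with a small parser for the report lines quoted above. The regular expression assumes exactly the line format shown in the snippet, with all values in MB; real logs may use other units, which this sketch does not handle. `parse_mem_reports` is a hypothetical helper name.

```python
import re

# Matches one report line in the format quoted in the issue snippet.
# Assumption: every value carries the unit "MB".
_MEM_RE = re.compile(
    r"Memory usage \((?P<device>\w+)\): "
    r"alloc cur (?P<alloc_cur>[\d.]+)MB alloc peak (?P<alloc_peak>[\d.]+)MB "
    r"reserved cur (?P<reserved_cur>[\d.]+)MB reserved peak (?P<reserved_peak>[\d.]+)MB"
)


def parse_mem_reports(log_text: str) -> list:
    """Extract all memory usage reports from a log as dicts of floats (MB)."""
    reports = []
    for match in _MEM_RE.finditer(log_text):
        fields = match.groupdict()
        device = fields.pop("device")
        report = {key: float(value) for key, value in fields.items()}
        report["device"] = device
        reports.append(report)
    return reports
```

Comparing `alloc_cur` of the first and last report then gives a first hint of whether allocated memory keeps growing over the run.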

Onnx: ceildiv yields wrong result during conversion

12

We had one particular problem when converting a Conformer acoustic model from TensorFlow to ONNX: Calculating the sequence lengths after convolution resulted in wrong calculations on ONNX side. The issue...

Gerstenberger
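For context, ceiling division on integers is commonly written in one of the two forms below. The sketch also shows how the negation-based form changes its result if the underlying integer division rounds toward zero instead of toward negative infinity. That is one plausible way such a discrepancy can arise, not a confirmed diagnosis of this issue, whose description is truncated here. All function names are illustrative.

```python
def ceildiv_neg(a: int, b: int) -> int:
    """Ceiling division via negated floor division: -(-a // b)."""
    return -(-a // b)


def ceildiv_pos(a: int, b: int) -> int:
    """Ceiling division for a >= 0, b > 0 without negative operands."""
    return (a + b - 1) // b


def trunc_div(a: int, b: int) -> int:
    """Integer division that rounds toward zero (C-style)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def ceildiv_neg_with_trunc(a: int, b: int) -> int:
    """Same formula as ceildiv_neg, but evaluated with truncating division."""
    return -trunc_div(-a, b)


# A length of 7 frames with stride 2 should give ceil(7 / 2) = 4 frames.
print(ceildiv_neg(7, 2), ceildiv_pos(7, 2), ceildiv_neg_with_trunc(7, 2))
```

With truncating division, the negation-based formula silently degrades to floor division for positive inputs, which is why the `(a + b - 1) // b` form is often preferred when the division semantics of the target runtime are uncertain.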

PT DistributedDataParallel with mixed precision training

5

I noticed that the `DistributedDataParallel` module has the option `mixed_precision` which is for mixed precision training. We don't use that, even if the user specifies `torch_amp` to use mixed precision....

albertz

MultiGPU
