Albert Zeyer issues

Results 300 issues of Albert Zeyer

PyTorch distributed training CPU OOM with sync_on_cpu

It trains fine for a while, and then often I get a CPU OOM, which looks like: ``` [2024-01-04 11:41:05,662] INFO: Start Job: Job Task: run ... RETURNN starting up,...

PyTorch pretraining, `train_step_callback`, staged training

(As initially discussed in #1120.) How to handle pretraining? The current suggested APIs (`get_model` and co) might need to be changed, because we do not want to call `get_model` every...

PyTorch ONNX export

This issue is to track any aspects and issues on PyTorch (#1120) ONNX export.

* [x] Working script for conversion (`export_to_onnx.py`)
* [x] Fix issue with convolution
* [x] Rename...

PyTorch

PyTorch automatic inf/nan detection, collecting statistics

`autograd.detect_anomaly` detects inf/nan in the backward pass. I want to have the same in the forward pass. With the possibility to whitelist a few special operations, modules or code blocks,...

PyTorch collect model statistics

We should collect some statistics (optionally and configurable, and maybe only every N steps if collecting them every step is too costly) of:

* weights
* activations
* gradients of weights
* gradients of activations...

PyTorch CUDA OOM in distributed training

``` RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-13-21-05 (UTC+0000), pid 2003528, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.lmbYlKeoU6kT/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.lmbYlKeoU6kT/output/returnn.config'] ... Torch: Hostname cn-236, pid 2003531, using GPU 3....

PyTorch distributed training, could not unlink the shared memory file

``` [2023-12-31 11:33:54,580] INFO: Start Job: Job Task: run ... RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-11-34-07 (UTC+0000), pid 1868636, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/output/returnn.config'] Hostname: cn-237...

PT potential CUDA mem leak?

From log (`/work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.XPpeLPG9camH/log.run.1`), filtered the CUDA mem usage reports: ``` Memory usage (cuda): alloc cur 427.8MB alloc peak 427.8MB reserved cur 446.0MB reserved peak 446.0MB Memory usage (cuda): alloc cur...

PT DistributedDataParallel with mixed precision training

I noticed that the `DistributedDataParallel` module has the option `mixed_precision` which is for mixed precision training. We don't use that, even if the user specifies `torch_amp` to use mixed precision....

MultiGPU

PyTorch distributed: eval distributed as well

It's not really so difficult: Just split the dataset (that's the trickiest part), and then let each worker do eval, and correctly accumulate the results.

MultiGPU
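
The distributed eval idea above (split the dataset, let each worker evaluate, accumulate the results) can be sketched roughly as follows. This is a minimal, framework-free illustration and not RETURNN's implementation. The helper names `shard_indices`, `eval_shard` and `accumulate` are hypothetical, and the strided split is only one possible way to partition the data. In a real multi-GPU run, the final combination step would presumably be a collective such as `torch.distributed.all_reduce` over the per-worker sums and counts.

```python
from typing import Callable, List, Sequence, Tuple


def shard_indices(num_seqs: int, rank: int, world_size: int) -> List[int]:
    """Strided split (assumption): worker `rank` gets rank, rank + world_size, ..."""
    return list(range(rank, num_seqs, world_size))


def eval_shard(
    dataset: Sequence[float],
    indices: Sequence[int],
    loss_fn: Callable[[float], float],
) -> Tuple[float, int]:
    """Evaluate one worker's shard; return (summed loss, number of sequences)."""
    total = 0.0
    for i in indices:
        total += loss_fn(dataset[i])
    return total, len(indices)


def accumulate(partials: Sequence[Tuple[float, int]]) -> float:
    """Combine per-worker (sum, count) pairs into the global mean loss.

    Sums and counts are reduced separately, because averaging the
    per-worker means would be wrong when the shards differ in size.
    """
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else float("nan")


if __name__ == "__main__":
    # Toy data: the "loss" of each sequence is just its value.
    toy_dataset = [1.0, 2.0, 3.0, 4.0, 5.0]
    world_size = 2
    partials = [
        eval_shard(toy_dataset, shard_indices(len(toy_dataset), rank, world_size), lambda x: x)
        for rank in range(world_size)
    ]
    print(accumulate(partials))
```

Here the workers are simulated in a single process by looping over `rank`. The point of the sketch is only that every sequence is evaluated by exactly one worker and that the reduction uses sums and counts rather than per-worker averages.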