Albert Zeyer
I reproduced the problem in a simple script: https://github.com/albertz/playground/blob/master/test-torch-stft-cufft-internal-error.py It often happens when I allocate most of the GPU memory and then do a couple of `torch.stft` calls. This is the same...
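The repro shape described above can be sketched roughly as follows. This is only an illustration of the pattern (fill most of device memory, then call `torch.stft` a few times), not the actual linked script; all sizes here are made up.

```python
# Illustrative sketch of the repro pattern, not the actual linked script.
try:
    import torch
    have_torch = torch.cuda.is_available()
except ImportError:
    have_torch = False

def repro():
    """Allocate most free GPU memory, then run a few torch.stft calls.

    cuFFT then has little workspace left, which is the situation where
    the internal error was observed."""
    assert have_torch
    dev = torch.device("cuda")
    free, total = torch.cuda.mem_get_info(dev)
    # Grab ~90% of the currently free memory (float32 = 4 bytes).
    filler = torch.empty(int(free * 0.9) // 4, dtype=torch.float32, device=dev)
    x = torch.randn(16, 16000, device=dev)
    y = None
    for _ in range(5):
        y = torch.stft(x, n_fft=512, return_complex=True)
    return y
```

Note this only defines the function; actually calling `repro()` on an affected setup is expected to crash, which is the point of the reproduction.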
This is very deterministic: when I restart, I get exactly the same crash at exactly the same point, also on other nodes.
Potentially related: https://github.com/pytorch/pytorch/issues/116177 https://github.com/NVlabs/tiny-cuda-nn/issues/387 https://github.com/NVIDIA/nccl/issues/962
I also get the same problem with the Gloo backend, i.e. also CUDA OOM, although then it crashes in a different way, with an abort. ``` ... ep 1 train, step...
> This is very deterministic: when I restart, I get exactly the same crash at exactly the same point, also on other nodes.

I realized this is using `"torch_distributed": {"reduce_type": "param",...
One workaround is the newly introduced `torch_distributed` `sync_on_cpu=True` option, which first moves all params to CPU, then does the sync (which would use Gloo on CPU), then moves it...
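The CPU-sync workaround can be sketched like this. The helper name and structure are illustrative, not RETURNN's actual `sync_on_cpu` implementation; the idea is that the reduction happens entirely on CPU via Gloo, so no extra GPU buffers are needed for it.

```python
# Hedged sketch of a sync_on_cpu-style parameter sync (illustrative,
# not RETURNN's actual code): move each param to CPU, all_reduce there
# via Gloo, average, then copy the result back to the GPU.
try:
    import torch
    import torch.distributed as dist
except ImportError:
    torch = None

def sync_params_on_cpu(module, world_size):
    assert torch is not None
    with torch.no_grad():
        for param in module.parameters():
            cpu_param = param.detach().cpu()         # GPU -> CPU
            dist.all_reduce(cpu_param)               # Gloo reduce on CPU
            cpu_param /= world_size                  # average across workers
            param.copy_(cpu_param.to(param.device))  # CPU -> GPU
```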
Note, the 1080 has 10.9GB of memory, while the parameters take only 615.9MB. The `all_reduce` is in blocking mode (just the default), and we do this separately for...
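Quick arithmetic on the numbers above, to make the imbalance explicit: the parameters are only a small fraction of the device memory, so an OOM during the sync is surprising.

```python
# Parameters vs. total GPU memory, from the numbers reported above.
total_mb = 10.9 * 1024  # ~10.9 GB reported for the 1080, in MB
params_mb = 615.9       # reported total parameter size in MB
fraction = params_mb / total_mb
print(f"params are {fraction:.1%} of device memory")  # ~5.5%
```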
I also asked in the forums: https://discuss.pytorch.org/t/cuda-oom-in-distributed-training-without-nvlink/194704
Note, in the [PYTORCH_CUDA_ALLOC_CONF doc](https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), the `expandable_segments` option might help us: > If set to True, this setting instructs the allocator to create CUDA allocations that can later be...
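For reference, the option is set via the environment variable, and it must be set before PyTorch initializes CUDA (e.g. at the very top of the entry script, or in the shell before launching). Whether it actually helps in this case is an open question.

```python
# Enable expandable segments in the caching allocator.
# Must be set before torch touches CUDA for the first time.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```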
I introduced the option `reset_dev_memory_caches`, which calls `gc.collect()` and then `torch.cuda.empty_cache()`. Before (via `/work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.yr9RPZ4KpDXG/log.run.1`, `alias/exp2023_04_25_rf/chunked_aed_import/chunk-C20-R15-H2-bs22k/train`): ``` Memory usage (cuda): alloc cur 427.8MB alloc peak 427.8MB reserved cur 446.0MB reserved peak...
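A minimal sketch of what such a cache-reset helper does (the name mirrors the option above; this is not RETURNN's exact code): collect Python garbage first so stale tensor references are dropped, then return the allocator's cached blocks to the driver.

```python
# Hedged sketch of a reset_dev_memory_caches-style helper (illustrative).
import gc

def reset_dev_memory_caches():
    gc.collect()  # drop unreachable Python objects holding tensors
    try:
        import torch
        if torch.cuda.is_available():
            # Release cached, currently-unused blocks back to the CUDA driver,
            # lowering the "reserved" numbers seen in the memory report above.
            torch.cuda.empty_cache()
    except ImportError:
        pass
```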