bonito
Multi-GPU training fails
Log:
(bonitox) usamec@bonito3train:~$ bonito train --multi-gpu bonitobase
[loading data]
[loading model]
[0/1184825]: 0%| | [00:00]/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
Traceback (most recent call last):
File "/home/usamec/miniconda3/envs/bonitox/bin/bonito", line 33, in <module>
sys.exit(load_entry_point('ont-bonito', 'console_scripts', 'bonito')())
File "/home/usamec/bonito/bonito/__init__.py", line 39, in main
args.func(args)
File "/home/usamec/bonito/bonito/cli/train.py", line 81, in main
use_amp=args.amp, lr_scheduler=lr_scheduler
File "/home/usamec/bonito/bonito/training.py", line 160, in train
losses = criterion(log_probs, targets.to(device), lengths.to(device))
File "/home/usamec/bonito/bonito/training.py", line 130, in ctc_label_smoothing_loss
loss = ctc_loss(log_probs.to(torch.float32), targets, log_probs_lengths, lengths, reduction='mean')
File "/home/usamec/miniconda3/envs/bonitox/lib/python3.7/site-packages/torch-1.5.0-py3.7-linux-x86_64.egg/torch/nn/functional.py", line 2052, in ctc_loss
zero_infinity)
RuntimeError: target_lengths must be of size batch_size
I ran it with the default dataset from the current master branch.
Just noticed that it uses the wrong loss. This `if hasattr(model, 'seqdist'):` check from https://github.com/nanoporetech/bonito/blob/07f885ee9a1c0fef66e8177f00615c12128f453d/bonito/cli/train.py#L71 should look at `model.module` in the distributed setting, or the criterion should be decided before the model is wrapped in DataParallel.
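A minimal sketch of that fix (hypothetical helper names, not bonito's actual code): unwrap the DataParallel container before the `hasattr` check, or equivalently pick the criterion before wrapping the model.

```python
# Hypothetical sketch of the fix described above, not bonito's actual code.
# torch.nn.DataParallel keeps the real model in `.module`, so an attribute
# check like hasattr(model, 'seqdist') is always False on the wrapper.

def unwrap(model):
    """Return the underlying model if it is wrapped (DataParallel-style)."""
    # Crude heuristic for a sketch: wrappers expose the inner model as .module.
    return model.module if hasattr(model, "module") else model

def pick_criterion(model):
    """Choose the loss from the unwrapped model, so the decision is the
    same whether or not DataParallel is in place."""
    base = unwrap(model)
    if hasattr(base, "seqdist"):
        return base.seqdist.ctc_loss
    return base.ctc_label_smoothing_loss
```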
Yes, because the new models train so much faster I hadn't used the multi-GPU training path with them - thanks for flagging.
Yep, 4-5 days seem doable.
Also, it needs a patch from TNC to NTC dimensions (due to the dataloader behaviour). Never mind, I will fix it for myself :)
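For reference, the TNC to NTC rearrangement mentioned here is just a swap of the first two tensor dimensions (a generic sketch, not bonito's code):

```python
import torch

# TNC layout is (time, batch, channels); NTC is (batch, time, channels),
# so converting between them is a permutation of the first two axes.
t, n, c = 10, 4, 5
x_tnc = torch.randn(t, n, c)
x_ntc = x_tnc.permute(1, 0, 2).contiguous()  # swap the time and batch axes
```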
With automatic mixed precision on a V100, a 768-wide model should take less than 8 hours for 5 epochs. Here's a training log on the currently available training data -
time duration epoch train_loss validation_loss validation_mean validation_median
2020-10-27 19:44:30.334400 5292 1 0.11299467911922507 0.11333289820486338 95.155922819823 95.71865443425077
2020-10-27 21:19:22.062806 5299 2 0.0975474908561753 0.09649423862210951 95.85441846008399 96.47058823529412
2020-10-27 22:54:15.527969 5301 3 0.08439798653644677 0.08799622487439326 96.26646066241035 96.86411149825784
2020-10-28 00:29:12.069757 5304 4 0.07599978859276102 0.08264038991345576 96.50189359968385 97.10144927536231
2020-10-28 02:04:06.179902 5301 5 0.06826694875079303 0.0821038224518195 96.56739781833114 97.1590909090909
Oh, I should install that AMP :)
For sure! It's not on PyPI, which is a pain; however, the functionality has been upstreamed into PyTorch since 1.6.
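For anyone following along, a minimal sketch of the native AMP API that was upstreamed in PyTorch 1.6 (the tiny model, optimiser, and data here are placeholders, not bonito's training loop):

```python
import torch
import torch.nn as nn

# Placeholder model/optimiser/data, just to show the shape of the loop.
model = nn.Linear(8, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
use_cuda = torch.cuda.is_available()

scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
for _ in range(3):
    data, targets = torch.randn(16, 8), torch.randn(16, 4)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_cuda):  # mixed-precision forward
        loss = criterion(model(data), targets)
    scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, then optimizer.step()
    scaler.update()                # adjusts the scale factor for the next step
```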
Were you able to apply the multi-GPU fix? I need some help with this issue as well. I am getting the exact same error for multi-GPU training and don't have much experience with torch, so I would appreciate any help.
I am also getting the same error when using the test data set. Is there a solution for this?
bonito train test --multi-gpu --directory /home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/models/dna_r9.4.1
[loading data]
[loading model]
[0/1221470]: 0%| | [00:00]/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
Traceback (most recent call last):
File "/home/callum/miniconda3/envs/bonito/bin/bonito", line 8, in <module>
sys.exit(main())
File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/__init__.py", line 39, in main
args.func(args)
File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/cli/train.py", line 89, in main
train_loss, duration = train(
File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/training.py", line 198, in train
losses = criterion(log_probs, targets.to(device), lengths.to(device))
File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/training.py", line 167, in ctc_label_smoothing_loss
loss = ctc_loss(log_probs.to(torch.float32), targets, log_probs_lengths, lengths, reduction='mean')
File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/functional.py", line 2051, in ctc_loss
return torch.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank, _Reduction.get_enum(reduction),
RuntimeError: target_lengths must be of size batch_size
Anyone else encounter this? Thanks.
loading model ~/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning: A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
File "/home/torben/opt/anaconda3/envs/bonito/bin/bonito", line 33, in <module>
I found a solution. Here it is for posterity.
Create the usual separate environment for this.
i) Install the nightly build of PyTorch (version 1.9). I used version 11.1 of the toolkit, but you could probably get away with 11.2 as well.
ii) Fix seqdist to accept cupy-111 and install it with pip.
iii) Install bonito using the developer instructions.
I’m basecalling now. Throughput is a bit below 100 reads/s which is less than I’d hoped.
Anyone have suggestions for ways of making it faster on an A100?
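One quick way to diagnose this class of failure is to compare the CUDA architectures the installed PyTorch build ships kernels for with each GPU's compute capability (a generic sketch using recent PyTorch APIs; for an A100, sm_80 must appear in the build's arch list):

```python
import torch

def cuda_compat_report():
    """Return the CUDA architectures this PyTorch build ships kernels for,
    plus the compute capability of each visible GPU. If a device's sm_XX
    is missing from build_archs, kernels cannot run on that device."""
    if not torch.cuda.is_available():
        return {"build_archs": [], "devices": []}
    return {
        "build_archs": torch.cuda.get_arch_list(),
        "devices": [
            "sm_%d%d" % torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())
        ],
    }
```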
On Apr 1, 2021, at 19:14, Torben Nielsen @.***> wrote:
Anyone else encounter this? Thanks.
loading model ~/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning: A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
File "/home/torben/opt/anaconda3/envs/bonito/bin/bonito", line 33, in <module>
sys.exit(load_entry_point('ont-bonito', 'console_scripts', 'bonito')())
File "/home/torben/opt/bonito/bonito/__init__.py", line 39, in main
args.func(args)
File "/home/torben/opt/bonito/bonito/cli/basecaller.py", line 28, in main
model = load_model(args.model_directory, args.device, weights=int(args.weights))
File "/home/torben/opt/bonito/bonito/util.py", line 314, in load_model
model.to(device)
File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
return self._apply(convert)
File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 186, in _apply
self.flatten_parameters()
File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 172, in flatten_parameters
torch._cudnn_rnn_flatten_weight(
RuntimeError: CUDA error: no kernel image is available for execution on the device
I'm also getting a multi-GPU training failure (v0.4.0, CUDA 11.1), with the following error message:
[loading data]
[validation set not found: splitting training set]
[loading model]
Traceback (most recent call last):
File "/hps/software/users/iqbal/mbhall/miniconda3/envs/bonito/bin/bonito", line 33, in <module>
sys.exit(load_entry_point('ont-bonito-cuda111', 'console_scripts', 'bonito')())
File "/hps/nobackup/iqbal/mbhall/tubby/bonito/bonito/__init__.py", line 34, in main
args.func(args)
File "/hps/nobackup/iqbal/mbhall/tubby/bonito/bonito/cli/train.py", line 83, in main
trainer = Trainer(model, device, train_loader, valid_loader, use_amp=half_supported() and not args.no_amp)
File "/hps/nobackup/iqbal/mbhall/tubby/bonito/bonito/training.py", line 107, in __init__
self.criterion = criterion or (model.seqdist.ctc_loss if hasattr(model, 'seqdist') else model.ctc_label_smoothing_loss)
File "/hps/software/users/iqbal/mbhall/miniconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'ctc_label_smoothing_loss'