
Multi GPU train fails

Open usamec opened this issue 3 years ago • 12 comments

Log:

(bonitox) usamec@bonito3train:~$ bonito train --multi-gpu bonitobase
[loading data]
[loading model]
[0/1184825]:   0%|                                                                         | [00:00]/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().


/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().


Traceback (most recent call last):
  File "/home/usamec/miniconda3/envs/bonitox/bin/bonito", line 33, in <module>
    sys.exit(load_entry_point('ont-bonito', 'console_scripts', 'bonito')())
  File "/home/usamec/bonito/bonito/__init__.py", line 39, in main
    args.func(args)
  File "/home/usamec/bonito/bonito/cli/train.py", line 81, in main
    use_amp=args.amp, lr_scheduler=lr_scheduler
  File "/home/usamec/bonito/bonito/training.py", line 160, in train
    losses = criterion(log_probs, targets.to(device), lengths.to(device))
  File "/home/usamec/bonito/bonito/training.py", line 130, in ctc_label_smoothing_loss
    loss = ctc_loss(log_probs.to(torch.float32), targets, log_probs_lengths, lengths, reduction='mean')
  File "/home/usamec/miniconda3/envs/bonitox/lib/python3.7/site-packages/torch-1.5.0-py3.7-linux-x86_64.egg/torch/nn/functional.py", line 2052, in ctc_loss
    zero_infinity)
RuntimeError: target_lengths must be of size batch_size

I ran it with the default dataset from the current master branch.

usamec avatar Nov 05 '20 14:11 usamec

Just noticed that it uses the wrong loss. This if hasattr(model, 'seqdist'): from https://github.com/nanoporetech/bonito/blob/07f885ee9a1c0fef66e8177f00615c12128f453d/bonito/cli/train.py#L71

should check model.module in the distributed setting, or the criterion should be decided before the DataParallel wrapper is applied.
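
A minimal sketch of the idea (illustrative code, not the actual train.py; pick_criterion is a hypothetical helper):

import torch.nn as nn

def pick_criterion(model):
    # hasattr() on a DataParallel wrapper checks the wrapper itself,
    # not the wrapped model, so unwrap it first.
    core = model.module if isinstance(model, nn.DataParallel) else model
    if hasattr(core, 'seqdist'):
        return core.seqdist.ctc_loss
    return core.ctc_label_smoothing_loss

# decide the loss before (or independently of) the multi-GPU wrapper
criterion = pick_criterion(model)
model = nn.DataParallel(model)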

usamec avatar Nov 05 '20 15:11 usamec

Yes, because the new models train so much faster, I hadn't used the multi-GPU training path with them - thanks for flagging.

iiSeymour avatar Nov 05 '20 15:11 iiSeymour

Yep, 4-5 days seems doable.

It also needs a patch from TNC to NTC dimensions (due to the dataloader behaviour). Never mind, I will fix it for myself :)
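
For reference, the dimension patch is just a transpose of the time and batch axes; a rough sketch with illustrative names, not bonito's own code:

# CTC loss expects (T, N, C) = (time, batch, classes), while nn.DataParallel
# scatters inputs along dim 0 and therefore wants the batch axis first: (N, T, C).
log_probs_ntc = log_probs_tnc.transpose(0, 1).contiguous()  # TNC -> NTC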

usamec avatar Nov 05 '20 17:11 usamec

With automatic mixed precision on a V100, a 768-wide model should take less than 8 hours for 5 epochs. Here's a training log on the currently available training data -

time                        duration  epoch  train_loss           validation_loss      validation_mean    validation_median
2020-10-27 19:44:30.334400  5292      1      0.11299467911922507  0.11333289820486338  95.155922819823    95.71865443425077
2020-10-27 21:19:22.062806  5299      2      0.0975474908561753   0.09649423862210951  95.85441846008399  96.47058823529412
2020-10-27 22:54:15.527969  5301      3      0.08439798653644677  0.08799622487439326  96.26646066241035  96.86411149825784
2020-10-28 00:29:12.069757  5304      4      0.07599978859276102  0.08264038991345576  96.50189359968385  97.10144927536231
2020-10-28 02:04:06.179902  5301      5      0.06826694875079303  0.0821038224518195   96.56739781833114  97.1590909090909

iiSeymour avatar Nov 05 '20 18:11 iiSeymour

Oh, I should install that AMP :)

usamec avatar Nov 05 '20 18:11 usamec

For sure! It's not on PyPI, which is a pain; however, the functionality has been upstreamed into PyTorch since 1.6.
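
For PyTorch >= 1.6 the built-in AMP pattern looks roughly like this (a generic training-loop sketch, not bonito's own loop):

import torch

scaler = torch.cuda.amp.GradScaler()

for data, targets, lengths in train_loader:
    optimizer.zero_grad()
    # run the forward pass and loss in mixed precision
    with torch.cuda.amp.autocast():
        log_probs = model(data.to(device))
        loss = criterion(log_probs, targets.to(device), lengths.to(device))
    # scale the loss, backprop, then unscale and step the optimizer
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()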

iiSeymour avatar Nov 05 '20 18:11 iiSeymour


Were you able to patch the multi-GPU fix? I need some help with this issue as well. I am getting the exact same error for multi-GPU training and don't have much experience with torch, so I would appreciate any help.

gulsumgudukbay avatar Dec 10 '20 18:12 gulsumgudukbay

I am also getting the same error when using the test data set. Is there a solution for this?

bonito train test --multi-gpu --directory /home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/models/dna_r9.4.1
[loading data]
[loading model]
[0/1221470]:   0%|                                                                         | [00:00]/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
/opt/conda/conda-bld/pytorch_1587428207430/work/aten/src/ATen/native/cudnn/RNN.cpp:1269: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().

Traceback (most recent call last):
  File "/home/callum/miniconda3/envs/bonito/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/__init__.py", line 39, in main
    args.func(args)
  File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/cli/train.py", line 89, in main
    train_loss, duration = train(
  File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/training.py", line 198, in train
    losses = criterion(log_probs, targets.to(device), lengths.to(device))
  File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/bonito/training.py", line 167, in ctc_label_smoothing_loss
    loss = ctc_loss(log_probs.to(torch.float32), targets, log_probs_lengths, lengths, reduction='mean')
  File "/home/callum/miniconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/functional.py", line 2051, in ctc_loss
    return torch.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank, _Reduction.get_enum(reduction),
RuntimeError: target_lengths must be of size batch_size

callumparr avatar Apr 01 '21 14:04 callumparr

Anyone else encounter this? Thanks.

loading model
~/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning: A100-PCIE-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the A100-PCIE-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "/home/torben/opt/anaconda3/envs/bonito/bin/bonito", line 33, in <module>
    sys.exit(load_entry_point('ont-bonito', 'console_scripts', 'bonito')())
  File "/home/torben/opt/bonito/bonito/__init__.py", line 39, in main
    args.func(args)
  File "/home/torben/opt/bonito/bonito/cli/basecaller.py", line 28, in main
    model = load_model(args.model_directory, args.device, weights=int(args.weights))
  File "/home/torben/opt/bonito/bonito/util.py", line 314, in load_model
    model.to(device)
  File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 186, in _apply
    self.flatten_parameters()
  File "/home/torben/opt/anaconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 172, in flatten_parameters
    torch._cudnn_rnn_flatten_weight(
RuntimeError: CUDA error: no kernel image is available for execution on the device

tnn111 avatar Apr 02 '21 02:04 tnn111

I found a solution. Here it is for posterity.

Create the usual separate environment for this.

i) Install the nightly build of PyTorch (version 1.9). I used version 11.1 of the toolkit, but you could probably get away with 11.2 as well.
ii) Fix seqdist to accept cupy-111 and install it with pip.
iii) Install bonito using the developer instructions.
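
A quick way to check whether the installed wheel actually supports the card (a sketch, assuming a PyTorch recent enough to expose torch.cuda.get_arch_list()):

import torch

# compute capability of the attached GPU, e.g. (8, 0) -> sm_80 for an A100
major, minor = torch.cuda.get_device_capability(0)
print("device:", torch.cuda.get_device_name(0), f"sm_{major}{minor}")

# architectures this PyTorch build was compiled for, e.g. ['sm_37', ..., 'sm_80']
print("build supports:", torch.cuda.get_arch_list())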

I’m basecalling now. Throughput is a bit below 100 reads/s, which is less than I’d hoped.

Anyone have suggestions for ways of making it faster on an A100?


tnn111 avatar Apr 02 '21 02:04 tnn111

Here’s the window for the PyTorch installation:


tnn111 avatar Apr 02 '21 02:04 tnn111

I'm also getting a multi-GPU training failure (v0.4.0, CUDA 11.1) with the following error message:

[loading data]
[validation set not found: splitting training set]
[loading model]
Traceback (most recent call last):
  File "/hps/software/users/iqbal/mbhall/miniconda3/envs/bonito/bin/bonito", line 33, in <module>
    sys.exit(load_entry_point('ont-bonito-cuda111', 'console_scripts', 'bonito')())
  File "/hps/nobackup/iqbal/mbhall/tubby/bonito/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/hps/nobackup/iqbal/mbhall/tubby/bonito/bonito/cli/train.py", line 83, in main
    trainer = Trainer(model, device, train_loader, valid_loader, use_amp=half_supported() and not args.no_amp)
  File "/hps/nobackup/iqbal/mbhall/tubby/bonito/bonito/training.py", line 107, in __init__
    self.criterion = criterion or (model.seqdist.ctc_loss if hasattr(model, 'seqdist') else model.ctc_label_smoothing_loss)
  File "/hps/software/users/iqbal/mbhall/miniconda3/envs/bonito/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'ctc_label_smoothing_loss'
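
The failing line is the same criterion selection discussed above; a rough sketch of the kind of unwrapping that would avoid the AttributeError (illustrative only, not the actual Trainer code):

# attribute lookups have to go through .module once the model is wrapped
core = model.module if hasattr(model, 'module') else model
self.criterion = criterion or (
    core.seqdist.ctc_loss if hasattr(core, 'seqdist')
    else core.ctc_label_smoothing_loss
)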

mbhall88 avatar Nov 11 '21 05:11 mbhall88