Bonito model training using WSL2: "RuntimeError: CUDA error: unknown error"
Hello everybody,
Windows 11 and Windows 10 (version 21H2) support PyTorch with NVIDIA CUDA GPU acceleration inside WSL2 (https://docs.microsoft.com/de-de/windows/ai/directml/gpu-cuda-in-wsl), so we wanted to test running Bonito there. Basecalling worked perfectly with the following command:
bonito basecaller [email protected] --recursive Fast5_files/ > basecalls.fastq
The Ubuntu 20.04.4 LTS app had to be opened with administrator privileges to get it started.
As a next step we wanted to train a pre-existing model with our own data. First we re-basecalled the data using the following command:
bonito basecaller [email protected] --save-ctc --reference /home/domi/reference_genomes/reference.mmi /home/domi/bonito_training/fast5 > /home/domi/bonito_training/basecalls.sam
It also worked perfectly.
However, running the following command resulted in a CUDA error:
bonito train --epochs 1 --lr 5e-4 --pretrained [email protected] --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model
[loading data]
[validation set not found: splitting training set]
[loading model]
[using pretrained model [email protected]]
[0/166161]: 0%| | [00:00]
Traceback (most recent call last):
  File "/home/domi/.local/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 97, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 210, in fit
    train_loss, duration = self.train_one_epoch(loss_log, lr_scheduler)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 135, in train_one_epoch
    losses, grad_norm = self.train_one_step(batch)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 98, in train_one_step
    scores_ = self.model(data_)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/crf/model.py", line 166, in forward
    return self.encoder(x)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/nn.py", line 41, in forward
    return super().forward(x)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/nn.py", line 178, in forward
    y, h = self.rnn(x)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 691, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Running CUDA_LAUNCH_BLOCKING=1 bonito train --epochs 1 --lr 5e-4 --pretrained [email protected] --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model
did not solve the issue.
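To check whether the failure is specific to Bonito or already occurs in plain PyTorch under WSL2, a small standalone test may help. The following is only a minimal sketch: the function name lstm_smoke_test and the tensor sizes are my own assumptions, not anything taken from Bonito. It runs a tiny LSTM forward and backward pass on the GPU, which is the same kind of call that fails in the traceback above.

```python
# Minimal sketch: run a tiny LSTM on the GPU, outside of Bonito.
# Assumption: the sizes below are arbitrary and NOT Bonito's real model shapes.
import os

# Set before CUDA is initialised so kernel errors are reported at the failing call.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def lstm_smoke_test(batch=32, steps=100, features=64):
    """Return a dict describing whether a small LSTM runs on the GPU."""
    report = {"torch": None, "cuda_available": False, "ok": False, "error": None}
    try:
        import torch
    except ImportError as exc:
        report["error"] = f"PyTorch not importable: {exc}"
        return report

    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        report["error"] = "torch.cuda.is_available() returned False"
        return report

    try:
        rnn = torch.nn.LSTM(features, features).cuda()
        x = torch.randn(steps, batch, features, device="cuda")
        y, _ = rnn(x)             # forward pass through the cuDNN LSTM
        y.sum().backward()        # exercise the backward kernels as well
        torch.cuda.synchronize()  # force any asynchronous error to surface here
        report["ok"] = True
    except RuntimeError as exc:
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    print(lstm_smoke_test())
```

If this reports ok=True while bonito train still fails, the problem is more likely in the Bonito training path than in the basic WSL2 CUDA setup; if it fails with the same message, the driver or PyTorch installation is the first thing to look at.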
The error is probably related to issue #233, which was reported earlier.
I am thankful for any suggestions on how to solve this issue.
I use the following GPU: NVIDIA GeForce RTX 3060 Ti.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 512.15       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   47C    P8    12W / 220W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
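For anyone comparing setups, the version numbers in the nvidia-smi header can be pulled out programmatically. This is just a small sketch; parse_smi_header and the regular expression are my own helper, not part of any NVIDIA or Bonito tooling. Note that the "CUDA Version" printed by nvidia-smi is the highest CUDA version the driver supports, which is not necessarily the version PyTorch was built against.

```python
# Minimal sketch: extract the versions from the nvidia-smi header line.
# Assumption: parse_smi_header is a hypothetical helper written for this thread.
import re

HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(text):
    """Return the nvidia-smi, driver and CUDA versions as strings, or None."""
    match = HEADER_RE.search(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    header = "| NVIDIA-SMI 510.54 Driver Version: 512.15 CUDA Version: 11.6 |"
    print(parse_smi_header(header))
```

The parsed "cuda" value can then be compared by hand with torch.version.cuda inside the same WSL2 environment.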
Update: I was able to exactly reproduce the error described in issue #233 by empirically lowering the batch size to 19.
bonito train --epochs 1 --lr 5e-4 --pretrained [email protected] --directory /home/domi/bonito_training/ctc-data /home/domi/bonito_training/fine-tuned-model -f --batch 19
[loading data]
[validation set not found: splitting training set]
[loading model]
[using pretrained model [email protected]]
[0/166161]: 0%| | [00:00]
Error - an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/domi/.local/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 97, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 210, in fit
    train_loss, duration = self.train_one_epoch(loss_log, lr_scheduler)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 135, in train_one_epoch
    losses, grad_norm = self.train_one_step(batch)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/training.py", line 99, in train_one_step
    losses_ = self.criterion(scores_, targets_, lengths_)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/crf/model.py", line 177, in loss
    return self.seqdist.ctc_loss(scores.to(torch.float32), targets, target_lengths, **kwargs)
  File "/home/domi/.local/lib/python3.8/site-packages/bonito/crf/model.py", line 122, in ctc_loss
    logz = logZ_cu(stay_scores, move_scores, target_lengths + 1 - self.state_len)
  File "/home/domi/.local/lib/python3.8/site-packages/koi/ctc.py", line 115, in logZ_cu
    return LogZ.apply(stay_scores, move_scores, target_lengths, _simple_lattice_fwd_bwd_cu, S)
  File "/home/domi/.local/lib/python3.8/site-packages/koi/ctc.py", line 53, in forward
    g = S.dsum(torch.cat([S.mul(alpha[:-1], beta_stay), S.mul(alpha[:-1], beta_move)], dim=2), dim=2)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Batch sizes smaller than 19 also resulted in the same error, whereas batch sizes greater than 20 resulted in the "RuntimeError: CUDA error: unknown error" described in my original post.
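The batch-size comparison above can be scripted so that others can repeat it on their own hardware. The following is a rough sketch under stated assumptions: build_command, classify and sweep are hypothetical helpers of mine, and the only bonito train options used are the ones already shown in this thread. The model name and paths are passed in as parameters rather than hard-coded.

```python
# Minimal sketch: run bonito train with several batch sizes and record which
# of the two CUDA errors quoted in this thread (if any) appears.
# Assumption: helper names are hypothetical; flags are those used in the thread.
import subprocess

SIGNATURES = {
    "illegal_memory_access": "an illegal memory access was encountered",
    "unknown_error": "CUDA error: unknown error",
}


def build_command(batch, pretrained, data_dir, out_dir):
    """Assemble the bonito train call from this thread for one batch size."""
    return [
        "bonito", "train", "--epochs", "1", "--lr", "5e-4",
        "--pretrained", pretrained, "--directory", data_dir,
        out_dir, "-f", "--batch", str(batch),
    ]


def classify(output):
    """Map captured program output to one of the known error labels."""
    for label, needle in SIGNATURES.items():
        if needle in output:
            return label
    return "other"


def sweep(batches, pretrained, data_dir, out_dir, run=subprocess.run):
    """Return {batch size: 'ok' or an error label} for each batch size."""
    results = {}
    for batch in batches:
        proc = run(build_command(batch, pretrained, data_dir, out_dir),
                   capture_output=True, text=True)
        if proc.returncode == 0:
            results[batch] = "ok"
        else:
            results[batch] = classify(proc.stdout + proc.stderr)
    return results
```

Calling sweep([8, 19, 32], model, "ctc-data", "fine-tuned-model") would run one training attempt per batch size. Be aware that a batch size that does not crash will train for a full epoch before returning.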
I got the same error running Bonito train with the following basic command:
bonito train --epochs 1 --lr 5e-4 --pretrained [email protected] --directory ctc-data fine-tuned-model
on WSL2 Ubuntu 20.04.4 LTS, Windows 10 21H2, build 19044.1706. Did you manage to solve this issue?
My CUDA compilation tools are at release V11.7.64, and I am using the following GPU: NVIDIA GeForce RTX 3080 10 GB.
@N0toriou5 Unfortunately, I have not been able to solve the problem yet. I am still hoping for a solution from ONT, as I have the feeling that many users are experiencing the same problem.
@jhammery @N0toriou5 this issue is now fixed (see my comment in #275 for instructions on how to get a version of Bonito which solves the problem - you will need to install Bonito from source but this is simple).