RuntimeError: CUDA error: an illegal memory access was encountered
Hello!
I have just run into a CUDA error.
I installed bonito with: pip install -f https://download.pytorch.org/whl/torch_stable.html ont-bonito-cuda113
Here is the command I used:
bonito train --epochs 20 --pretrained [email protected] --batch 10 --directory training/ctc-data training/fine_tuned_model
Here is the full traceback:
[loading data]
[validation set not found: splitting training set]
[loading model]
[using pretrained model [email protected]]
[0/53]:   0%| | [00:00]
Error - an illegal memory access was encountered
Traceback (most recent call last):
  File "/data/home/R421/anaconda3/envs/bonito/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/cli/train.py", line 97, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/training.py", line 210, in fit
    train_loss, duration = self.train_one_epoch(loss_log, lr_scheduler)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/training.py", line 135, in train_one_epoch
    losses, grad_norm = self.train_one_step(batch)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/training.py", line 99, in train_one_step
    losses_ = self.criterion(scores_, targets_, lengths_)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/crf/model.py", line 177, in loss
    return self.seqdist.ctc_loss(scores.to(torch.float32), targets, target_lengths, **kwargs)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/crf/model.py", line 122, in ctc_loss
    logz = logZ_cu(stay_scores, move_scores, target_lengths + 1 - self.state_len)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/koi/ctc.py", line 115, in logZ_cu
    return LogZ.apply(stay_scores, move_scores, target_lengths, _simple_lattice_fwd_bwd_cu, S)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/koi/ctc.py", line 53, in forward
    g = S.dsum(torch.cat([S.mul(alpha[:-1], beta_stay), S.mul(alpha[:-1], beta_move)], dim=2), dim=2)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
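If it helps, the traceback itself suggests passing CUDA_LAUNCH_BLOCKING=1 so the kernel launch that faults is reported synchronously. A minimal sketch of re-running the same command that way (it only wraps the command from above in a subprocess with the environment variable set):

```python
import os
import subprocess

# Re-run the same bonito train command with synchronous CUDA kernel launches,
# so the reported stack trace points at the kernel that actually faulted.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
subprocess.run(
    [
        "bonito", "train",
        "--epochs", "20",
        "--pretrained", "[email protected]",
        "--batch", "10",
        "--directory", "training/ctc-data",
        "training/fine_tuned_model",
    ],
    env=env,
    check=True,
)
```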
I am using an RTX 3090 with driver version 470.103.01 and CUDA 11.3; here is the nvcc output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
Are there any suggestions? Thanks!
I'm getting the same error from bonito train.
I installed the latest version of bonito:
pip install --upgrade pip
pip install -f https://download.pytorch.org/whl/torch_stable.html ont-bonito-cuda113
bonito --version
bonito 0.5.1
I'm training a model using my own reads. After the basecaller step that included --save-ctc, I ran the following:
bonito train --directory ./training/ctc-data training/model-dir
Note: basecalling wouldn't work on our P100 devices but ran successfully on a V100.
All of our cluster's V100s have CUDA 11.5. Is there forward compatibility for the 11.3 build? CUDA 11.5 doesn't seem to be the issue, though, as @89213385 (https://github.com/89213385) is using CUDA 11.3 and encountering the same error.
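As a sanity check on the forward-compatibility question, this is roughly what I would run to confirm that the cuda113 PyTorch build can see the device under this driver (plain torch calls, nothing bonito-specific):

```python
import torch

# Quick check that the CUDA 11.3 PyTorch build runs under the installed driver.
print("torch", torch.__version__, "built against CUDA", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0),
          "capability:", torch.cuda.get_device_capability(0))
    # A tiny kernel launch; a driver/toolkit mismatch would usually
    # fail here, long before bonito's CTC loss is reached.
    x = torch.randn(1024, 1024, device="cuda")
    print("matmul ok:", (x @ x).sum().item())
```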
More device/driver info:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
nvidia-smi
Wed Feb 23 16:17:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:88:00.0 Off | 0 |
| N/A 28C P0 26W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Any suggestions? Thank you.
I am encountering the same error with a 1080 Ti, with the following specs:
Fri Feb 25 10:38:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:02:00.0 Off | N/A |
| 27% 29C P8 6W / 180W | 1451MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:03:00.0 Off | N/A |
| 28% 29C P8 7W / 180W | 109MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... On | 00000000:81:00.0 Off | N/A |
| 27% 25C P8 7W / 180W | 4MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... On | 00000000:82:00.0 Off | N/A |
| 27% 26C P8 6W / 180W | 4MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Exact same error as the OP on a Tesla V100 (CUDA 10.2) - specifically this line:
g = S.dsum(torch.cat([S.mul(alpha[:-1], beta_stay), S.mul(alpha[:-1], beta_move)], dim=2), dim=2)
followed by:
RuntimeError: CUDA error: an illegal memory access was encountered
My command: bonito train --directory converted_hdf5/ --config bonito/bonito/models/configs/[email protected] custom_model_dir
Hi @tcb72, @chrishendra93, @LRFreeborn, @89213385 - would it be possible for one of you to provide data for me to reproduce this problem?
Hello @vellamike, the data I used is partition 225 from https://github.com/marbl/CHM13.
I tried to repeat the model training described in https://www.biorxiv.org/content/10.1101/2022.01.11.475254v1.full
First, I ran bonito basecaller [email protected] partitions225_fast5 > partitions225.fastq
Next, I used minimap2 to check the location of the reads on the chm13_v1.1 reference genome and filtered for the reads located within 10 kb of the head or tail of any chromosome.
Last, I extracted the single fast5s of those reads, merged them into one fast5 file, ran bonito basecaller and bonito train, and then got the error.
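For reference, the filtering step was roughly equivalent to the sketch below. I actually ran minimap2 on the command line, so mappy and the file names here are only placeholders for illustration:

```python
import mappy

# Map the basecalled reads to chm13 v1.1 and collect read IDs whose primary
# alignment starts or ends within 10 kb of a chromosome end.
aligner = mappy.Aligner("chm13.v1.1.fasta", preset="map-ont")
keep = set()
for name, seq, _qual in mappy.fastx_read("partitions225.fastq"):
    for hit in aligner.map(seq):
        if not hit.is_primary:
            continue
        near_head = hit.r_st < 10_000
        near_tail = hit.r_en > hit.ctg_len - 10_000
        if near_head or near_tail:
            keep.add(name)
        break  # only consider the primary alignment

# Write the read IDs, to be used for extracting the matching single fast5s.
with open("read_ids_to_keep.txt", "w") as out:
    out.write("\n".join(sorted(keep)) + "\n")
```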
Over the past few days, I have tried training with the full sample, without filtering the reads, and got no errors. Perhaps the error is due to too few sample reads being given?
Thank you.
Same exact error here.
And there is another thing that confuses me.
Since @89213385 mentioned the number of reads, I checked the output of bonito basecaller --save-ctc.
The training set contains ~800 fast5 files with 4000 reads per fast5, yet the resulting sam/tsv/npy files only contain ~9000 records.
Meanwhile, when I ran bonito basecaller without the --save-ctc flag, I could recover almost all of the reads.
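(For reference, this is roughly how I counted the records; the directory path below is just a placeholder for wherever the --save-ctc output was written:)

```python
import glob
import numpy as np

# Print the shape of each .npy file that --save-ctc produced,
# to see how many records actually made it into the training set.
for path in sorted(glob.glob("training/ctc-data/*.npy")):
    arr = np.load(path, mmap_mode="r")
    print(path, "shape:", arr.shape)
```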
Any insights? Thank you guys!
Same issue, using CTC data from just ~40,000 reads (the first 10 fast5 files, as the whole set was running out of memory on a p3.2xlarge AWS instance).
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
Hi, I've managed to solve this issue by just reducing the batch size. It seems this is an out-of-memory error for me that is somehow being reported as an illegal memory access.
Hi, I'm experiencing the same issue running Bonito under Windows 10 (WSL2, Ubuntu 22.04 LTS) with an NVIDIA RTX 3080 10 GB GPU. I tried reducing the batch size, but it did not solve the issue. Any ideas?
Config:
Hi, I've managed to solve this issue by just reducing the batch size. It seems this is an out-of-memory error for me that is somehow being reported as an illegal memory access.
What did you reduce batch size to?
Last, I extracted the single fast5s of those reads, merged them into one fast5 file, ran bonito basecaller and bonito train, and then got the error.
@89213385 would it be possible for you to share this fast5 with me? It would help me diagnose the issue.
This issue is now fixed (see my comment in #275 for instructions on how to get a version of Bonito which solves the problem - you will need to install Bonito from source but this is simple).