
RuntimeError: CUDA error: an illegal memory access was encountered

Open kuanchiun opened this issue 3 years ago • 12 comments

Hello!

I just ran into a CUDA error. I installed bonito with pip install -f https://download.pytorch.org/whl/torch_stable.html ont-bonito-cuda113

Here is the command I used: bonito train --epochs 20 --pretrained [email protected] --batch 10 --directory training/ctc-data training/fine_tuned_model

Here is the full traceback:


[loading data]
[validation set not found: splitting training set]
[loading model]
[using pretrained model [email protected]]
[0/53]: 0%| | [00:00]
Error - an illegal memory access was encountered

Traceback (most recent call last):
  File "/data/home/R421/anaconda3/envs/bonito/bin/bonito", line 8, in <module>
    sys.exit(main())
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/__init__.py", line 34, in main
    args.func(args)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/cli/train.py", line 97, in main
    trainer.fit(workdir, args.epochs, lr)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/training.py", line 210, in fit
    train_loss, duration = self.train_one_epoch(loss_log, lr_scheduler)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/training.py", line 135, in train_one_epoch
    losses, grad_norm = self.train_one_step(batch)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/training.py", line 99, in train_one_step
    losses_ = self.criterion(scores_, targets_, lengths_)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/crf/model.py", line 177, in loss
    return self.seqdist.ctc_loss(scores.to(torch.float32), targets, target_lengths, **kwargs)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/bonito/crf/model.py", line 122, in ctc_loss
    logz = logZ_cu(stay_scores, move_scores, target_lengths + 1 - self.state_len)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/koi/ctc.py", line 115, in logZ_cu
    return LogZ.apply(stay_scores, move_scores, target_lengths, _simple_lattice_fwd_bwd_cu, S)
  File "/data/home/R421/anaconda3/envs/bonito/lib/python3.9/site-packages/koi/ctc.py", line 53, in forward
    g = S.dsum(torch.cat([S.mul(alpha[:-1], beta_stay), S.mul(alpha[:-1], beta_move)], dim=2), dim=2)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

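As the message suggests, re-running under CUDA_LAUNCH_BLOCKING=1 (for example by prefixing the bonito train command with it) makes kernel launches synchronous, so the error is raised at the call that actually failed rather than at a later API call. The following is only a minimal sketch of the same idea from Python; the tensor below is a placeholder, not Bonito's real input.

# Force synchronous CUDA kernel launches so errors surface at the failing call.
# The variable must be set before the CUDA context is created.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

x = torch.randn(4, 8, device="cuda")   # placeholder tensor, not Bonito data
y = (x * 2).sum()
torch.cuda.synchronize()               # any pending kernel error is raised here
print(y.item())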

I'm using an RTX 3090 with driver version 470.103.01 and CUDA 11.3. Here is the nvcc output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

Are there any suggestions? Thanks!

kuanchiun avatar Feb 23 '22 07:02 kuanchiun

I'm getting the same error from bonito train.

I installed the latest version of bonito:

pip install --upgrade pip
pip install -f https://download.pytorch.org/whl/torch_stable.html ont-bonito-cuda113

bonito --version
bonito 0.5.1

I'm training a model using my own reads. After the basecaller step that included --save-ctc, I ran the following: bonito train --directory ./training/ctc-data training/model-dir

Note: Basecalling wouldn't work on our P100 devices but ran successfully on V100.

All our cluster's V100s have CUDA 11.5. Is there forward compatibility for the 11.3 build? Seems like CUDA 11.5 isn't the issue, as @89213385 (https://github.com/89213385) is using CUDA 11.3 and encountering the same error.
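One way to confirm which CUDA runtime the installed wheel was built against, and whether the node's driver and device will run it, is to query torch directly from inside the bonito environment. A minimal check (nothing here is Bonito-specific):

# Report the CUDA runtime PyTorch was built against and what the driver/device expose.
import torch

print("torch version:", torch.__version__)
print("built against CUDA:", torch.version.cuda)      # "11.3" for the cu113 wheel
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))

In general a newer driver runs binaries built against an older 11.x runtime, so the 11.5 driver with the cu113 wheel should not by itself be a problem.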

More device/driver info:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

nvidia-smi
Wed Feb 23 16:17:42 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   28C    P0    26W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Any suggestions? Thank you.

LRFreeborn avatar Feb 23 '22 21:02 LRFreeborn

I am encountering the same error with 1080TI with the following specs

Fri Feb 25 10:38:12 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| 27%   29C    P8     6W / 180W |   1451MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:03:00.0 Off |                  N/A |
| 28%   29C    P8     7W / 180W |    109MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:81:00.0 Off |                  N/A |
| 27%   25C    P8     7W / 180W |      4MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:82:00.0 Off |                  N/A |
| 27%   26C    P8     6W / 180W |      4MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

chrishendra93 avatar Feb 25 '22 02:02 chrishendra93

Same exact error as OP (specifically, this line: g = S.dsum(torch.cat([S.mul(alpha[:-1], beta_stay), S.mul(alpha[:-1], beta_move)], dim=2), dim=2), followed by RuntimeError: CUDA error: an illegal memory access was encountered) on Tesla V100 (CUDA 10.2)

My command: bonito train --directory converted_hdf5/ --config bonito/bonito/models/configs/[email protected] custom_model_dir

tcb72 avatar Mar 06 '22 15:03 tcb72

Hi @tcb72, @chrishendra93, @LRFreeborn, @89213385 - would it be possible for one of you to provide data for me to reproduce this problem?

vellamike avatar Mar 06 '22 22:03 vellamike

Hello @vellamike, the data I used is partitions 225 from https://github.com/marbl/CHM13.

I tried to repeat the model training described in https://www.biorxiv.org/content/10.1101/2022.01.11.475254v1.full

First, I ran bonito basecaller [email protected] partitions225_fast5 > partitions225.fastq. Next, I used minimap2 to check the locations of the reads on the chm13_v1.1 reference genome and selected the reads located within 10 kb of the head or tail of any chromosome. Last, I extracted the single-read fast5 files of those reads, merged them into one fast5 file, ran bonito basecaller and bonito train, and then I got the error.
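For concreteness, here is a rough sketch of that filtering step using pysam. This is not the script that was actually used, only an illustration; the BAM file name and output path are placeholders, and it assumes a coordinate-sorted, indexed BAM produced by minimap2.

# Collect the IDs of reads whose alignments fall within 10 kb of either end
# of any chromosome. File names are placeholders.
import pysam

WINDOW = 10_000
keep = set()

with pysam.AlignmentFile("partitions225.bam", "rb") as bam:
    for contig in bam.references:
        length = bam.get_reference_length(contig)
        # reads near the head of the chromosome
        for read in bam.fetch(contig, 0, min(WINDOW, length)):
            keep.add(read.query_name)
        # reads near the tail of the chromosome
        for read in bam.fetch(contig, max(0, length - WINDOW), length):
            keep.add(read.query_name)

with open("reads_near_ends.txt", "w") as out:
    out.write("\n".join(sorted(keep)) + "\n")

The resulting read ID list can then be used to pull out the matching single-read fast5 files before merging them.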

Over the past few days, I have tried training with the full sample, without filtering the reads, and got no errors. Perhaps the error is caused by giving too few reads?

Thank you.

kuanchiun avatar Mar 07 '22 04:03 kuanchiun

Same exact error here.

And there is another thing that confuses me.

Since @89213385 mentioned the number of reads, I checked the output of "bonito basecaller --save-ctc".

The training set contains ~800 fast5 files with 4000 reads per fast5. However, the resulting sam/tsv/npy files contain only ~9000 records.

Meanwhile, I tried running "bonito basecaller" without the --save-ctc flag and could recover almost all of the reads.
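One way to confirm how many records actually ended up in the CTC training set is to load the saved arrays directly. A minimal sketch, assuming the output directory contains the chunks.npy / references.npy / reference_lengths.npy files written by --save-ctc (the path is a placeholder):

# Count the CTC training records written by bonito basecaller --save-ctc.
# "training/ctc-data" is a placeholder path.
import numpy as np

chunks = np.load("training/ctc-data/chunks.npy", mmap_mode="r")
refs = np.load("training/ctc-data/references.npy", mmap_mode="r")
lengths = np.load("training/ctc-data/reference_lengths.npy", mmap_mode="r")

print("chunks:", chunks.shape)             # (n_records, samples_per_chunk)
print("references:", refs.shape)           # (n_records, max_reference_length)
print("reference lengths:", lengths.shape) # (n_records,)

The first dimension of each array should agree and gives the number of records available for training.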

Any insights? Thank you guys!

hd2326 avatar Mar 26 '22 21:03 hd2326

Same issue, using CTC data from just ~40,000 reads (the first 10 fast5 files, as the whole set was going out of memory on the p3.2xlarge AWS instance).

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

dottorscaglione avatar Apr 05 '22 17:04 dottorscaglione

Hi, I've managed to solve this issue by just reducing the batch size. Seems like this is an out of memory error for me that is somehow being reported as illegal memory access

chrishendra93 avatar May 20 '22 07:05 chrishendra93

Hi, I'm experiencing the same issue running Bonito under Windows 10 WSL2 (Ubuntu 22.04 LTS) with an NVIDIA RTX 3080 10 GB GPU. I tried reducing the batch size, but it did not solve the issue. Any ideas?

Config: example

N0toriou5 avatar Jun 16 '22 09:06 N0toriou5

Hi, I've managed to solve this issue by just reducing the batch size. Seems like this is an out of memory error for me that is somehow being reported as illegal memory access

What did you reduce batch size to?

vellamike avatar Jun 30 '22 14:06 vellamike

Last, I extracted the single-read fast5 files of those reads, merged them into one fast5 file, ran bonito basecaller and bonito train, and then I got the error.

@89213385 would it be possible for you to share this fast5 with me? It would help me diagnose the issue.

vellamike avatar Jun 30 '22 14:06 vellamike

This issue is now fixed (see my comment in #275 for instructions on how to get a version of Bonito which solves the problem - you will need to install Bonito from source but this is simple).

vellamike avatar Jul 13 '22 15:07 vellamike