DNABERT_2
RuntimeError: Triton Error [CUDA]: invalid argument when running run_dnabert2.sh
What is the problem, and how can it be fixed?
The provided data_path is /home/shiro/DNABERT_2/finetune
2023-08-31 17:57:18.856636: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/cuda/lib64:
2023-08-31 17:57:18.856685: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:root:Perform single sequence classification...
WARNING:root:Perform single sequence classification...
WARNING:root:Perform single sequence classification...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Some weights of the model checkpoint at zhihan1996/DNABERT-2-117M were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNABERT-2-117M and are newly initialized: ['classifier.weight', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using cuda_amp half precision backend
***** Running training *****
Num examples = 36,496
Num Epochs = 5
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 4
Total optimization steps = 5,700
Number of trainable parameters = 117,070,851
0%| | 0/5700 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-1e8410f206c822547fb50e2ea86e45a6-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-42648570729a4835b21c1c18cebedbfe-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, torch.float32, torch.float16, torch.float32, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('matrix', False, 64, False, False, False, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_modified.py", line 332, in <module>
Here are my installed packages.
(dnabert2) shiro@GTUNE:~/DNABERT_2/finetune$ pip list
Package Version
absl-py 1.0.0
accelerate 0.22.0
anndata 0.7.6
antlr4-python3-runtime 4.9.3
appdirs 1.4.4
astor 0.8.1
astunparse 1.6.3
autograd 1.4
autograd-gamma 0.5.0
biopython 1.79
biothings-client 0.2.6
bleach 5.0.1
Brotli 1.0.9
cachetools 5.0.0
certifi 2023.7.22
charset-normalizer 2.0.12
click 8.1.2
cmake 3.27.2
coloredlogs 15.0.1
cycler 0.11.0
dash 2.0.0
dash-core-components 2.0.0
dash-dangerously-set-inner-html 0.0.2
dash-html-components 2.0.0
dash-table 5.0.0
docutils 0.19
einops 0.6.1
filelock 3.12.3
Flask 2.1.1
Flask-Compress 1.11
fonttools 4.32.0
formulaic 0.2.4
fsspec 2023.6.0
future 0.18.2
gast 0.3.3
google-auth 2.6.5
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.44.0
h5py 2.10.0
huggingface-hub 0.16.4
humanfriendly 10.0
idna 3.4
importlib-metadata 4.11.3
interface-meta 1.3.0
itsdangerous 2.1.2
Jinja2 3.1.1
joblib 1.1.0
Keras-Preprocessing 1.1.2
kiwisolver 1.4.2
lifelines 0.26.4
lit 17.0.0rc3
llvmlite 0.36.0
Markdown 3.3.6
markdown-it-py 2.1.0
MarkupSafe 2.1.1
matplotlib 3.5.1
mdurl 0.1.2
mhcflurry 2.0.5
mhcgnomes 1.7.0
mygene 3.2.2
natsort 8.1.0
np-utils 0.6.0
numba 0.53.0
numpy 1.18.5
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
omegaconf 2.3.0
opt-einsum 3.3.0
packaging 21.3
pandas 1.3.4
patsy 0.5.2
peft 0.3.0
Pillow 9.1.0
pip 23.2.1
pkginfo 1.9.6
plotly 5.4.0
protobuf 3.20.0
psutil 5.9.5
Pygments 2.14.0
pynndescent 0.5.6
pyparsing 3.0.8
python-dateutil 2.8.2
pytz 2022.1
PyYAML 6.0.1
readme-renderer 37.3
regex 2023.8.8
requests 2.26.0
requests-oauthlib 1.3.1
requests-toolbelt 0.10.1
rfc3986 2.0.0
rich 13.2.0
rsa 4.8
safetensors 0.3.3
scikit-learn 1.0.2
scipy 1.4.1
seaborn 0.11.2
serializable 0.2.1
setuptools 68.0.0
six 1.16.0
SNAF 0.5.2
statsmodels 0.13.1
tenacity 8.0.1
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.6.2.2
tensorflow 2.3.0
tensorflow-estimator 2.3.0
termcolor 1.1.0
threadpoolctl 3.1.0
tokenizers 0.13.3
torch 1.13.0
torchaudio 0.13.0
torchvision 0.14.0
tqdm 4.62.3
transformers 4.29.2
triton 2.0.0.dev20221202
twine 4.0.2
typechecks 0.1.0
typing_extensions 4.7.1
umap-learn 0.5.2
urllib3 1.26.14
webencodings 0.5.1
Werkzeug 2.0.2
wheel 0.38.4
wrapt 1.14.0
xlrd 1.2.0
xmltodict 0.12.0
xmltramp2 3.1.1
Here is my GPU setup.
(dnabert2) shiro@GTUNE:~$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU (UUID: GPU-776edd0d-aef5-ab3a-3750-32bfa854fecf)
(dnabert2) shiro@GTUNE:~$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
(dnabert2) shiro@GTUNE:~$ dpkg -l | grep cudnn
ii  cudnn-local-repo-ubuntu2004-8.9.4.25  1.0-1                amd64  cudnn-local repository configuration files
ii  libcudnn8                             8.9.4.25-1+cuda11.8  amd64  cuDNN runtime libraries
ii  libcudnn8-dev                         8.9.4.25-1+cuda11.8  amd64  cuDNN development libraries and headers
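In case the version mix matters: a quick way to dump what the failing process actually sees (the CUDA runtime bundled with the torch 1.13.0 wheels versus the CUDA 11.6 toolkit above, the pinned triton dev build, and the RTX 3070's compute capability) is a few lines of Python. Nothing here is specific to DNABERT-2; these are standard torch/triton introspection calls.

```python
# Environment dump sketch: report the versions/capabilities relevant to the
# Triton flash-attention kernel.
import torch
import triton

print("torch            :", torch.__version__)                   # 1.13.0 above
print("torch built CUDA :", torch.version.cuda)                   # runtime bundled with the wheel
print("cuDNN (torch)    :", torch.backends.cudnn.version())
print("triton           :", triton.__version__)                   # 2.0.0.dev20221202 above
print("device           :", torch.cuda.get_device_name(0))
print("compute cap.     :", torch.cuda.get_device_capability(0))  # (8, 6) for an RTX 3070
```

The RTX 3070's compute capability 8.6 is not below the usual flash-attention floor, so, if anything in this dump looks off, it is more likely the combination of the system CUDA 11.6 toolkit, the cu11.7 runtime wheels, and the pinned Triton dev build than the GPU itself.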
I have the same error: the identical KeyError in _bwd_kernel, with the same Triton kernel cache key as quoted above.
Did you solve it? Thanks!
I just gave up..... Sorry.