Eval `gpt2` fails with `CUBLAS_STATUS_NOT_INITIALIZED`

Open matthiasgeihs opened this issue 1 year ago • 0 comments

Running

python eval/eval.py eval/yamls/hf_eval.yaml icl_tasks=eval/yamls/winograd.yaml model_name_or_path=gpt2

fails with

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

The same command runs fine with other models (e.g., EleutherAI/gpt-neo-125M). Any ideas what could be going wrong in the case of gpt2?
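One thing that might be worth ruling out (an unconfirmed guess, not a diagnosis): a `cublasCreate` failure is sometimes a delayed symptom of an earlier out-of-range embedding lookup on the GPU. `gpt2` has 1024 position embeddings (`n_positions` in its config), while `EleutherAI/gpt-neo-125M` has 2048 (`max_position_embeddings`), so a sequence length that fits one model may overflow the other. The sequence length configured in `hf_eval.yaml` is not shown in this report, so `MAX_SEQ_LEN` below is a hypothetical value, and `find_out_of_range` is an illustrative helper rather than anything from llm-foundry. A minimal sketch of the check:

```python
def find_out_of_range(indices, table_size):
    """Return the positions in `indices` whose value cannot index a table with `table_size` rows."""
    return [i for i, idx in enumerate(indices) if not 0 <= idx < table_size]


# gpt2 ships with 1024 position embeddings (n_positions in its config).
N_POSITIONS = 1024
# Hypothetical eval sequence length; the real value comes from the YAML and is an assumption here.
MAX_SEQ_LEN = 2048

# Position ids a model would see for one full-length sequence.
position_ids = list(range(MAX_SEQ_LEN))
bad = find_out_of_range(position_ids, N_POSITIONS)

if bad:
    print(f"{len(bad)} position ids exceed the embedding table, first at index {bad[0]}")
else:
    print("all position ids fit in the embedding table")
```

If the configured length does turn out to exceed 1024, lowering it for `gpt2` would be the first thing to try. Re-running the same command on CPU, or with `CUDA_LAUNCH_BLOCKING=1` set, usually surfaces the underlying error more clearly than the cuBLAS message does.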

Environment


System Environment Report
Created: 2023-06-22 15:38:44 UTC

PyTorch information

PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.10.11 (main, Apr 5 2023, 14:15:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.19.0-1026-gcp-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB

Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.2
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.13.1+cu117
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==0.11.3
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1+cu117
[conda] Could not collect

Composer information

Composer version: 0.14.1
Composer commit hash: None
Host processor model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Host processor core count: 12
Number of nodes: 1
Accelerator model name: NVIDIA A100-SXM4-40GB
Accelerators per node: 1
CUDA Device Count: 2

matthiasgeihs, Jun 22 '23 15:06