simpletransformers
CUDA error: CUBLAS_STATUS_EXECUTION_FAILED during training a Multiclass Model with weights
Describe the bug While training a Multiclass Classification Model with class weights, a CUDA error is thrown. The error appears every time at 39% of the first epoch.
/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
195 # some Python versions print out the first line of a multi-line function
196 # calls in the traceback and some print out the last line
--> 197 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
198 tensors, grad_tensors_, retain_graph, create_graph, inputs,
199 allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
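A note on reading this traceback (not in the original report): CUDA kernel launches are asynchronous, so the exception often points at a later cuBLAS GEMM rather than the kernel that actually failed. Forcing synchronous launches before CUDA is initialised usually moves the error to its true call site:

```python
import os

# Must be set before torch initialises CUDA (e.g. at the top of the
# Colab notebook, before any model or tensor is moved to the GPU).
# With synchronous launches the failing kernel raises immediately,
# instead of a later cublasGemmEx call in backward().
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```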
To Reproduce I'm using the linked train.json file to train the model with the code below.
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight
from simpletransformers.classification import ClassificationModel

train_df = pd.read_json("train.json")
train_df = train_df[["labels", "text"]]

# "balanced": weights inversely proportional to class frequency
Y = train_df["labels"]
classes = np.unique(Y)
class_weights = compute_class_weight("balanced", classes=classes, y=Y).tolist()
num_labels = train_df["labels"].nunique()

model = ClassificationModel(
    "roberta",
    "ehsanaghaei/SecureBERT",
    num_labels=num_labels,
    weight=class_weights,
)
model.train_model(train_df)
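Not part of the original report, but a sanity check worth running on the data: this exact cuBLAS failure is frequently an out-of-range label index surfacing late. If the label column has gaps (a class id that never occurs), `num_labels` computed from the unique count is smaller than the largest label, and indexing blows up on the GPU. A minimal sketch with made-up labels illustrating the gap:

```python
import numpy as np

# Assumed example data -- the real train.json is not shown here.
labels = np.array([0, 1, 3, 1, 0])  # class id 2 never occurs

# len(unique) undercounts when ids have gaps: 3 here, but max label is 3
num_labels = len(np.unique(labels))

# The model expects integer labels in [0, num_labels)
out_of_range = labels[(labels < 0) | (labels >= num_labels)]
print(num_labels, out_of_range.tolist())  # 3 [3]
```

If this prints any out-of-range ids, remapping the labels to a contiguous 0..num_labels-1 range should be tried before suspecting the weights.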
Expected behavior No CUDA error, or a clear error message explaining what I'm doing wrong.
Desktop (please complete the following information):
- Google Colab, premium GPU and standard GPU
It also happens when using class_weights = compute_class_weight(None, classes=classes, y=Y).tolist()
In that case the weights are an all-ones array [1, 1, 1, ...].
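To confirm that last point (a small check, not from the original report): with class_weight=None, scikit-learn returns a weight of 1.0 for every class, so the weighted loss is equivalent to the unweighted one, which suggests the weight values themselves are not the trigger.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 1, 2, 2, 2])  # toy label vector
classes = np.unique(y)

# class_weight=None yields uniform weights, one per class
weights = compute_class_weight(None, classes=classes, y=y).tolist()
print(weights)  # [1.0, 1.0, 1.0]
```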