simpletransformers

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED during training a Multiclass Model with weights

Open simonkleinfeld opened this issue 1 year ago • 2 comments

Describe the bug

While training a multiclass classification model with class weights, a CUDA error is thrown. The error appears every time at 39% of the first epoch.

/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    195     # some Python versions print out the first line of a multi-line function
    196     # calls in the traceback and some print out the last line
--> 197     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    198         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    199         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
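A possible way to narrow this down (a sketch, not something confirmed in this thread): CUDA kernels run asynchronously, so the traceback above may point at a later call than the one that actually failed. Setting `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized makes launches synchronous, and the `CUDA_R_16F` arguments suggest half precision is in use, so turning it off is another thing to try. The `debug_args` name is hypothetical; `fp16` is assumed to be the simpletransformers model argument that controls mixed precision.

```python
import os

# Must be set before the first CUDA call, so in Colab this belongs in the
# first cell after restarting the runtime.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Hypothetical debugging overrides, assumed to be passed as `args=debug_args`
# when constructing the ClassificationModel.
debug_args = {
    "fp16": False,  # assumption: disables half-precision training
}
```

Running the same training once on the CPU (assumed to be `use_cuda=False` on the model constructor) usually produces a plain Python exception instead of an opaque CUDA error, at the cost of being much slower.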

To Reproduce

I'm using the linked train.json file to train the model with the code below.

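import numpy as np
import pandas as pd
from simpletransformers.classification import ClassificationModel
from sklearn.utils.class_weight import compute_class_weight

# Load the training data and keep only the label and text columns.
train_df = pd.read_json("train.json")
train_df = train_df[['labels', 'text']]

# Compute one balanced weight per class.
Y = train_df['labels']
classes = np.unique(Y)
class_weights = compute_class_weight('balanced', classes=classes, y=Y).tolist()
num_labels = len(classes)

model = ClassificationModel(
    "roberta",
    "ehsanaghaei/SecureBERT",
    num_labels=num_labels,
    weight=class_weights,
)
model.train_model(train_df)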
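One thing that might be worth ruling out before training (an assumption, not a confirmed cause of this error): PyTorch classification losses expect label ids in the range 0..num_labels-1, and a label outside that range typically surfaces on the GPU as an unrelated-looking CUDA error partway through an epoch. The helper below is a hypothetical, dependency-free sketch that checks the labels against the weight list.

```python
def check_labels_and_weights(labels, weights):
    """Return a list of problems found; an empty list means none were found.

    `labels` is a list of label ids, `weights` is the per-class weight list
    (one entry per class, so its length is taken as the number of classes).
    """
    num_labels = len(weights)
    distinct = set(labels)
    if not distinct:
        return ["no labels given"]
    # Label ids must be plain integers (bool is excluded on purpose).
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in distinct):
        return ["labels are not all plain Python integers"]

    problems = []
    expected = set(range(num_labels))
    out_of_range = sorted(distinct - expected)
    if out_of_range:
        problems.append(
            f"label ids outside 0..{num_labels - 1}: {out_of_range}"
        )
    missing = sorted(expected - distinct)
    if missing:
        problems.append(
            f"label ids with a weight but no training example: {missing}"
        )
    return problems
```

With the snippet above it could be called as `check_labels_and_weights(train_df["labels"].tolist(), class_weights)`. If it reports out-of-range ids, remapping the labels to consecutive integers starting at 0 before training would be the usual next step.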

Expected behavior

No CUDA error, or a clear error message about what I'm doing wrong.

Desktop (please complete the following information):

  • Google Colab, Premium GPU and Standard GPU

simonkleinfeld, Mar 13 '23 09:03

It also happens when using class_weights = compute_class_weight(None, classes=classes, y=Y).tolist(). In that case the weights are an array of ones: [1, 1, 1, ...].

simonkleinfeld, Mar 13 '23 16:03

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Jun 18 '23 06:06