
CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul`

Open flytomatolll opened this issue 1 year ago • 6 comments

Sorry to bother you. I first ran train.py on my own dataset and got a checkpoint, xxx.pt. Then I used xxx.pt to run sample.py, but I got this error: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

Do you know how to fix it? Thank you.

flytomatolll avatar Mar 20 '23 01:03 flytomatolll

By the way, my dataset has only 5 classes, so I changed that in your code. Should I change anything else?

flytomatolll avatar Mar 20 '23 01:03 flytomatolll

I have the same confusion. I also tried running the file directly with the pre-trained DiT model provided by the author, and it worked. So I'm guessing something went wrong in training.

vivi7017 avatar Mar 21 '23 08:03 vivi7017

> I have the same confusion. I also tried running the file directly with the pre-trained DiT model provided by the author, and it worked. So I'm guessing something went wrong in training.

I have solved this problem: I had forgotten to change num_classes in the y_embedder. If you are running the author's pre-trained model, I think your problem may be different from mine. I also tried to run that model, but I didn't have enough memory...
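For anyone hitting the same error, here is a minimal sketch of why a wrong num_classes can break sampling. It assumes the label embedder's table has one row per class plus one extra "null" row for classifier-free guidance, so a label outside that range is an out-of-bounds lookup (on the GPU this can surface as an unrelated-looking cuBLAS error). The helper names `embedding_rows` and `check_labels` are hypothetical and not part of the DiT code; plain Python lists stand in for label tensors.

```python
def embedding_rows(num_classes, use_cfg_embedding=True):
    """Rows in the label embedding table (assumed layout):
    one per class, plus one null row when classifier-free guidance is used."""
    return num_classes + (1 if use_cfg_embedding else 0)


def check_labels(labels, num_classes, use_cfg_embedding=True):
    """Raise ValueError if any label would index outside the embedding table.

    Returns the number of rows in the table when all labels are valid.
    """
    rows = embedding_rows(num_classes, use_cfg_embedding)
    bad = [y for y in labels if not 0 <= y < rows]
    if bad:
        raise ValueError(
            f"labels {bad} are outside the valid range [0, {rows - 1}] "
            f"for num_classes={num_classes}"
        )
    return rows


if __name__ == "__main__":
    # With 5 classes, indices 0..4 are real classes and 5 is the null class.
    print(check_labels([0, 4, 5], num_classes=5))
    try:
        # Labels left over from a 1000-class setup do not fit a 5-class table.
        check_labels([207, 1000], num_classes=5)
    except ValueError as err:
        print("invalid:", err)
```

Running a check like this on the labels before they reach the model (or running the failing script once on the CPU) usually gives a much clearer message than the CUDA error.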

flytomatolll avatar Mar 22 '23 11:03 flytomatolll

Thank you very much. I have solved this problem, too. I made a similar mistake.

vivi7017 avatar Mar 22 '23 12:03 vivi7017

Hello, can you tell me how you fixed it? I have the same problem; my dataset has only 1 class. Thanks!

yh-xxx avatar Nov 10 '23 09:11 yh-xxx

@yh-xxx You have probably solved this by now, but in case somebody else needs it:

Change 1000 to 1 (that is, to your number of classes) here: https://github.com/facebookresearch/DiT/blob/main/sample.py#L56

For example:

before: y_null = torch.tensor([1000] * n, device=device)

after: y_null = torch.tensor([args.num_classes] * n, device=device)
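A small sketch of the same idea, in case it helps: it assumes the null label used for classifier-free guidance is the index equal to num_classes, as in the change above, and that the conditional and null labels are concatenated into one batch. `build_cfg_labels` is a hypothetical helper and plain lists stand in for tensors.

```python
def build_cfg_labels(class_labels, num_classes):
    """Return the conditional labels followed by one null label per sample.

    Assumption: the null (unconditional) class uses index `num_classes`,
    so real class labels must lie in [0, num_classes - 1].
    """
    null_label = num_classes
    for y in class_labels:
        if not 0 <= y < num_classes:
            raise ValueError(
                f"class label {y} is not in [0, {num_classes - 1}]"
            )
    # First half: conditional labels; second half: null labels.
    return list(class_labels) + [null_label] * len(class_labels)


if __name__ == "__main__":
    # Single-class dataset: the only real label is 0, the null label is 1.
    print(build_cfg_labels([0, 0], num_classes=1))
```

Note that any hard-coded class labels in the sampling script also have to fall inside your own class range, not only y_null.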

NrealWJX avatar May 25 '24 10:05 NrealWJX