[🐛BUG] full_sort_scores cuda error

Open IFShirokikh opened this issue 1 year ago • 3 comments

Describe the bug: I am able to train the model, but I cannot get predictions on the test sample.

To Reproduce: I have uploaded the following files to https://drive.google.com/drive/folders/1YLS0R41sWbDvL3_CxEsSmc9n0UbNXwbH:

  1. "hh.yaml"
  2. jupyter notebook "Recbole example.ipynb" with error (I stopped training after 1 epoch to reproduce the error faster)
  3. data for training: "hh_recbole"
  4. saved model: "saved"

Expected behavior: I wanted to reproduce the case study at https://recbole.io/docs/user_guide/usage/case_study.html

Screenshots: (two images attached to the original issue)

Desktop:

  • OS: Linux
  • RecBole version: 1.2.0
  • Python version: 3.9.18
  • PyTorch version: 2.0.1
  • cudatoolkit version: 11.0

IFShirokikh, Jan 10 '24 13:01

Re-running the failing cell leads to the following result: (image attached to the original comment)

IFShirokikh, Jan 10 '24 13:01

A similar error occurred after several epochs of training, when the model tried to load the most recent best checkpoint. The problem has therefore become critical: it is impossible to either train or test the model. Attached: train log.txt

IFShirokikh, Jan 11 '24 07:01

Thanks for your attention to RecBole! As for your problem, you can try the advice below.

  1. CUDA compatibility: Make sure that your GPU is CUDA-capable, and check which CUDA builds are available for your PyTorch version in the official PyTorch installation notes: https://pytorch.org/get-started/previous-versions/.
  2. PyTorch installation: Verify that the PyTorch build you installed corresponds to your CUDA version.

Hope this helps!
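
Both checks can be done from Python with a short script like the one below. This is only a minimal sketch, not part of RecBole: `cuda_report` is a hypothetical helper name chosen for illustration, and it relies only on standard PyTorch attributes (`torch.version.cuda`, `torch.cuda.is_available()`, and similar). It assumes the GPU of interest is device 0.

```python
import importlib.util


def cuda_report():
    """Collect the PyTorch/CUDA facts relevant to the two checks above.

    Hypothetical helper for illustration; not a RecBole or PyTorch API.
    """
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled against
    # (None for CPU-only builds).
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Assumption: the GPU used for training is device 0.
        report["device_name"] = torch.cuda.get_device_name(0)
        # Compute capability of the GPU as a (major, minor) tuple.
        report["compute_capability"] = torch.cuda.get_device_capability(0)
        # GPU architectures this build ships kernels for, e.g. "sm_80".
        report["supported_archs"] = torch.cuda.get_arch_list()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` is `None`, the installed build is CPU-only. If it names a CUDA version other than the one set up on the machine (cudatoolkit 11.0 in the report above), reinstalling PyTorch with a matching build from the page linked in point 1 is probably the first thing to try.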

BoXiaohe, Jan 20 '24 12:01