Error after installing CUDA extension for Cauchy multiplication

Open gitbooo opened this issue 2 years ago • 13 comments

I'm trying to reproduce the experiments, but the code raises KeyError 'nvrtc', and the warning [src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found still appears.

Otherwise, I'm getting this error: Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS

gitbooo avatar Aug 26 '22 20:08 gitbooo

Can you elaborate on the 'nvrtc' error? Can you uninstall and reinstall the extension (pip uninstall cauchy-mult, then cd extensions/cauchy && python setup.py install) and copy what it prints?

Does the code run if you completely uninstall the extension? What about if you install pykeops?
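A quick way to answer the questions above is to ask the interpreter that launches training which backends it can actually see. The following is a minimal sketch, not part of the repository: the module name cauchy_mult is an assumption inferred from the pip uninstall cauchy-mult command, and pykeops is the PyPI package name.

```python
import importlib.util

# Assumed module names: "cauchy_mult" is inferred from `pip uninstall cauchy-mult`;
# "pykeops" is the PyPI package name. Adjust if your build uses other names.
CANDIDATES = ("cauchy_mult", "pykeops")


def find_backends(names=CANDIDATES):
    """Map each module name to the file it would be imported from, or None."""
    found = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        found[name] = spec.origin if spec is not None else None
    return found


if __name__ == "__main__":
    for name, origin in find_backends().items():
        print(f"{name}: {origin or 'not importable from this interpreter'}")
```

If the extension shows up as not importable even though setup.py install reported success, it was probably installed into a different environment than the one running the job.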

albertfgu avatar Aug 29 '22 19:08 albertfgu

After doing multiple tests, I realized that the cauchy extension is not the problem (although it is strange that even after installing the extension, the code still returns "CUDA extension for cauchy multiplication not found"), but it is the second error that I cannot resolve:

Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7f8411eaaf06]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7f8411ea28e5]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7f8411dc7e09]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7f8411dc5948]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7f8411d80b46]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7f84117e546a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43161) [0x7f84893fb161]
/lib/x86_64-linux-gnu/libc.so.6(+0x4325a) [0x7f84893fb25a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7f84893d9bfe]
python(+0x2125d4) [0x564a6c3b15d4]
/var/spool/slurmd/job2192901/slurm_script: line 18: 17719 Aborted


gitbooo avatar Aug 29 '22 19:08 gitbooo

I haven't seen this error before. Just to confirm, this happens even with the extension uninstalled? Does your environment work with other codebases? Outside of the extension, there is nothing fancy with requirements for this repository.

albertfgu avatar Aug 29 '22 19:08 albertfgu

Yeah, the extension is not installed. However, I'm getting this error at the end of training, after epoch 9 has finished.

gitbooo avatar Aug 29 '22 20:08 gitbooo

Hi, I have the same error at the end of training (running python -m train experiment=forecasting/s4-informer-{etth,ettm,ecl,weather}):

Epoch 9: 100%|█▉| 1510/1511 [00:24<00:00, 62.27it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0242, train/loss=0.0242
Epoch 9, global step 4809: 'val/loss' was not in top 1
Epoch 9: 100%|██| 1511/1511 [00:24<00:00, 62.14it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0231, train/loss=0.0231]
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7fbaccc77f06]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7fbaccc6f8e5]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7fbaccb94e09]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7fbaccb92948]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7fbaccb4db46]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7fbacc5b246a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43031) [0x7fbb3f73f031]
/lib/x86_64-linux-gnu/libc.so.6(+0x4312a) [0x7fbb3f73f12a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7fbb3f71dc8e]
python(+0x2010a0) [0x56320cea20a0]
Aborted (core dumped)

So it does train, but the run ends with this strange assertion failure. After looking around, it seems that many people have hit this error in connection with aws-sdk-cpp; for example, see https://github.com/huggingface/datasets/issues/3310
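Every non-libc frame in the stack trace sits inside pyarrow's libarrow.so.900, so it helps to record exactly which package versions are installed before comparing against the linked issue. A minimal sketch using only the standard library; the package list is an assumption about what is relevant here, and reading the .900 suffix as pyarrow 9.x is a guess that this report can confirm or refute.

```python
from importlib import metadata


def installed_versions(packages=("pyarrow", "datasets", "pykeops", "torch")):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions


if __name__ == "__main__":
    for pkg, version in installed_versions().items():
        print(f"{pkg}: {version or 'not installed'}")
```

Pasting this output alongside the trace makes it easier to tell whether two reports come from the same library combination.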

danassou avatar Aug 29 '22 20:08 danassou

Thanks for the additional info! Does this error occur if you uninstall the datasets package then? Does it only happen with AWS?

albertfgu avatar Aug 29 '22 21:08 albertfgu

I can't run the code without the datasets library since it's required; I get a "no module found" error if I remove it. To clarify, I'm not running my code on AWS. I'm using my university's cluster (to be honest, I don't really understand why AWS-related errors pop up!)

danassou avatar Aug 29 '22 21:08 danassou

You should be able to remove the dataset dependency by deleting the "lra" import from src/dataloaders/__init__.py
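An alternative to deleting the line is to make the import optional, so the package still loads when datasets is missing. This is only a sketch of that idea, not how the repository does it: optional_import is a hypothetical helper, and the exact form of the "lra" import in src/dataloaders/__init__.py is assumed.

```python
import importlib
import logging

log = logging.getLogger(__name__)


def optional_import(module_name, package=None):
    """Import a module if possible; log a warning and return None otherwise."""
    try:
        return importlib.import_module(module_name, package=package)
    except ImportError as exc:  # also covers ModuleNotFoundError
        log.warning("Skipping optional module %s: %s", module_name, exc)
        return None


# Hypothetical use inside src/dataloaders/__init__.py (assumed layout):
# lra = optional_import(".lra", package=__name__)
```

With a guard like this, uninstalling datasets disables only the dataloaders that need it instead of breaking every import of the package.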

albertfgu avatar Aug 29 '22 22:08 albertfgu

> You should be able to remove the dataset dependency by deleting the "lra" import from src/dataloaders/__init__.py

The code seems to work on CPU without errors. However, I'm getting a KeyError 'nvrtc' with pykeops installed. Can you tell us which pykeops version you are using?

gitbooo avatar Aug 30 '22 20:08 gitbooo

  1. Does it run when pykeops is uninstalled?
  2. Are you able to install the CUDA extension instead?
  3. Can you try pykeops version 1.5 (pip install pykeops==1.5)? Later versions of pykeops sometimes cause installation errors for me.
  4. What happens if you follow the instructions on the pykeops page for testing the installation?
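The installation test mentioned in item 4 can be run from Python. A minimal sketch: clean_pykeops, test_numpy_bindings and test_torch_bindings are pykeops functions, while the wrapper check_keops and its return strings are made up for illustration. Clearing the cache first is an assumption about what helps, on the theory that a formula compiled earlier without CUDA may otherwise be reused.

```python
def check_keops():
    """Clear the KeOps build cache and rerun its self-tests, if installed."""
    try:
        import pykeops
    except ImportError:
        return "pykeops not installed"
    try:
        pykeops.clean_pykeops()        # drop previously compiled formulas
        pykeops.test_numpy_bindings()  # prints a status message
        pykeops.test_torch_bindings()  # same check through the PyTorch bindings
    except Exception as exc:           # report the failure instead of crashing
        return f"checks failed: {exc!r}"
    return "checks ran"
```

Running print(check_keops()) on the same node and in the same environment as the training job matters, since a test that passes on a login node says little about the GPU node.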

albertfgu avatar Aug 30 '22 22:08 albertfgu

  • When pykeops is uninstalled, the code runs without any error, but only on CPU.
  • I followed the steps to install the CUDA extension, but I still receive: [2022-08-31 15:42:03,450][src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found. Install by going to extensions/cauchy/ and running python setup.py install. This should speed up end-to-end training by 10-50%
  • I'm not getting the KeyError 'nvrtc', but I'm getting this instead:
RuntimeError: [KeOps] This KeOps shared object has been compiled without cuda support: 
 1) to perform computations on CPU, simply set tagHostDevice to 0
 2) to perform computations on GPU, please recompile the formula with a working version of cuda.
  • I passed the tests successfully

gitbooo avatar Aug 31 '22 19:08 gitbooo

I am also facing the exact same issue. @gitbooo have you found a solution?

farshchian avatar Sep 04 '22 18:09 farshchian

  1. Without pykeops, the code should still run on GPU. Is there a reason you can only use CPU?
  2. I don't know why the extension isn't working. Note that it has to be installed separately for every environment (different GPU, CUDA version, etc.). For example, it doesn't work if different machines share a conda environment; you would need to create a separate conda environment for each environment type and install the extension in each one
  3. I've seen that message several times in the past and I think it was always caused by an improper install. Installing from a fresh environment and also installing the latest version of cmake was the solution (pip install pykeops==1.5 cmake)
  4. Were you able to comment out the datasets dependency? It should involve changing one line of code in src/dataloaders/__init__.py
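Since item 2 ties the extension to a specific GPU and CUDA setup, it can help to compare the machine where the extension was built with the machine where the job runs. A minimal sketch, assuming PyTorch is the relevant CUDA consumer; environment_fingerprint is a hypothetical helper, and every field stays None when torch or a GPU is unavailable.

```python
import platform
import sys


def environment_fingerprint():
    """Collect the details an extension build is likely to depend on."""
    info = {
        "python": platform.python_version(),
        "prefix": sys.prefix,   # which conda/virtual environment is active
        "torch": None,
        "cuda_build": None,     # CUDA version PyTorch was built against
        "gpu": None,
        "capability": None,
    }
    try:
        import torch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in environment_fingerprint().items():
        print(f"{key}: {value}")
```

If the output differs between the build node and the compute node (for example, gpu is None on one of them), rebuilding the extension inside a job on the compute node is the first thing to try.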

albertfgu avatar Sep 07 '22 19:09 albertfgu