s4
Error after installing CUDA extension for Cauchy multiplication
I'm trying to reproduce the experiments, but the code is returning a KeyError: 'nvrtc', and the warning [src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found is still appearing.
I'm also getting this error: Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Can you elaborate on the 'nvrtc' error? Can you uninstall and reinstall the extension (pip uninstall cauchy-mult, then cd extensions/cauchy && python setup.py install) and copy what it prints?
Does the code run if you completely uninstall the extension? What about if you install pykeops?
After running multiple tests, I realized that the cauchy extension is not the problem (although it is strange that even after installing the extension, the code still reports "CUDA extension for Cauchy multiplication not found"). It is the second error that I cannot resolve:
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7f8411eaaf06]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7f8411ea28e5]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7f8411dc7e09]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7f8411dc5948]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f8411eaba3d]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7f8411d80b46]
/home/"""/.conda/envs/s4env/lib/python3.10/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7f84117e546a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43161) [0x7f84893fb161]
/lib/x86_64-linux-gnu/libc.so.6(+0x4325a) [0x7f84893fb25a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7f84893d9bfe]
python(+0x2125d4) [0x564a6c3b15d4]
/var/spool/slurmd/job2192901/slurm_script: line 18: 17719 Aborted
I haven't seen this error before. Just to confirm, this happens even with the extension uninstalled? Does your environment work with other codebases? Outside of the extension, there is nothing fancy with requirements for this repository.
Yeah, the extension is not installed. However, I'm getting this error at the end of training, after epoch 9 finishes.
Hi, I also have the same error at the end of training (running python -m train experiment=forecasting/s4-informer-{etth,ettm,ecl,weather}):
Epoch 9: 100%|█▉| 1510/1511 [00:24<00:00, 62.27it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0242, train/loss=0.0242]
Epoch 9, global step 4809: 'val/loss' was not in top 1
Epoch 9: 100%|██| 1511/1511 [00:24<00:00, 62.14it/s, loss=0.0216, v_num=pbmZ, val/mse=0.421, val/loss=0.421, test/mse=0.266, test/loss=0.266, train/mse=0.0231, train/loss=0.0231]
Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7fbaccc77f06]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7fbaccc6f8e5]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7fbaccb94e09]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7fbaccb92948]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7fbaccc78a3d]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7fbaccb4db46]
/home/.conda/envs/supergood_env_cluster/lib/python3.9/site-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7fbacc5b246a]
/lib/x86_64-linux-gnu/libc.so.6(+0x43031) [0x7fbb3f73f031]
/lib/x86_64-linux-gnu/libc.so.6(+0x4312a) [0x7fbb3f73f12a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7fbb3f71dc8e]
python(+0x2010a0) [0x56320cea20a0]
Aborted (core dumped)
So it does train, but it ends with this strange assertion error. After looking around, it seems to be an error that many people have hit in aws-sdk-cpp; for example, you can find it here: https://github.com/huggingface/datasets/issues/3310
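A side note on reading these traces: both of them point into libarrow.so.900, and Arrow's shared-library version encodes the release as major * 100 + minor, so .900 corresponds to pyarrow 9.0.x builds. A tiny illustrative helper (the function name is mine, not from any library):

```python
def arrow_version_from_soname(soname: str) -> str:
    """Decode Arrow's SO version (major * 100 + minor) from a library name.

    e.g. 'libarrow.so.900' -> '9.0', matching pyarrow 9.0.x builds.
    """
    so_version = int(soname.rsplit(".", 1)[1])
    return f"{so_version // 100}.{so_version % 100}"

print(arrow_version_from_soname("libarrow.so.900"))  # 9.0
```

This is only useful for pinning down which pyarrow build the crashing environment actually loaded.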
Thanks for the additional info! Does this error occur if you uninstall the datasets package, then? Does it only happen with AWS?
I can't run the code without the datasets library since it's required - I get a ModuleNotFoundError if I do. To clarify, I'm not running my code on AWS; I'm using my university's cluster (I don't really understand why AWS-related errors pop up, to be honest!)
You should be able to remove the datasets dependency by deleting the "lra" import from src/dataloaders/__init__.py
The code seems to work on CPU without errors. However, I'm getting a KeyError: 'nvrtc' with pykeops installed. Can you provide the pykeops version that you are using?
- Does it run when pykeops is uninstalled?
- Are you able to install the CUDA extension instead?
- Can you try pip install pykeops==1.5? Later versions of pykeops sometimes cause installation errors for me.
- What happens if you follow the instructions on the pykeops page for testing the installation?
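For reference, the install check on the pykeops page boils down to calling its built-in binding tests; a guarded version so it is a clean no-op when pykeops is absent:

```python
# Run pykeops' own install checks if it is available in this environment.
try:
    import pykeops
    pykeops.test_numpy_bindings()  # compiles and runs a tiny numpy formula
    pykeops.test_torch_bindings()  # same check through the torch bindings
except ImportError:
    print("pykeops is not installed in this environment")
```

If either call fails, the install (or the CUDA toolchain it was compiled against) is the problem, independent of this repository.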
- When pykeops is uninstalled it works without any error, but on CPU.
- I followed the steps to install the CUDA extension, but I'm still receiving [2022-08-31 15:42:03,450][src.models.sequence.ss.kernel][WARNING] - CUDA extension for Cauchy multiplication not found. Install by going to extensions/cauchy/ and running python setup.py install. This should speed up end-to-end training by 10-50%
- I'm no longer getting the KeyError: 'nvrtc' error, but I'm getting this instead:
RuntimeError: [KeOps] This KeOps shared object has been compiled without cuda support:
1) to perform computations on CPU, simply set tagHostDevice to 0
2) to perform computations on GPU, please recompile the formula with a working version of cuda.
- I passed the tests successfully
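On the "compiled without cuda support" RuntimeError above: with the pykeops 1.x API, clearing KeOps' cached builds forces the formulas to be recompiled against the current CUDA toolchain, which may be worth trying after fixing the install (a hedged sketch; it skips cleanly if pykeops is missing):

```python
# Clear pykeops' cached compiled formulas so they are rebuilt from scratch
# (pykeops 1.x API); harmless no-op here if pykeops is not installed.
try:
    import pykeops
    pykeops.clean_pykeops()  # deletes the cached build directory
except ImportError:
    print("pykeops is not installed; nothing to clean")
```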
I am also facing the exact same issue. @gitbooo have you found a solution?
- Without pykeops, the code should still run on GPU. Is there a reason you can only use CPU?
- I don't know why the extension isn't working. One note: it has to be installed for every environment (e.g., for each different GPU, CUDA version, etc.). For example, it doesn't work if different machines share a conda environment; you would need to create a separate conda environment for each environment type and install the extension in each one.
- I've seen that message several times in the past, and I think it was always caused by an improper install. Installing from a fresh environment and also installing the latest version of cmake was the solution (pip install pykeops==1.5 cmake).
- Were you able to comment out the datasets dependency? It should involve changing one line of code in src/dataloaders/__init__.py
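Regarding the per-environment point above: a quick way to confirm whether the compiled extension is actually visible from a given conda environment is to probe for its module before training. The module name cauchy_mult is my assumption, inferred from the package name used earlier in this thread:

```python
import importlib.util

def extension_available(mod_name: str = "cauchy_mult") -> bool:
    """True if the compiled Cauchy extension is importable in this env."""
    return importlib.util.find_spec(mod_name) is not None

print(extension_available())  # False on machines where it isn't installed
```

Running this inside each environment (and on each machine type) shows immediately which ones are missing the install.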