llm-foundry
fatal error: cuda.h: No such file or directory
When running the training section of the README, I get an error regarding cuda.h. Is it possible to specify a path where composer should look for the CUDA headers? I have cuda.h at ~/anaconda3/include/cuda.h and also have ~/anaconda3/include/cuda_runtime.h, which some general Stack Overflow answers recommend checking for when running into issues with cuda.h.
What I'm running:
# Train an MPT-125m model for 10 batches
composer train/train.py \
train/yamls/mpt/125m.yaml \
data_local=my-copy-c4 \
train_loader.dataset.split=train_small \
eval_loader.dataset.split=val_small \
max_duration=10ba \
eval_interval=0 \
save_folder=mpt-125m
The error I'm getting:
/tmp/tmpd3ybp9wv/main.c:2:10: fatal error: cuda.h: No such file or directory
2 | #include "cuda.h"
| ^~~~~~~~
compilation terminated.
There is also a long stack trace, which I can include if that's useful.
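On the question of pointing the compiler at an existing cuda.h: the error format above looks like GCC output, and GCC searches any directories listed in the CPATH environment variable as if they had been passed with -I. The sketch below is a possible workaround under that assumption, not a confirmed fix for this issue. The helper name add_cuda_include_dir is hypothetical, and the directory is the one reported in this post, so adjust it for your setup. Run it in the same shell session before launching composer.

```shell
# Prepend a directory to CPATH if it contains cuda.h.
# GCC treats directories in CPATH as if they were passed with -I.
# add_cuda_include_dir is a hypothetical helper, not part of any library.
add_cuda_include_dir() {
    dir="$1"
    if [ -f "$dir/cuda.h" ]; then
        # Keep any existing CPATH entries after the new directory.
        export CPATH="$dir${CPATH:+:$CPATH}"
        echo "CPATH=$CPATH"
    else
        echo "cuda.h not found in $dir" >&2
        return 1
    fi
}

# Assumption: cuda.h lives under ~/anaconda3/include, as reported above.
add_cuda_include_dir "$HOME/anaconda3/include" || true
```

If the header is found, the compile step that failed on #include "cuda.h" should then be able to resolve it, provided it really does go through GCC and inherits the environment.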
This fixed the issue for me.
pip install xentropy-cuda-lib@git+https://github.com/HazyResearch/[email protected]#subdirectory=csrc/xentropy
Thanks, this solved my issue. Once I got past it I ran into another problem: my local GPU is a 4090, which apparently trips up the Triton PTX step (ptxas fatal : Value 'sm_89' is not defined for option 'gpu-name'). That is not related to the original issue I posted.
I reran the README.md steps on an A100 40GB with the fix recommended by @P1ayer-1 and had no issues with the dataloader or training step.