fatal error: cuda.h: No such file or directory

Open Babramson opened this issue 2 years ago • 1 comments

When running the training section of the README, I get an error about cuda.h. Is it possible to specify a path where Composer should look for the CUDA headers? I have cuda.h at ~/anaconda3/include/cuda.h and also have ~/anaconda3/include/cuda_runtime.h, which some general Stack Overflow answers recommended checking when running into issues with cuda.h.

What I'm running:

# Train an MPT-125m model for 10 batches
composer train/train.py \
  train/yamls/mpt/125m.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=0 \
  save_folder=mpt-125m

The error I'm getting:

/tmp/tmpd3ybp9wv/main.c:2:10: fatal error: cuda.h: No such file or directory
    2 | #include "cuda.h"
      |          ^~~~~~~~
compilation terminated.

There is also a long stack trace, which I can include if that's useful.
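For readers hitting the same error, here is a minimal sketch for checking which include directory actually contains cuda.h and printing a shell line that adds it to CPATH (an environment variable GCC consults for extra include directories). The candidate directory list and the helper names `find_cuda_header` and `default_candidates` are assumptions made for illustration; nothing here is part of llm-foundry or Composer, and whether setting CPATH resolves this particular compile step is not confirmed in this thread.

```python
import os
from pathlib import Path


def find_cuda_header(candidates):
    """Return the first directory in `candidates` that contains cuda.h, or None."""
    for directory in candidates:
        path = Path(directory).expanduser()
        if (path / "cuda.h").is_file():
            return path
    return None


def default_candidates():
    """Hypothetical list of places to look; adjust for your own install."""
    dirs = []
    for var in ("CUDA_HOME", "CUDA_PATH", "CONDA_PREFIX"):
        root = os.environ.get(var)
        if root:
            dirs.append(os.path.join(root, "include"))
    dirs += ["~/anaconda3/include", "/usr/local/cuda/include"]
    return dirs


if __name__ == "__main__":
    found = find_cuda_header(default_candidates())
    if found is None:
        print("cuda.h not found in any candidate directory")
    else:
        # Print a line to paste into the shell before rerunning the training command.
        print(f'export CPATH="{found}${{CPATH:+:$CPATH}}"')
```

If the script prints an export line, running that line in the same shell before the composer command is one thing to try; the fix reported in the reply below is the one confirmed to work in this thread.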

Babramson avatar May 11 '23 23:05 Babramson

This fixed the issue for me.

pip install xentropy-cuda-lib@git+https://github.com/HazyResearch/[email protected]#subdirectory=csrc/xentropy

P1ayer-1 avatar May 13 '23 23:05 P1ayer-1

Thanks, this solved my issue. I ran into another error once I got past the one I posted, but it was related to my local GPU being a 4090, which apparently trips up Triton's PTX compilation (ptxas fatal : Value 'sm_89' is not defined for option 'gpu-name'). That is not related to the original issue I posted.

I reran the README steps on an A100 40GB with the fix recommended by @P1ayer-1 and had no issues with the dataloader/training step.
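The sm_89 message means the ptxas binary being invoked does not recognize the 4090's compute capability (8.9). A rough sketch for checking this: build the `sm_XY` name for the local GPU and look for it among the targets mentioned in the `ptxas --help` text. The helper names are hypothetical, the script assumes the help text lists the allowed `sm_*` values, and it only inspects the ptxas found on PATH, which may not be the same binary Triton uses.

```python
import re
import shutil
import subprocess


def sm_name(major, minor):
    """Build the ptxas target name for a compute capability, e.g. (8, 9) -> 'sm_89'."""
    return f"sm_{major}{minor}"


def supported_targets(ptxas_help_text):
    """Collect every sm_* target name mentioned in the given help text."""
    return set(re.findall(r"sm_\d+[a-z]?", ptxas_help_text))


def ptxas_help(ptxas="ptxas"):
    """Return the help text of the ptxas on PATH, or '' if it is not installed."""
    exe = shutil.which(ptxas)
    if exe is None:
        return ""
    result = subprocess.run([exe, "--help"], capture_output=True, text=True)
    return result.stdout + result.stderr


if __name__ == "__main__":
    try:
        import torch  # optional; only used to query the local GPU
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        target = sm_name(*torch.cuda.get_device_capability())
        known = supported_targets(ptxas_help())
        print(f"{target} supported by ptxas on PATH: {target in known}")
    else:
        print("No CUDA device visible from PyTorch; nothing to check")
```

If the target is missing, the toolchain predates the GPU and needs a newer CUDA/Triton build, which matches the observation above that the same steps ran cleanly on an A100.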

Babramson avatar May 15 '23 16:05 Babramson