
Getting "Floating point exception (core dumped)" Error

Open alvins82 opened this issue 1 year ago • 4 comments

Playing around with re-creating GPT-2 from the thread. When I run the training binary I get the above error. Screenshots below.

[Screenshots: error output, build command, and environment details]

alvins82 avatar Jul 15 '24 10:07 alvins82

No idea what is going on, but maybe try compiling without cuDNN: make train_gpt2cu USE_CUDNN=0. Probably not the cause, but just to check. Also run the tests to see if they pass:

# fp32 test (cudnn not supported)
make test_gpt2cu PRECISION=FP32 && ./test_gpt2cu
# mixed precision cudnn test
make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu

diegoasua avatar Jul 15 '24 10:07 diegoasua


Both of the tests pass. Also put a screenshot of my torch versions above.

alvins82 avatar Jul 15 '24 10:07 alvins82

[Screenshot: test output]

alvins82 avatar Jul 15 '24 10:07 alvins82

Eyeballing your command line, I'd say your batch size is too small and is causing an exception in the HellaSwag eval. This is a known issue, and a patch merged into master now forces you to use a batch size >= 4.

gordicaleksa avatar Jul 19 '24 15:07 gordicaleksa