
Getting "Floating point exception (core dumped)" Error

Open alvins82 opened this issue 1 year ago • 4 comments

Playing around with re-creating GPT-2 from the thread. When I run the training binary I get the above error. Screenshots below.

[Screenshots: error output, build command, and environment details]

alvins82 avatar Jul 15 '24 10:07 alvins82

No idea what is going on, but maybe try compiling without cuDNN: make train_gpt2cu USE_CUDNN=0. Probably not the cause, but just to check. Also run the tests to see if they pass:

# fp32 test (cudnn not supported)
make test_gpt2cu PRECISION=FP32 && ./test_gpt2cu
# mixed precision cudnn test
make test_gpt2cu USE_CUDNN=1 && ./test_gpt2cu

diegoasua avatar Jul 15 '24 10:07 diegoasua


Both of the tests pass. Also put a screenshot of my torch versions above.

alvins82 avatar Jul 15 '24 10:07 alvins82

[Screenshot: test output]

alvins82 avatar Jul 15 '24 10:07 alvins82

Eyeballing your command line, I'd say your batch size is too small and is causing an exception in the HellaSwag eval. This is a known issue, and a patch merged into master now forces you to use a batch size >= 4.

gordicaleksa avatar Jul 19 '24 15:07 gordicaleksa