llama.cpp

CUDA non-determinism on identical requests

Open phiharri opened this issue 1 year ago • 13 comments

When layers are offloaded with CUDA, sending identical requests to the examples/server completion API returns a different response the first time than on every subsequent request:

$ for x in `seq 5`; do curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content' ; done
" hollow, glowing, tinkling, crunchy, sinking"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"

This seems cache-related: responses then remain the same until a different prompt is processed, after which the differing first response occurs again:

$ curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Building a website is as simple as","n_predict":0}' >/dev/null

$ for x in `seq 5`; do curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content' ; done
" hollow, glowing, tinkling, crunchy, sinking"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
..

Expected Behaviour

Output should remain the same when parameters and seed are constant.
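
A minimal sketch of how this expectation could be checked automatically, reusing the same request as above. The `SERVER_URL` variable and the helper names `complete` and `count_distinct` are made up for illustration and are not part of llama.cpp; the host is the one from my setup, so adjust it for yours.

```shell
#!/usr/bin/env bash
# Sketch only: SERVER_URL, complete and count_distinct are illustrative names,
# not something shipped with llama.cpp.
SERVER_URL="${SERVER_URL:-http://miku:8080/completion}"

# Send one completion request with fixed parameters and seed.
# jq prints the content as a JSON string, so each response stays on one line.
complete() {
  curl -s -X POST --url "$SERVER_URL" \
    --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' \
    | jq '.content'
}

# Read one response per line on stdin and print the number of distinct lines.
count_distinct() {
  sort -u | wc -l | tr -d ' '
}

# Usage (needs a running server, so it is left commented out):
# for x in $(seq 5); do complete; done | count_distinct
```

With deterministic output the usage line should print 1. For the five responses shown above it would print 2, because the first response differs from the other four.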

Other Observations

  • Not observed with Metal offload.
  • Not observed without CUDA offload, and interestingly not with a small n-gpu-layers value either (e.g. CodeLlama-34b shows this behaviour with -ngl 3 but not with -ngl 2).
  • Behaviour observed with both non-K and K-quants.
  • The first response ("hollow, glowing.." above) is what examples/main returns with the same parameters.

Environment

  • Verified behaviour on latest master commit, compiled with LLAMA_CUBLAS=1 make -j
  • Linux 5.15.0-79-generic x86_64
  • NVIDIA 535.86.05
  • CUDA 12.2
  • Python 3.10.12
  • GNU Make 4.3
  • g++ 11.4.0

Thanks for reading! 😎

phiharri, Aug 27 '23 17:08