llama.cpp
CUDA non-determinism on identical requests
When layers are offloaded with CUDA, sending identical requests to the examples/server completion API returns a different response the "first time":
$ for x in `seq 5`; do curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content' ; done
" hollow, glowing, tinkling, crunchy, sinking"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
This seems cache-related: responses remain the same until a different prompt is processed, after which the differing first response occurs again:
$ curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Building a website is as simple as","n_predict":0}' >/dev/null
$ for x in `seq 5`; do curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content' ; done
" hollow, glowing, tinkling, crunchy, sinking"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
..
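A quick way to probe the cache correlation (a sketch using only the /completion endpoint and host from the transcripts above): alternate between the two prompts and check that the divergent response reappears after every prompt switch:

# Alternate prompts; if the cache hypothesis holds, the first "Some random
# words:" response after each switch should differ from the repeats.
for i in 1 2 3; do
  curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Building a website is as simple as","n_predict":0}' >/dev/null
  for x in 1 2; do
    curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content'
  done
done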
Expected Behaviour
Output should remain the same when parameters and seed are constant.
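A mechanical check of that expectation (a sketch with the same endpoint and parameters as above) is to count distinct responses; a deterministic server should print 1:

# 5 identical requests; deterministic output => exactly 1 unique line.
for x in `seq 5`; do
  curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content'
done | sort -u | wc -l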
Other Observations
- Not observed with Metal offload.
- Not observed without CUDA offload; interestingly, also not with a small n-gpu-layers value: e.g. CodeLlama-34b shows this behaviour with -ngl 3 but not with -ngl 2 (see the bisection sketch after this list).
- Behaviour observed with both non-K and K-quants.
- The first response ("hollow, glowing.." above) is what examples/main returns with the same parameters.
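To find the -ngl threshold mentioned above, something like the following rough sketch can be used (the model path, port, and load wait are assumptions; adjust for your setup):

MODEL=models/codellama-34b.Q4_K_M.gguf   # placeholder path
for NGL in 0 1 2 3 4; do
  ./server -m "$MODEL" -ngl "$NGL" --port 8080 >/dev/null 2>&1 &
  PID=$!
  sleep 30   # crude wait for the model to finish loading
  N=$(for x in `seq 5`; do
        curl -s -X POST --url 'http://localhost:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content'
      done | sort -u | wc -l)
  echo "-ngl $NGL => $N distinct response(s)"
  kill "$PID"; wait "$PID" 2>/dev/null
done

On a fresh server the bug shows up as 2 distinct responses (the first one differs); a clean -ngl value should give 1.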
Environment
- Verified behaviour on latest master commit, compiled with LLAMA_CUBLAS=1 make -j
- Linux 5.15.0-79-generic x86_64
- NVIDIA 535.86.05
- CUDA 12.2
- Python 3.10.12
- GNU Make 4.3
- g++ 11.4.0
Thanks for reading! 😎