llama.cpp
CUDA non-determinism on identical requests
When layers are offloaded with CUDA, sending identical requests to the examples/server completion API returns a different response the "first time":
$ for x in `seq 5`; do curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content' ; done
" hollow, glowing, tinkling, crunchy, sinking"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
This seems cache-related: responses remain the same until a different prompt is processed, after which the differing first response occurs again:
$ curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Building a website is as simple as","n_predict":0}' >/dev/null
$ for x in `seq 5`; do curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content' ; done
" hollow, glowing, tinkling, crunchy, sinking"
" apartment, blouse, bobby, carousel"
" apartment, blouse, bobby, carousel"
..
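A quick way to probe the cache correlation (a sketch using only the /completion endpoint and host from the transcripts above): alternate between the two prompts and check that the divergent response reappears after every prompt switch:

# Alternate prompts; if the cache hypothesis holds, the first "Some random
# words:" response after each switch should differ from the repeats.
for i in 1 2 3; do
  curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Building a website is as simple as","n_predict":0}' >/dev/null
  for x in 1 2; do
    curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content'
  done
done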
Expected Behaviour
Output should remain the same when parameters and seed are constant.
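A mechanical check of that expectation (a sketch with the same endpoint and parameters as above) is to count distinct responses; a deterministic server should print 1:

# 5 identical requests; deterministic output => exactly 1 unique line.
for x in `seq 5`; do
  curl -s -X POST --url 'http://miku:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content'
done | sort -u | wc -l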
Other Observations
- Not observed with Metal offload.
- Not observed without CUDA offload; interestingly, also not with a small n-gpu-layers value: e.g. CodeLlama-34b shows this behaviour with -ngl 3 but not with -ngl 2 (see the bisection sketch after this list).
- Behaviour observed with both non-K and K-quants.
- The first response ("hollow, glowing.." above) is what examples/main returns with the same parameters.
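To find the -ngl threshold mentioned above, something like the following rough sketch can be used (the model path, port, and load wait are assumptions; adjust for your setup):

MODEL=models/codellama-34b.Q4_K_M.gguf   # placeholder path
for NGL in 0 1 2 3 4; do
  ./server -m "$MODEL" -ngl "$NGL" --port 8080 >/dev/null 2>&1 &
  PID=$!
  sleep 30   # crude wait for the model to finish loading
  N=$(for x in `seq 5`; do
        curl -s -X POST --url 'http://localhost:8080/completion' --data '{"prompt":"Some random words:","n_predict":50,"seed":1337}' | jq '.content'
      done | sort -u | wc -l)
  echo "-ngl $NGL => $N distinct response(s)"
  kill "$PID"; wait "$PID" 2>/dev/null
done

On a fresh server the bug shows up as 2 distinct responses (the first one differs); a clean -ngl value should give 1.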
Environment
- Verified behaviour on latest master commit, compiled with LLAMA_CUBLAS=1 make -j
- Linux 5.15.0-79-generic x86_64
- NVIDIA 535.86.05
- CUDA 12.2
- Python 3.10.12
- GNU Make 4.3
- g++ 11.4.0
Thanks for reading! 😎