Claudio Montanari

2 comments by Claudio Montanari

You should be able to disable prefix caching by starting the server with `PREFIX_CACHING=0`. That's how I got the `llama 3.2 vision` models to work.
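A minimal sketch of what that looks like, assuming the server reads the `PREFIX_CACHING` environment variable at startup (the launcher command and model name below are illustrative, not from the original comment):

```shell
# Disable prefix caching before launching the server.
# PREFIX_CACHING=0 comes from the comment above; everything after
# the export is a hypothetical example of a typical launch step.
export PREFIX_CACHING=0

# e.g. (hypothetical launcher/model names):
#   text-generation-launcher --model-id meta-llama/Llama-3.2-11B-Vision-Instruct

# Confirm the variable is set in the server's environment.
echo "PREFIX_CACHING=$PREFIX_CACHING"
```

The same variable can also be passed inline for a one-off run, e.g. `PREFIX_CACHING=0 <launch-command>`, or via `docker run -e PREFIX_CACHING=0 ...` for containerized deployments.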

Hey, based on your logs I think this is expected behavior. The output of your `curl` to `/v1/chat/completions` reports `14` completion tokens. Based on your logs for the 1st request...