Vincent Bosch
I agree that inference should be cancelled automatically as soon as the connection is lost. However, I am more curious as to why the connection drops at all. At first I thought it was...
I tested a bit more. Even when the model is already loaded, sending a large prompt still triggers a timeout. I think that the timeout should be removed when using...
@danny-avila Did you have a chance to look into this issue? Would be great if timeouts for custom endpoints can be changed and/or disabled completely. Thanks!
@MasterJH5574 Thanks for the quick response! I just updated to the latest nightly and retried. Small draft-mode does work now; however, the speed when running with the small draft is slower than...
Just tried it with the latest nightly 274 and the issue is still present.
Update: I retried with a smaller context size: 8192 instead of the model's full context size (32768). Now the model loads correctly and I can interact with it. The...
In addition, big-AGI reports the following error: "**[Service Issue] Openai**: fetch failed - SocketError: other side closed · {"name":"SocketError","code":"UND_ERR_SOCKET","socket":"
> I already made a PR to MLX-LM to support 1B. > > https://github.com/ml-explore/mlx-examples/pull/1336 You're very quick! Great work! Would that PR also work for text only use of larger...
I just tried the converted model in "--chat"-mode, but in response to a text-only query I get only `<pad>` as output
I have just converted the model from HF to GGUF and then quantized it to Q8 with the following extra options: `--leave-output-tensor --token-embedding-type f16`. The model seems to be responding quite well,...
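For reference, the conversion and quantization described above can be sketched with llama.cpp's standard tooling. This is a minimal sketch, assuming the usual `convert_hf_to_gguf.py` script and `llama-quantize` binary; the model directory and output file names are placeholders, not taken from the original comment:

```shell
# Convert the Hugging Face checkpoint to GGUF at f16 precision
# (./my-hf-model is a placeholder path)
python convert_hf_to_gguf.py ./my-hf-model \
    --outfile model-f16.gguf --outtype f16

# Quantize to Q8_0, keeping the output tensor unquantized
# (--leave-output-tensor) and the token-embedding tensor at f16
./llama-quantize --leave-output-tensor --token-embedding-type f16 \
    model-f16.gguf model-Q8_0.gguf Q8_0
```

Leaving the output and embedding tensors at higher precision is a common way to reduce quality loss from quantization at a small cost in file size.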