Chi Kim
Is there a disadvantage? If not, shouldn't it be enabled by default for everyone, except on systems that can't support it?
Does Ollama support any embedding models yet? If so, which ones, and where can I get them?
I know llama.cpp is designed with the CPU in mind, but is there a Python library to run quantized GGML models on Colab with a GPU for faster results?
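For reference, this is roughly what I mean, a minimal sketch assuming llama-cpp-python with GPU offload (the install flag can vary by version, and the model filename is just a placeholder):

```python
# Install llama-cpp-python with GPU support on Colab first, e.g.
#   !CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
from llama_cpp import Llama

# The model path is a placeholder; point it at your quantized model file.
llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf", n_gpu_layers=-1)  # -1 = offload all layers to the GPU

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```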
@jmorganca could you expose these API endpoints in Ollama? The llama.cpp server has POST /tokenize and POST /detokenize. https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md Then we could just count the number of tokens after tokenizing.
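For example, a rough sketch of the token counting I have in mind, assuming a llama.cpp server running on its default port 8080:

```python
import requests

# POST the text to the llama.cpp server's /tokenize endpoint and count the tokens it returns.
resp = requests.post(
    "http://localhost:8080/tokenize",
    json={"content": "How many tokens is this sentence?"},
)
tokens = resp.json()["tokens"]
print(f"{len(tokens)} tokens")
```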
Make sure you specify a seed, so the random generator uses the same seed every time.
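Something like this, assuming a local Ollama server on the default port (the model name and prompt are just placeholders):

```python
import requests

# A fixed seed plus temperature 0 should make the output repeatable across runs.
resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama2",
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"seed": 42, "temperature": 0},
})
print(resp.json()["response"])
```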
Awesome! Does it work for the API as well?
I already did, even before Ollama had it. I'm requesting it for other people.
Try 0.1.32. It works nicely for me. https://github.com/ollama/ollama/releases
I'm using a package that uses tqdm to print progress to the console. Is there a way to capture the output without modifying the package I'm using, so I can display it on...
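One workaround I'm considering, assuming tqdm writes to stderr (its default) and the bars are created while the redirect is active:

```python
import contextlib
import io
import time
from tqdm import tqdm

buf = io.StringIO()
# Redirecting stderr captures tqdm's progress output without touching the package itself.
with contextlib.redirect_stderr(buf):
    for _ in tqdm(range(3)):
        time.sleep(0.1)

print("captured:", buf.getvalue())
```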
It seems like setting no_stream fixed the issue with clicking regenerate. However, if you have generation attempts set to 5, it sounds like whenever it generates a new response, it interrupts the audio and...