frob


Note there will be a delay after setting `num_gpu` as the model is reloaded into RAM.

```console
$ ollama run deepseek-r1:671b-fixed --verbose
>>> hello
...
eval rate: 18.79 tokens/s
>>>...
```
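If it helps, `num_gpu` can also be passed per request through the REST API's `options` field instead of being baked into the model. A minimal sketch, assuming the default server address and an arbitrary layer count of 62:

```console
$ curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:671b-fixed",
  "prompt": "hello",
  "options": { "num_gpu": 62 }
}'
```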

Size of the server grows because of changes to the runners.

```console
$ ps -C "$(echo ollama:0.3.{0..12})" -o comm,rss
COMMAND            RSS
ollama:0.3.0    576792
ollama:0.3.1    575288
ollama:0.3.2    575684
ollama:0.3.3    516652
ollama:0.3.4...
```

```
COMMAND            RSS
ollama:0.3.0    590152
ollama:0.3.1    588480
ollama:0.3.2    589428
ollama:0.3.3    587956
ollama:0.3.4    591080
ollama:0.3.5    590044
ollama:0.3.6    588816
ollama:0.3.7    903948
ollama:0.3.8    903636
ollama:0.3.9    903632
ollama:0.3.10  1060860
ollama:0.3.11  1065592
ollama:0.3.12  1066252
ollama:0.3.13...
```

```yaml
services:
  ollama:
    environment:
      OLLAMA_FLASH_ATTENTION: 1
```
or
```
docker run -e OLLAMA_FLASH_ATTENTION=1 ollama/ollama
```
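For a native install managed by systemd, the same variable can go into a unit override; a rough sketch, assuming the stock `ollama.service` unit:

```console
$ sudo systemctl edit ollama.service
# in the override, under [Service], add:
#   Environment="OLLAMA_FLASH_ATTENTION=1"
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama
```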

Other than tuning the prompts, there's no mechanism for that at the moment. Potentially relevant: https://github.com/ollama/ollama/issues/2415, https://github.com/ollama/ollama/issues/8110

According to Wikipedia, it has a compute capability of [5.0](https://en.wikipedia.org/wiki/CUDA#:~:text=K620M%2C%20NVS%20810-,Tesla%20M10,-5.2), so yes, but it's not listed on Nvidia's compute capability [page](https://developer.nvidia.com/cuda-gpus).
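You can also ask the driver directly on the machine itself; a quick sketch, assuming an nvidia-smi recent enough to support the `compute_cap` query field:

```console
$ nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```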

Vulkan (https://github.com/ollama/ollama/pull/11835) will restore support.

```
7月 24 15:40:33 buaa-KVM ollama[458186]: llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
7月 24 15:40:33 buaa-KVM ollama[458186]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/src/llama.cpp:5570: unicode_cpts_from_utf8(word).size() > 0
```
The model is not supported...

https://github.com/ggerganov/llama.cpp/pull/7795

Partly downloaded models will be removed if you restart the server. If you make room on the filesystem and restart the download, the previously downloaded parts of the model will...
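As a rough illustration (assuming a default install where the layer blobs live under `~/.ollama/models`, and using an arbitrary model name), you can check how much space the partial layers occupy and then re-run the pull to restart the download:

```console
$ du -sh ~/.ollama/models/blobs
$ ollama pull deepseek-r1:671b
```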