Abhay Saxena

Showing 40 comments by Abhay Saxena

The Telepresence documentation has a [list of dependencies](https://www.telepresence.io/reference/install.html#dependencies), but it does not explain how to install these dependencies on common platforms. We could use your help to fix this! Can...

Excellent! Roughly speaking, you've implemented a variant of option 1 by hand. I agree that your approach should work in many cases. I am going to experiment some...

Not sure whether you're ready for feedback on this, but I'm _very_ excited for this feature.

```
llama-server --model ${gguf}/DeepSeek-V3.1-Terminus-UD-Q4_K_XL-00001-of-00008.gguf \
  --alias deepseek/deepseek-v3.1-terminus --jinja -fa on \
  --reasoning-budget 0 --reasoning-format deepseek --fit-ctx...
```
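(For context, assuming a recent `llama-server` build: `--jinja` enables Jinja chat templates, `-fa on` forces flash attention, `--reasoning-budget 0` disables thinking output, and `--reasoning-format deepseek` controls how reasoning content is returned. The `--fit-ctx...` at the end is truncated in the digest and belongs to the fit feature under discussion.)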

> does it work without `--cache-ram`?

No. Same error, down to the number: `failed to allocate CUDA1 buffer of size 11693719552`
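(For scale: 11693719552 bytes / 1024³ ≈ 10.9 GiB that the allocator could not find on the second GPU, CUDA1.)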

Same command as before, except without `--cache-ram`, yields

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9,...
```

Log file: [ark3-fit.log](https://github.com/user-attachments/files/23196819/ark3-fit.log)

It worked! VRAM usage is pretty good too: 89% and 96% once I added `-ub 4096 -b 8192`, without which prompt processing (PP) was unbearably slow. More importantly, I was able to...
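(A minimal sketch of that launch, assuming the same model and flags as the earlier command; only the batch-size flags are new. `-ub` sets the physical micro-batch size (`--ubatch-size`) and `-b` the logical batch size (`--batch-size`).)

```sh
# Sketch: previous launch plus larger batch sizes for faster prompt processing.
# Model path and alias are from the earlier comment; the flags elided there
# by truncation are omitted here as well.
llama-server \
  --model ${gguf}/DeepSeek-V3.1-Terminus-UD-Q4_K_XL-00001-of-00008.gguf \
  --alias deepseek/deepseek-v3.1-terminus \
  --jinja -fa on \
  -ub 4096 -b 8192
```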

Can everything your `--fit` code determines be expressed as a set of `-ot` options and the like? If so, would it be possible to have a separate utility that does...
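(For readers unfamiliar with `-ot`: it is short for `--override-tensor` and takes `<tensor-name regex>=<buffer type>` pairs, so in principle a fit result could be serialized as a list of such pairs. A made-up illustration, not output from the fit code:)

```sh
# Hypothetical hand-written placement of the kind --fit might emit:
# offload all layers (-ngl 99) but force the FFN expert tensors of
# blocks 30-49 onto the CPU. The regex and layer range are invented.
llama-server --model model.gguf -ngl 99 \
  -ot 'blk\.(3[0-9]|4[0-9])\.ffn_.*_exps.*=CPU'
```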

Great! In case the motivation is not obvious: that would provide an alternative path if there is resistance to adding this feature because of the new startup-time cost.

The output from `llama-fit-params` appears to match what `llama-server` does. The result this time is slightly worse on VRAM usage (87% and 97%, per `nvtop`) than last time, but still...