Xuan-Son Nguyen
`ggml_graph_dump_dot` is only used for debugging, so it is not important. `ggml_visit_parents` is called by `ggml_build_forward_expand`, which is called every time `llama_decode` is called. It's probably OK to remove `strlen` here, but...
If you want, another idea would be to add a pair of macros, `IS_STRING_NOT_EMPTY` / `IS_STRING_EMPTY`, and reuse them throughout the codebase
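For illustration only, a rough sketch of what those macros could look like (the names come from the suggestion above; none of this exists in the codebase, and the exact form is up for discussion):

```c
// Sketch of the proposed helpers: testing the first byte is O(1), whereas
// strlen() walks the whole string just to find out whether it is empty.
#define IS_STRING_EMPTY(str)     ((str)[0] == '\0')
#define IS_STRING_NOT_EMPTY(str) ((str)[0] != '\0')

// hypothetical call site, roughly what the strlen() check would become:
// if (IS_STRING_NOT_EMPTY(node->name)) { ... }
```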
The link to your model returns 404 Not Found. Anyway, did you check whether `added_tokens.json` is set correctly? (The JSON you posted above is from `tokenizer_config.json`.)
We can probably take advantage of the Hub API. For example, to list all files in a repo: `https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main`. This could potentially remove the need for `--hf-file` and `etag` checking
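For illustration, a minimal libcurl sketch of calling that endpoint and dumping the JSON file listing to stdout (this is not existing llama.cpp code, just a sketch of the API call; picking the right `.gguf` entry out of the JSON is left out):

```c
#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    // the Hub API "tree" endpoint mentioned above
    const char * url = "https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) {
        return 1;
    }

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    // no CURLOPT_WRITEFUNCTION set: the default callback writes the
    // response body (the JSON file listing) straight to stdout

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```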
Cool idea, it will be very useful for keeping track of llama.cpp's performance compared to "pure" GPU alternatives like TensorRT or exllama. > A [GitHub workflow](https://docs.github.com/en/actions/using-workflows), will: One thing I...
> Yes I understand, but is it a bare-metal server that is completely isolated? Servers with T4 GPUs are usually "shared CPU but dedicated GPU". I believe that's also...
> My point is that any hidden arcane state needs to be reset before running any benchmark script. At my company we have GitLab runners that are plugged into Docker on...
Seems interesting. I’m currently limited to working from my mobile phone, so I can’t have a look right now. I’ll try when I can
You can first merge the QLoRA adapter into the model (that will produce a new set of `.safetensors` files). Then use either `convert.py` or `convert-hf-to-gguf.py` to convert the safetensors model into...
~~The prompt is hard-coded, unfortunately. In llama.cpp, you can load an arbitrary prompt from a file.~~ ~~Sorry, I was so stupid that I hadn't looked at the condition `if (!params.prompt.empty())`. Seems like...