Xuan-Son Nguyen
`ggml_graph_dump_dot` is only used for debugging, so it is not important. `ggml_visit_parents` is called by `ggml_build_forward_expand`, which is called every time `llama_decode` is called. It's probably OK to remove `strlen` here, but...
If you want, another idea would be to add a pair of macros, `IS_STRING_NOT_EMPTY` / `IS_STRING_EMPTY`, and reuse them throughout the codebase
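For illustration only, a rough sketch of what those macros could look like (the names come from the suggestion above; none of this exists in the codebase, and the exact form is up for discussion):

```c
// Sketch of the proposed helpers: testing the first byte is O(1), whereas
// strlen() walks the whole string just to find out whether it is empty.
#define IS_STRING_EMPTY(str)     ((str)[0] == '\0')
#define IS_STRING_NOT_EMPTY(str) ((str)[0] != '\0')

// hypothetical call site, roughly what the strlen() check would become:
// if (IS_STRING_NOT_EMPTY(node->name)) { ... }
```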
The link to your model returns 404 Not Found. Anyway, did you check whether `added_tokens.json` is set correctly? (The JSON you posted above is from `tokenizer_config.json`.)
We can probably take advantage of the Hub API. For example, to list all files in a repo: `https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main`. This could potentially remove the need for `--hf-file` and `etag` checking
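For illustration, a minimal libcurl sketch of calling that endpoint and dumping the JSON file listing to stdout (this is not existing llama.cpp code, just a sketch of the API call; picking the right `.gguf` entry out of the JSON is left out):

```c
#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    // the Hub API "tree" endpoint mentioned above
    const char * url = "https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B/tree/main";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) {
        return 1;
    }

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    // no CURLOPT_WRITEFUNCTION set: the default callback writes the
    // response body (the JSON file listing) straight to stdout

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK) {
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```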
Cool idea, it will be very useful for keeping track of llama.cpp's performance compared to "pure" GPU alternatives like TensorRT or exllama. > A [GitHub workflow](https://docs.github.com/en/actions/using-workflows), will: One thing I...
> Yes I understand, but is it a bare-metal server that is completely isolated? Servers with T4 GPUs are usually "shared CPU but dedicated GPU". I believe that's also...
> My point is that any hidden arcane state needs to be reset before running any benchmark script. At my company we have GitLab runners that are plugged into Docker on...
Seems interesting. I’m currently limited to working from my mobile phone, so I can’t have a look right now. I’ll try when I can
You can first merge the QLoRA adapter into the model (that will produce a new set of `.safetensors` files). Then use either `convert.py` or `convert-hf-to-gguf.py` to convert the safetensors model into...
~~The prompt is hard-coded, unfortunately. In llama.cpp, you can load an arbitrary prompt from a file.~~ ~~Sorry, I was so stupid that I hadn't looked at the condition `if (!params.prompt.empty())`. Seems like...