Johannes Gäßler
`./llama-bench` and `./perplexity` are broken with this PR, which is why I'm using `./main` for testing. Results for LLaMA 2 7b q4_0:

| GPU | Test | t/s master |...
Some single GPU test results:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
|-------------|-----------------|-------|------------|--------|---------|
| 1x RTX 4090 | LLaMA 3 8b f16...
I just noticed that instead of using GitHub's built-in draft feature you added "DRAFT:" to the title. Please let me know when you think the PR is ready for review,...
How about this: in terms of performance, I think it would make sense to double-buffer ggml graph creation. So essentially start creating the next ggml graph while the previous...
I just opened a large PR for multi GPU support: https://github.com/ggerganov/llama.cpp/pull/1607
I didn't merge this PR because I wanted someone else to check it as well; as I said, I'm not very knowledgeable about Docker.
I think this is not an issue with the code but rather with the model from that particular repository. The repository says it took the importance matrix from another repository...
Anyways, I've uploaded the importance matrix I generated: https://huggingface.co/JohannesGaessler/llama.cpp_importance_matrices
Good to know, thanks for investigating.
>First, I don't know where JohannesGaessler got his wrong info from, but the repository of course does/did not say that the imatrix is from another repository - he made this...