Pierrick Hymbert
I was not aware, but this is not asserted in the parallel test suite AFAIK. Also, I recall that each architecture generates different results.
@TevinWang do you need help with the global variable issue? @ggerganov do you confirm the issue with the proposed approach? IMHO this contribution is valuable as the console output...
@rgerganov Nice to meet you :D
Hi, you should have a look at llama.cpp: https://github.com/ggerganov/llama.cpp/blob/f184dd920852d6d372b754f871ee06cfe6f977ad/llama.cpp#L13599
Please submit a PR :)
@ggerganov @ngxson @slaren I'd appreciate your early feedback on the approach before I implement too much.
> Servers with T4 GPU are usually "shared CPU but dedicated GPU". I believe that's also the case with other GPU like A100 or A10G, but not sure if it's...
@ggerganov We need to keep this in mind:

> Warning: We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially...
@ggerganov what about the defragmentation target for the [baseline](https://github.com/ggerganov/llama.cpp/pull/6283/files#diff-a5e740be96415373789689f814583e93ff2a8f05eae6481e94505fd6cb6bc6a7)? Without it, I see a lot of: `update_slots : failed to find free space in the KV cache, retrying with smaller n_batch =...
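For illustration, a minimal sketch of what enabling the defragmentation threshold on the server could look like (model path, context size and threshold value are placeholders, not the actual benchmark configuration):

```sh
# Hypothetical baseline invocation; model path and values are placeholders.
# --defrag-thold triggers KV cache defragmentation once fragmentation
# exceeds the given fraction (negative values disable it).
./server -m models/model.gguf \
    --parallel 8 \
    --ctx-size 16384 \
    --defrag-thold 0.1
```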
First workflow ready to receive feedback:
- comment added automatically: https://github.com/phymbert/llama.cpp/pull/1#issuecomment-2018674798
- workflow running on the Azure T4 self-hosted GitHub runner: https://github.com/phymbert/llama.cpp/actions/runs/8428687623/job/23082119753
- code: https://github.com/ggerganov/llama.cpp/pull/6283

Based on this, we...
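As a rough sketch of the shape such a workflow takes (trigger, runner labels and build flags here are assumptions for illustration, not the actual content of #6283):

```yml
# Hypothetical minimal workflow; trigger, labels and build flags are
# placeholders, not the actual content of #6283.
name: Benchmark

on:
  workflow_dispatch:

jobs:
  bench:
    # Route the job to the self-hosted GPU runner via its labels.
    runs-on: [self-hosted, Linux, X64]
    steps:
      - uses: actions/checkout@v4
      - name: Build server
        run: |
          mkdir build && cd build
          cmake .. -DLLAMA_CUBLAS=ON
          cmake --build . --target server
```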