Georgi Gerganov

420 comments by Georgi Gerganov

It's already added: https://github.com/ggerganov/whisper.cpp/commit/9fe7306f4b16a974361b6a8bea370d6f5c3552f2 The old one is called `large-v1`. The new one is called just `large`. If you have already downloaded the old one, make sure to rename it...

Hi @regstuff, the Windows build is currently a weak point of the project, mainly because I don't have this operating system available to test with. Having a precompiled executable...

@RYucel See the Windows steps here: https://github.com/ggerganov/whisper.cpp/actions/runs/3517978497/workflow#L117-L144 Or check cross-compilation instructions here: https://github.com/ggerganov/whisper.cpp/issues/168

Users who want to support a certain template should open a PR and implement it in the framework that we already have.

Just sent you a collaborator invite. Edit: on second thought, I revoked the invite for the moment. I just noticed that your GitHub account is very new, so I hope...

When the context swap occurs and it has to re-evaluate the second half of the context (i.e. `n_ctx/2 = 1024` tokens), one of the "scratch" buffers runs out of memory....

Alright. Thank you very much for the help. I will update the target branch to disable flash attention when HIP is enabled for now

It's just that it hasn't been needed yet. You can either submit a PR implementing it, or you can use the existing `ggml_map_unary_f32()`, which allows you to write custom operators in...

You can easily modify the example to check for EOS token and stop

Here are results on V100 using:

```bash
# baseline
LLAMA_CUBLAS=1 make -j tests && ./tests/test-backend-ops -o ATTN -b CUDA0 perf

# flash attn
LLAMA_CUBLAS=1 make -j tests && ./tests/test-backend-ops -o...
```