Yes, some pre-processing rules are ignored (not yet implemented), which may cause subtle differences. I may add them later; at present I am busy adding more models.
This does not solve the problem. Maybe you could add another API with three additional parameters: `token_timestamps`, `split_on_word`, and `max_len`.
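A rough sketch of what such an API could look like — the struct and function names here are purely illustrative, not the library's actual interface; only the three parameter names come from the suggestion above:

```
// Hypothetical shape of the proposed API (illustrative names only).
// It mirrors the existing transcription entry point and adds the
// three parameters requested above.
struct transcribe_params {
    bool token_timestamps = false; // emit a timestamp for every token
    bool split_on_word    = false; // split segments on word boundaries
    int  max_len          = 0;     // max segment length (0 = unlimited)
};

// A second entry point alongside the existing one, so current callers
// are unaffected.
int transcribe(const float * samples, int n_samples,
               const transcribe_params & params);
```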
GPU acceleration is not supported yet.
@cagev It is OK now.
This is dedicated to those who are GPU-poor, but stay tuned. 😄
> @foldl I tried to build it with GPU support by using `cmake -B build-gpu -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DBUILD_SHARED_LIBS=ON` (it works fine for llama.cpp), but compilation fails with errors like this:...
@MoonRide303 This can now be built against CUDA, but I guess only a few models work.
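For reference, a minimal CUDA build along the lines of the quoted command (the `GGML_CUDA_F16` and shared-library flags from above are optional extras):

```
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release
```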
How about `-ngl 10`? I have tested both Vulkan & CUDA; it works with this model.
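That is, offload only part of the model to the GPU when VRAM is limited — something like the following, where `model.bin` is just a placeholder:

```
main -m model.bin -ngl 10
```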
@MoonRide303 Sorry for the incorrect information. I have tested Qwen2.5 7B & Llama3.1 8B with CUDA. Note: generally, models with `lm_head` tied to the embedding weights do not work.

```
build-cuda\bin\Release\main.exe -m...
```
Use `-l` (i.e. `--max_length`) to reduce VRAM usage. `-c` is used by the context-extending method. The naming is a bit confusing.
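So, to cut VRAM usage, cap the maximum length rather than touching `-c` — for example (the model file name is a placeholder, and 1024 is just an example cap):

```
main -m model.bin -l 1024
```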