llama.cpp
[SYCL] Implementing async model loading for non mapped memory
This patch implements most of the event APIs for the SYCL backend, fixes `set_tensor_async`, and enables async IO / H2D memory copies for model loading (similar to the CUDA backend implementation).
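As a rough illustration of the loading pattern described above (not the actual llama.cpp/SYCL code), the sketch below double-buffers host staging memory so the next chunk is read while the previous chunk's copy is still in flight. `pipelined_copy` is a hypothetical name, and `std::async`/`std::future` stand in for a SYCL queue submission and its `sycl::event`; the "device" copy is simulated with `memcpy`.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <future>
#include <vector>

// Hypothetical sketch: overlap "disk reads" with "H2D copies" using two
// staging buffers. The std::future plays the role of a SYCL event that the
// caller waits on before reusing a staging buffer.
static void pipelined_copy(const uint8_t* src, uint8_t* dst, size_t size, size_t chunk) {
    std::vector<uint8_t> staging[2] = {
        std::vector<uint8_t>(chunk), std::vector<uint8_t>(chunk)
    };
    std::future<void> upload;  // in-flight "device" copy, if any
    int cur = 0;
    for (size_t off = 0; off < size; off += chunk) {
        const size_t n = std::min(chunk, size - off);
        // "Read" the next chunk into the free staging buffer; the other
        // buffer may still be owned by the in-flight copy.
        std::memcpy(staging[cur].data(), src + off, n);
        // Wait for the previous copy before submitting a new one
        // (analogous to waiting on the previous event).
        if (upload.valid()) upload.wait();
        upload = std::async(std::launch::async, [&, cur, off, n] {
            std::memcpy(dst + off, staging[cur].data(), n);  // simulated H2D copy
        });
        cur ^= 1;  // switch staging buffers
    }
    if (upload.valid()) upload.wait();  // drain the last copy
}
```

With real device memory the simulated `memcpy` would be an async queue copy, and the win comes from the file read of chunk N+1 overlapping the device transfer of chunk N.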
Some improvement figures (load time):

- Nvidia A100 40GB + LLaMa 3.1 70B Q4: 27.6s (master) -> 5.8s (patch)
- Intel Arc A770 + LLaMa 3.1 8B Q4: 1.6s (master) -> 0.8s (patch)
- [x] I have read the contributing guidelines
- Self-reported review complexity:
  - [ ] Low
  - [x] Medium
  - [ ] High
Fair enough @slaren, thanks for the hint. I'll keep this PR as a draft until then.
@OuadiElfarouki https://github.com/ggerganov/llama.cpp/pull/9707 got merged :tada: