llama.cpp
[SYCL] Implementing async model loading for non mapped memory
This patch implements most of the event APIs for the SYCL backend, fixes `set_tensor_async`, and enables async IO / H2D memory copies for model loading (similar to the CUDA backend implementation).
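As a rough illustration of the loading pattern described above (not the actual llama.cpp/SYCL code), the sketch below double-buffers host staging memory so the next chunk is read while the previous chunk's copy is still in flight. `pipelined_copy` is a hypothetical name, and `std::async`/`std::future` stand in for a SYCL queue submission and its `sycl::event`; the "device" copy is simulated with `memcpy`.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <future>
#include <vector>

// Hypothetical sketch: overlap "disk reads" with "H2D copies" using two
// staging buffers. The std::future plays the role of a SYCL event that the
// caller waits on before reusing a staging buffer.
static void pipelined_copy(const uint8_t* src, uint8_t* dst, size_t size, size_t chunk) {
    std::vector<uint8_t> staging[2] = {
        std::vector<uint8_t>(chunk), std::vector<uint8_t>(chunk)
    };
    std::future<void> upload;  // in-flight "device" copy, if any
    int cur = 0;
    for (size_t off = 0; off < size; off += chunk) {
        const size_t n = std::min(chunk, size - off);
        // "Read" the next chunk into the free staging buffer; the other
        // buffer may still be owned by the in-flight copy.
        std::memcpy(staging[cur].data(), src + off, n);
        // Wait for the previous copy before submitting a new one
        // (analogous to waiting on the previous event).
        if (upload.valid()) upload.wait();
        upload = std::async(std::launch::async, [&, cur, off, n] {
            std::memcpy(dst + off, staging[cur].data(), n);  // simulated H2D copy
        });
        cur ^= 1;  // switch staging buffers
    }
    if (upload.valid()) upload.wait();  // drain the last copy
}
```

With real device memory the simulated `memcpy` would be an async queue copy, and the win comes from the file read of chunk N+1 overlapping the device transfer of chunk N.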
Some improvement figures (load time):

- Nvidia A100 40GB + LLaMa 3.1 70B Q4: 27.6s (master) -> 5.8s (patch)
- Intel Arc A770 + LLaMa 3.1 8B Q4: 1.6s (master) -> 0.8s (patch)
- [x] I have read the contributing guidelines
- Self-reported review complexity:
  - [ ] Low
  - [x] Medium
  - [ ] High
Fair enough @slaren, thanks for the hint. I'll keep this PR as a draft until then.
@OuadiElfarouki https://github.com/ggerganov/llama.cpp/pull/9707 got merged :tada: