
[SYCL] Implementing async model loading for non mapped memory

Open OuadiElfarouki opened this issue 1 year ago • 1 comment

This patch implements most of the event APIs for the SYCL backend, fixes `set_tensor_async`, and enables async IO / H2D memory copies during model loading (similar to the CUDA backend implementation). Some load-time improvements:

  • Nvidia A100 40GB + LLaMa 3.1 70B Q4 : 27.6s (master) -> 5.8s (patch)

  • Intel Arc A770 + LLaMa 3.1 8B Q4 : 1.6s (master) -> 0.8s (patch)

  • [x] I have read the contributing guidelines

  • Self-reported review complexity:

    • [ ] Low
    • [x] Medium
    • [ ] High

OuadiElfarouki avatar Oct 01 '24 15:10 OuadiElfarouki

Fair enough @slaren, thanks for the hint. I'll keep the current PR as a draft until then.

OuadiElfarouki avatar Oct 01 '24 22:10 OuadiElfarouki

@OuadiElfarouki https://github.com/ggerganov/llama.cpp/pull/9707 got merged :tada:

Alcpz avatar Oct 04 '24 08:10 Alcpz