Better support for GPU and Flash Attention during inference

Open • vikhyat opened this issue 1 year ago • 1 comment

The inference code in this repository currently forces moondream to run on the CPU. We should let users opt in to GPU execution and Flash Attention for faster inference.
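A rough sketch of what opt-in device selection could look like, assuming the inference code is PyTorch-based. The helper names `pick_device` and `inference_options` and the option keys are hypothetical and not part of this repository. The sketch falls back to CPU whenever PyTorch or CUDA is unavailable, and only enables Flash Attention when the `flash_attn` package is installed and a GPU is in use.

```python
from importlib.util import find_spec


def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" if the user wants a GPU and one is usable, else "cpu".

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not prefer_gpu:
        return "cpu"
    # Fall back to CPU if PyTorch is not installed at all.
    if find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


def inference_options(prefer_gpu: bool = True,
                      prefer_flash_attention: bool = True) -> dict:
    """Collect device, dtype, and attention settings as a plain dict.

    Hypothetical helper: the keys are an assumption about what the
    model-loading code might consume, not an existing API.
    """
    device = pick_device(prefer_gpu)
    on_gpu = device == "cuda"
    # Flash Attention needs a CUDA device and half-precision weights,
    # so it is only enabled when running on GPU with the package present.
    use_flash = (
        on_gpu
        and prefer_flash_attention
        and find_spec("flash_attn") is not None
    )
    return {
        "device": device,
        "dtype": "float16" if on_gpu else "float32",
        "flash_attention": use_flash,
    }


if __name__ == "__main__":
    print(inference_options())
```

Keeping both choices behind flags that default to "use it if available" means CPU-only users see no change, while GPU users get the speedup without editing the code.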

Jan 24 '24 22:01 vikhyat

Added CUDA support in #22

Jan 25 '24 12:01 spartanhaden