Alex Cheema

Not a problem on Macs, since they only have one GPU. On other machines, we default to tinygrad. Right now we pick the `DEFAULT` device. The desired behaviour needs to...
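A minimal sketch of what "desired behaviour" could look like: instead of always taking the default device, rank the available backends and pick the best one. The backend names and the ranking below are assumptions for illustration, not exo's or tinygrad's actual policy.

```python
# Hypothetical device picker, not exo's actual code. tinygrad exposes
# Device.DEFAULT; this sketch instead chooses from the backends that are
# actually available, using an assumed preference order.

PREFERRED_ORDER = ["CUDA", "METAL", "GPU", "CLANG", "CPU"]  # assumed ranking

def pick_device(available):
    """Return the highest-priority available backend, else the first one."""
    for name in PREFERRED_ORDER:
        if name in available:
            return name
    return available[0]

print(pick_device(["CLANG", "CUDA"]))  # CUDA outranks CLANG in this sketch
```

The preference list would need tuning per platform; the point is only that selection becomes explicit rather than whatever `DEFAULT` happens to be.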

- Right now we already parallelise model downloads, which works great and speeds things up a lot
- However, loading the model into memory can also be slow for large...
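The parallel-download part can be sketched with a thread pool. This is a hypothetical illustration (the `fetch_shard` placeholder stands in for a real HTTP download of one shard file), not exo's actual downloader:

```python
# Sketch: download independent model shards concurrently.
# Assumption: the model is split into separately fetchable shard files.
from concurrent.futures import ThreadPoolExecutor

def fetch_shard(url):
    # Placeholder for a real HTTP download; returns the "bytes" fetched.
    return f"contents of {url}"

def download_all(urls, workers=4):
    # Downloads run concurrently; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_shard, urls))

shards = download_all([f"shard-{i}.safetensors" for i in range(3)])
```

The slower step the issue points at, loading weights into memory, does not parallelise as easily since it is bound by disk and memory bandwidth rather than network latency.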
- The deliverable here is to be able to run **existing** quantized models with the tinygrad inference engine
- Bonus (+$200) bounty as an easy follow-up is to add...
This only happens with `BEAM=1`; `BEAM=0`, `BEAM=2`, and `BEAM=3` all work fine. This happens because exo runs tinygrad inference on another thread. Example command to reproduce: `DEBUG=6 BEAM=1 python3 main.py --inference-engine...
- This is a follow-up to #148
- In general, model weights on huggingface are a bit of a mess because of different implementations in ML libraries. For example,...
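One common way to bridge these differing conventions is a rename table applied to checkpoint keys. The two naming schemes below are illustrative examples of the kind of mismatch seen between implementations, not a mapping taken from any specific model:

```python
# Hypothetical remapping between two checkpoint naming conventions,
# e.g. "transformer.h.0.attn.q.weight" vs
# "model.layers.0.self_attn.q_proj.weight".

RENAMES = {  # assumed substring mapping, for illustration only
    "transformer.h.": "model.layers.",
    ".attn.q.": ".self_attn.q_proj.",
}

def remap_key(key):
    for old, new in RENAMES.items():
        key = key.replace(old, new)
    return key

def remap_state_dict(weights):
    # Rename every key; tensor values pass through untouched.
    return {remap_key(k): v for k, v in weights.items()}
```

Real converters also have to handle fused vs. split attention projections and transposed weight layouts, so a pure rename table is only the easy half of the problem.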
- Automatically select the best interface/networking for a given node, e.g. prioritise Thunderbolt over WiFi and, when Thunderbolt becomes available, automatically switch over
- More quickly detect when...
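The prioritise-and-switch behaviour can be sketched as a ranking over the interfaces currently up. The interface names and the ranking are assumptions for illustration; a real implementation would also measure link quality rather than rank by type alone:

```python
# Sketch of interface prioritisation, with an assumed static ranking.

PRIORITY = {"thunderbolt": 0, "ethernet": 1, "wifi": 2}  # lower = preferred

def best_interface(up_interfaces):
    # Pick the most-preferred interface among those currently up;
    # unknown interfaces rank last.
    return min(up_interfaces, key=lambda i: PRIORITY.get(i, 99))

# While only WiFi is up, it wins; once Thunderbolt comes up, the next
# call switches over automatically.
print(best_interface(["wifi"]))
print(best_interface(["wifi", "thunderbolt"]))
```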
**Motivation:** Batching multiple inference requests together can speed up inference. Batching can even be leveraged in single-input settings for speedups, e.g. with staged speculative decoding. **What:** Currently, exo handles inference...
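The core mechanic of batching requests is padding variable-length token sequences into one rectangular batch plus a mask. A minimal sketch (hypothetical shapes and pad token; exo's real batching would live inside the inference engine):

```python
# Sketch: combine several token sequences into one padded batch.

PAD = 0  # assumed pad token id

def make_batch(sequences):
    # Right-pad every sequence to the longest length, and keep a mask so
    # padded positions can be ignored (e.g. masked out in attention).
    max_len = max(len(s) for s in sequences)
    batch = [s + [PAD] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return batch, mask

batch, mask = make_batch([[5, 6], [7, 8, 9]])
# batch: [[5, 6, 0], [7, 8, 9]], mask: [[1, 1, 0], [1, 1, 1]]
```

The speedup comes from running one forward pass over the whole batch instead of one pass per request.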
- Should be as simple as changing the endpoint
- https://github.com/paul-gauthier/aider
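"Changing the endpoint" here means pointing an OpenAI-compatible client at the local node instead of api.openai.com; the request body stays the same. The base URL and model name below are placeholders, not documented values:

```python
# Sketch: build a ChatGPT-style chat-completions request aimed at a
# local node. BASE_URL and the model name are hypothetical.
import json

BASE_URL = "http://localhost:8000/v1"  # assumed local endpoint

def chat_request(prompt, model="llama-3-8b"):
    # Same request shape the ChatGPT API expects; only the endpoint changes.
    url = BASE_URL + "/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = chat_request("hello")
```

A tool like aider would then only need its API base URL configured to this address.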