Alex Cheema

Results: 117 issues by Alex Cheema

![IMG_0100](https://github.com/user-attachments/assets/2da40147-a4a2-497e-a351-5f42fcca1f9e)

Not a problem on Macs since they only have 1 GPU. On other machines, we default to tinygrad. Right now we pick the `DEFAULT` device. The desired behaviour needs to...

![IMG_0145](https://github.com/user-attachments/assets/106cfe7e-7f5f-436d-9b55-a3c9da78fcf7)

- Right now we already parallelise model downloads which works great and speeds things up a lot
- However, loading the model into memory can also be slow for large...

- The deliverable here is to be able to run **existing** quantized models with the tinygrad inference engine
- Bonus (+$200) bounty as an easy follow up is to add...

enhancement

This only happens with `BEAM=1`. `BEAM=0`, `BEAM=2`, `BEAM=3` all work fine. This happens because exo runs tinygrad inference on another thread. Example command to reproduce: `DEBUG=6 BEAM=1 python3 main.py --inference-engine...

- This is a follow-up to #148
- In general, model weights on Hugging Face are a bit of a mess because of different implementations in ML libraries. For example,...

- Automatically select the best interface/networking for a given node, e.g. we should prioritise Thunderbolt over WiFi and, when Thunderbolt becomes available, automatically switch over
- More quickly detect when...
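
A minimal sketch of the interface prioritisation described in the first bullet, assuming a simple static ranking. The `Interface` type, the `INTERFACE_PRIORITY` table (including the `ethernet` entry), and `pick_best_interface` are hypothetical names for illustration, not part of exo:

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical ranking: lower number = preferred. Not exo's actual values.
INTERFACE_PRIORITY = {"thunderbolt": 0, "ethernet": 1, "wifi": 2}


@dataclass(frozen=True)
class Interface:
    name: str        # OS-level interface name (illustrative)
    kind: str        # "thunderbolt", "ethernet", "wifi", ...
    available: bool  # whether the link is currently usable


def pick_best_interface(interfaces: Iterable[Interface]) -> Optional[Interface]:
    """Return the available interface with the best (lowest) priority rank."""
    candidates = [i for i in interfaces if i.available]
    if not candidates:
        return None
    # Unknown kinds rank after every known kind.
    fallback = len(INTERFACE_PRIORITY)
    return min(candidates, key=lambda i: INTERFACE_PRIORITY.get(i.kind, fallback))
```

Re-running `pick_best_interface` whenever the set of available links changes would give the "automatically switch over" behaviour: once a Thunderbolt link reports `available=True`, it outranks WiFi on the next call.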

**Motivation:** Batching multiple inference requests together can speed up inference. Batching can even be leveraged with single-input settings for speedups with e.g. staged speculative decoding.

**What:** Currently, exo handles inference...
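
As a rough illustration of the batching idea in the motivation, the sketch below groups pending prompts into fixed-size batches so that each batch could be passed to a model in one forward pass. `batch_requests` and `max_batch_size` are hypothetical names, and this is not how exo currently handles requests:

```python
from typing import Iterator, List, Sequence


def batch_requests(prompts: Sequence[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most max_batch_size prompts."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(prompts), max_batch_size):
        # The final batch may be smaller than max_batch_size.
        yield list(prompts[start:start + max_batch_size])
```

A real implementation would also need to pad or otherwise align sequences of different lengths within a batch; that step is left out here.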

good first issue

![IMG_0116](https://github.com/user-attachments/assets/f7be43e5-964c-4d5f-a566-080ba4bfcc11)

- should be as simple as changing the endpoint
- https://github.com/paul-gauthier/aider