[BOUNTY - $500] Support multiple models running concurrently
- Currently exo supports multiple requests to the same LLM concurrently (after: https://github.com/exo-explore/exo/pull/282)
- However, if you try to request 2 different LLMs concurrently it fails
Hi I would like to work on this
Assigned. Good luck - pls tag me here or on Discord if you have any questions or run into bugs!
Increased bounty to 500 USD as this appears to be harder than anticipated.
No activity for a month. Opening this back up.
@DESU-CLUB, any ideas? What did your research turn up?
Hey sorry was busy with college
While working on this I found a race condition within the inference engines. When I tried reproducing the bug in both the tinygrad and torch engines, I ran into issues such as the following:
I send a request to Model A and Model B, but the response for Model A appears in the chat for Model B and vice versa.
I believe adding semaphores should solve the issue, but I wasn't able to implement it completely bug-free because of an occasional deadlock I was still trying to fix.
Sounds like good progress - reassigned
Hey, update on the issue: with the recent changes, most of the code I wrote last month was invalidated, so I'm working on a new version right now. I'm not sure how much time I can commit to this, so feel free to unassign me if I'm taking too long.
The issue is still the same: a semaphore or lock is needed to manage state at the model level, so that multiple models can run concurrently without causing race conditions.
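The per-model locking idea could be sketched roughly like this (all names here are hypothetical, not exo's actual engine API): one `asyncio.Lock` per model ID, so requests to different models proceed in parallel while requests to the same model serialize, and each response stays paired with the model that produced it.

```python
import asyncio
from collections import defaultdict

# Hypothetical sketch: one lock per model ID, so concurrent requests to
# different models never share mutable engine state, while requests to
# the same model are serialized.
_model_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def infer(model_id: str, prompt: str) -> str:
    # "async with" releases the lock even if the engine call raises,
    # which avoids the deadlocks that manual acquire/release can cause.
    async with _model_locks[model_id]:
        await asyncio.sleep(0.01)  # stand-in for the real engine call
        return f"{model_id}: response to {prompt!r}"

async def main() -> None:
    # Requests to model-a and model-b run concurrently; each response
    # comes back tagged with the model that handled it.
    results = await asyncio.gather(
        infer("model-a", "hello"),
        infer("model-b", "world"),
    )
    print(results)

asyncio.run(main())
```

Whether a plain lock per model is enough, or a semaphore is needed to bound concurrent requests per model, depends on how much in-flight state each engine instance actually holds.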