Add a reliability mechanism
The Hugging Face, vector DB upload, and OpenAI embeddings workers all need a retry mechanism.
The queue system could be leveraged for this, with either a general retry queue at each stage or one for each individual worker.
There should be logic to prevent retries when critical system components are down (like OpenAI's API or a vector DB's host).
I recently added a basic retry mechanism to `worker.py` in this PR. It's a naive implementation of retry, where the system retries a batch up to 3 times by putting it back on the embeddings queue.
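The naive approach can be sketched roughly like this. Note this is an illustration, not the actual PR code: `embed`, `publish`, and `mark_failed` are hypothetical callables standing in for the worker's embedding call, queue publisher, and batch-status update.

```python
MAX_RETRIES = 3

def process_batch(batch, embed, publish, mark_failed):
    """Embed a batch; on failure, put it straight back on the embeddings
    queue until it has been retried MAX_RETRIES times, then mark it FAILED.

    batch: dict carrying its own retry counter under the "retries" key.
    embed / publish / mark_failed: hypothetical callables, for illustration.
    """
    try:
        embed(batch)
    except Exception:
        batch["retries"] = batch.get("retries", 0) + 1
        if batch["retries"] <= MAX_RETRIES:
            # naive retry: requeue on the same (main) embeddings queue
            publish("embeddings", batch)
        else:
            mark_failed(batch)
```

The drawback of this shape is that failed batches compete with fresh work on the main queue and there is no backoff, which is what the retry-queue design below addresses.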
What we need to do
- Create a `retry queue` for each existing queue.
- Create a dead letter queue, aka `dlq`, that holds messages that have already been retried 3 times.
- Create a cron job or scheduled task that a) moves things from the retry queue back to the main queue and b) queries batches that are more than 24 hours old and marks them as FAILED. I think this can run once per hour to start.
- Add logic to `hugging_face/app.py` and `worker/vdb_upload_worker.py` that puts failed batches onto the retry queue. Be selective about where and when you choose to do this. If something fails because a key is missing or a connection URL is wrong, it shouldn't be retried. Retries probably only make sense for very specific types of exceptions.
- Alter the logic in `worker.py` to use the retry queue. If something has been retried the maximum number of times, add logic to put it onto the DLQ.
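The failure-routing described above might look something like this. The queue names, the split of exceptions into retryable and non-retryable, and the `publish`/`mark_failed` callables are all assumptions for the sake of the sketch; the real split should be driven by the exceptions OpenAI, Hugging Face, and the vector DB clients actually raise.

```python
MAX_RETRIES = 3

# Illustrative split: transient errors are worth retrying,
# configuration errors (missing key, bad connection URL) are not.
RETRYABLE = (TimeoutError, ConnectionError)

def handle_failure(batch, exc, publish, mark_failed):
    """Route a failed batch: immediate FAILED, retry queue, or DLQ.

    batch: dict carrying its retry counter under the "retries" key.
    exc: the exception that caused the failure.
    publish / mark_failed: hypothetical callables, for illustration.
    """
    if not isinstance(exc, RETRYABLE):
        # misconfiguration or programmer error: retrying won't help
        mark_failed(batch)
        return
    batch["retries"] = batch.get("retries", 0) + 1
    if batch["retries"] > MAX_RETRIES:
        publish("embeddings_dlq", batch)    # exhausted retries -> DLQ
    else:
        publish("embeddings_retry", batch)  # transient error -> retry queue
```

Keeping this routing in one helper means `worker.py`, `hugging_face/app.py`, and `vdb_upload_worker.py` can share the same policy while publishing to their own queue names.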
Other System Notes
VectorFlow currently has 4 queues:
- extraction - queue holds a pointer to the file that will be turned into batches
- embeddings - queue holds batches; these get turned into chunks and are either embedded with OpenAI embeddings or passed to the Hugging Face model queue for embedding
- Hugging Face model queue - holds chunks for embedding with a Hugging Face sentence transformer model. Here the name of the queue is the name of the model
- vector database upload - holds chunks & vector embeddings that will be uploaded to a vector store