
ENH: Support concurrent embedding, update LangChain QA demo with multithreaded embedding creation

Open jiayini1119 opened this issue 1 year ago • 2 comments

jiayini1119 avatar Aug 14 '23 07:08 jiayini1119

Embedding is a CPU-intensive call. Even for a stateless actor, requests are not executed concurrently, because the event loop stays blocked until the current call finishes. The embedding operation therefore needs to be dispatched with `to_thread` in the model actor. However, when I tried this, embedding turned out not to be thread-safe for llamacpp: calling it concurrently crashes the process with a core dump.
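The idea above can be sketched as follows. This is a minimal, hypothetical illustration, not xinference's actual actor code: `fake_embed` stands in for a real model's embedding function, and a `threading.Lock` serializes calls for backends that are not thread-safe (such as llamacpp, per the core dump reported here), while `asyncio.to_thread` keeps the event loop free during the CPU-bound work.

```python
import asyncio
import threading

def fake_embed(texts):
    # Placeholder for a CPU-intensive embedding computation.
    return [[float(len(t))] for t in texts]

# For backends that are not thread-safe, a lock serializes access
# so concurrent requests cannot crash the process.
_embed_lock = threading.Lock()

def safe_embed(texts):
    with _embed_lock:
        return fake_embed(texts)

async def create_embedding(texts):
    # asyncio.to_thread runs the call in a worker thread, so the
    # actor's event loop can keep serving other requests meanwhile.
    return await asyncio.to_thread(safe_embed, texts)

async def main():
    results = await asyncio.gather(
        create_embedding(["hello"]),
        create_embedding(["world!"]),
    )
    print(results)

asyncio.run(main())
```

For a thread-safe backend the lock could be dropped entirely, letting embeddings truly run in parallel; for llamacpp the lock trades concurrency for safety.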

aresnow1 avatar Aug 25 '23 15:08 aresnow1

We can first try supporting concurrent embedding creation for PyTorch models.
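A rough sketch of what concurrent embedding creation for PyTorch models could look like (all names here are hypothetical; `pytorch_embed` stands in for a real model's forward pass): PyTorch tensor ops release the GIL during heavy computation, so a small thread pool can serve several embedding requests in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def pytorch_embed(text):
    # Stand-in for a real PyTorch embedding forward pass.
    return [float(ord(c)) for c in text]

# A small pool of worker threads; PyTorch releases the GIL inside
# tensor operations, so the workers can make real parallel progress.
_pool = ThreadPoolExecutor(max_workers=4)

def embed_concurrently(texts):
    # Each text is embedded on its own worker thread; map() returns
    # the results in the original input order.
    return list(_pool.map(pytorch_embed, texts))

print(embed_concurrently(["ab", "cd"]))
```

Whether this is safe depends on the backend: it should hold for PyTorch models, while llamacpp would still need serialized access as noted above.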


jiayini1119 avatar Aug 28 '23 04:08 jiayini1119