
Support concurrent requests to a single model instance

Open lexny opened this issue 1 year ago • 2 comments

If you make multiple requests with the same engine without awaiting, you get garbage.

I would like to make multiple concurrent (ideally parallel) requests to the same engine, without loading the same model into memory multiple times.

Even with stream: false, the engine streams tokens internally; concurrent requests interleave those streams, and the engine gets confused.

Reproduction:

    import {CreateMLCEngine} from '@mlc-ai/web-llm';

    (async () => {
      const engine = await CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC")

      const query = async (content) => {
        await engine.chatCompletion({
          stream: false,
          messages: [
            { role: "user", content }
          ],
        });
        console.log(await engine.getMessage())
      }

      query('respond only in json')
      query('tell me something obscure about botany')
    })()

Output:

I In { C' thege_s fieldographybotcience of,any, rare oneoneun phenomen is,commonon the the knowledge
called occurrence observation of ",, plantsFplth exhiblorantsatsit a knownuating specific location location
environment location over for for, time time timeperiod time. times. This.. This occursobserv t, whenationsion
often plant ofan, species species species un of are orcommon a, arelyis that noticed exhib species exhiblyit
init species in a. location a location Loc location location,ale but, thatad usually often. an un an

lexny avatar Aug 03 '24 02:08 lexny

Thanks for reporting this! I'll look into fixing this, perhaps by blocking subsequent chatCompletion() calls until the previous one finishes, maintaining FCFS order. The engine does not currently support continuous batching, so this may be the only way to resolve it for now. That is, while you can call multiple chatCompletion()s concurrently, they have to be executed sequentially to ensure correctness.
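Until such a fix lands, callers can enforce this ordering themselves by funneling every call through a shared promise chain. A minimal sketch in plain JavaScript — makeSerializer and mockEngine are hypothetical names for illustration (the mock stands in for a real MLCEngine; only the chaining logic matters):

```javascript
// Chains tasks so only one runs at a time, in submission (FCFS) order.
function makeSerializer() {
  let tail = Promise.resolve();
  return (task) => {
    const result = tail.then(task, task); // run after the previous call settles
    tail = result.catch(() => {});        // keep the chain alive on errors
    return result;
  };
}

// Mock engine: records start/end so we can observe interleaving (or its absence).
const order = [];
const mockEngine = {
  async chatCompletion({ id, delayMs }) {
    order.push(`start:${id}`);
    await new Promise((r) => setTimeout(r, delayMs));
    order.push(`end:${id}`);
    return id;
  },
};

const enqueue = makeSerializer();

(async () => {
  // Fired concurrently, but executed strictly one after another:
  // "b" does not start until "a" has finished, despite "b" being faster.
  await Promise.all([
    enqueue(() => mockEngine.chatCompletion({ id: "a", delayMs: 20 })),
    enqueue(() => mockEngine.chatCompletion({ id: "b", delayMs: 5 })),
  ]);
  console.log(order.join(",")); // → start:a,end:a,start:b,end:b
})();
```

The same wrapper would serialize real chatCompletion() calls against a single engine without loading the model twice.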

However, if you instantiate multiple engines, two requests can be processed concurrently. We will soon support loading multiple models into a single engine, and the same will apply there.

CharlieFRuan avatar Aug 04 '24 17:08 CharlieFRuan

Thank you.

I don't strictly need this to run in parallel (though that would be nice). The concurrency bug is quite unintuitive, though, and worth fixing.

I did some investigating; this is a rough start, but I think separating out this.outputIds.push per completion could fix it.

https://github.com/LEXNY/web-llm/commit/463d075f85c466350956cc85729c9c69cada6501

Some sort of key to identify the specific request and keep track of its specific outputIds.

ghost avatar Aug 04 '24 18:08 ghost

Hi @LEXNY this should be fixed in https://github.com/mlc-ai/web-llm/pull/549 and reflected in npm 0.2.61. You can check out the PR description for the specifics of the problem and the solution.

Your example now works, though the second request does not start until the first finishes, since we maintain a FCFS schedule with only one request running per model. However, an engine can run multiple models, so multiple requests can run per engine. For more, see examples/multi-models.

CharlieFRuan avatar Aug 19 '24 21:08 CharlieFRuan

Powerhouse!

ghost avatar Aug 19 '24 21:08 ghost

Closing this issue as completed. Feel free to reopen/open new ones if issues arise!

CharlieFRuan avatar Aug 23 '24 17:08 CharlieFRuan