Charlie Ruan
Thanks for reporting this! I'll look into fixing this, perhaps by blocking subsequent `chatCompletion()` calls until the previous one finishes, maintaining FCFS order. Currently the engine does not support continuous batching,...
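Something along these lines could work; this is just a rough sketch (not necessarily what the fix will look like) that chains requests on a promise so they run FCFS, with `FCFSQueue` as an illustrative name:

```ts
// Sketch: serialize chatCompletion() calls so each one starts only after
// the previous request has settled (first-come, first-served).
class FCFSQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    // Start the task once everything enqueued before it has settled.
    const result = this.tail.then(task, task);
    // Keep the chain alive even if this task rejects.
    this.tail = result.catch(() => {});
    return result;
  }
}

// Hypothetical usage inside the engine:
// const queue = new FCFSQueue();
// const reply = await queue.run(() => engine.chatCompletion(request));
```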
Hi @LEXNY this should be fixed in https://github.com/mlc-ai/web-llm/pull/549 and reflected in npm 0.2.61. You can check out the PR description for the specifics of the problem and the solution. Your...
Closing this issue as completed. Feel free to reopen/open new ones if issues arise!
Thanks for the question! IIUC, you are asking about accessing stats in the middle of a streaming generation from the model. I do not exactly understand how the Langchain example...
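For reference, this is roughly how stats can be read around a streaming generation today; it is a sketch based on the README, so treat the exact fields and the model id as assumptions that may differ across versions:

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Example model id; substitute whichever model you are loading.
  const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");

  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    stream: true,
    // Ask for token usage to be reported in the final chunk.
    stream_options: { include_usage: true },
  });

  let reply = "";
  for await (const chunk of chunks) {
    reply += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) {
      console.log(chunk.usage); // per-request token counts / speeds
    }
  }

  // Engine-level runtime stats (e.g. prefill/decode throughput).
  console.log(await engine.runtimeStatsText());
}
```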
Big congrats on the release and glad to see your project gaining traction! Thank you for being active in this community and constantly offering valuable feedback!
Hi @beaufortfrancois! Really appreciate the info and suggestions! We think it is a good idea to have it implemented in the TVM flow. Unfortunately, we are a bit out of...
Thanks for the info! Quick question: do all devices support subgroup ops? Or is it a device-dependent thing? Ideally, we only want to host a single set of WGSL kernels...
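If it is device-dependent, I assume we would need a runtime check along these lines before deciding which kernels to use; the feature name below is taken from the current subgroups proposal, so treat it as an assumption:

```ts
// Sketch: subgroup support is an optional WebGPU feature, so it has to be
// queried on the adapter and requested explicitly on the device.
async function getDeviceWithSubgroups(): Promise<{ device: GPUDevice; hasSubgroups: boolean }> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");

  const hasSubgroups = adapter.features.has("subgroups");
  const device = await adapter.requestDevice({
    requiredFeatures: hasSubgroups ? (["subgroups"] as GPUFeatureName[]) : [],
  });
  return { device, hasSubgroups };
}
```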
I'll try to support shuffle-based reductions in TVM's WebGPU backend this week and next. One possibility is that we compile two sets of kernels for each model, one for...
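Roughly, the runtime selection between the two kernel sets could look like this sketch; the artifact names are made up purely for illustration:

```ts
// Sketch: ship a baseline variant and a subgroup variant of each model's
// compiled WGSL kernels, and pick one at runtime based on adapter support.
async function pickKernelVariant(modelId: string): Promise<string> {
  const adapter = await navigator.gpu.requestAdapter();
  const hasSubgroups = adapter?.features.has("subgroups") ?? false;
  return hasSubgroups
    ? `${modelId}-webgpu-subgroups.wasm` // kernels using subgroup shuffle reductions
    : `${modelId}-webgpu-baseline.wasm`; // kernels using shared-memory reductions
}
```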
Yes, I hope to get a version by the end of this week if everything goes well.
Hi @beaufortfrancois! I was able to get an initial version done in TVM: https://github.com/apache/tvm/pull/17699 The PR description includes what is done and what is not, along with a dump of the compiled kernel. The...