Charlie Ruan
Thanks for reporting this! I'll look into fixing this, perhaps by blocking subsequent `chatCompletion()` calls until the previous one finishes, maintaining FCFS order. Currently the engine does not support continuous batching,...
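Something along these lines could work; this is just a rough sketch (not necessarily what the fix will look like) that chains requests on a promise so they run FCFS, with `FCFSQueue` as an illustrative name:

```ts
// Sketch: serialize chatCompletion() calls so each one starts only after
// the previous request has settled (first-come, first-served).
class FCFSQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    // Start the task once everything enqueued before it has settled.
    const result = this.tail.then(task, task);
    // Keep the chain alive even if this task rejects.
    this.tail = result.catch(() => {});
    return result;
  }
}

// Hypothetical usage inside the engine:
// const queue = new FCFSQueue();
// const reply = await queue.run(() => engine.chatCompletion(request));
```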
Hi @LEXNY this should be fixed in https://github.com/mlc-ai/web-llm/pull/549 and reflected in npm 0.2.61. You can check out the PR description for the specifics of the problem and the solution. Your...
Closing this issue as completed. Feel free to reopen/open new ones if issues arise!
Thanks for the question! IIUC, you are asking about accessing stats in the middle of a streaming generation from the model. I do not exactly understand how the Langchain example...
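For reference, this is roughly how stats can be read around a streaming generation today; it is a sketch based on the README, so treat the exact fields and the model id as assumptions that may differ across versions:

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Example model id; substitute whichever model you are loading.
  const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC");

  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
    stream: true,
    // Ask for token usage to be reported in the final chunk.
    stream_options: { include_usage: true },
  });

  let reply = "";
  for await (const chunk of chunks) {
    reply += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) {
      console.log(chunk.usage); // per-request token counts / speeds
    }
  }

  // Engine-level runtime stats (e.g. prefill/decode throughput).
  console.log(await engine.runtimeStatsText());
}
```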
Big congrats on the release and glad to see your project gaining traction! Thank you for being active in this community and constantly offering valuable feedback!
Hi @beaufortfrancois! Really appreciate the info and suggestions! We think it is a good idea to have it implemented in the TVM flow. Unfortunately, we are a bit out of...
Thanks for the info! Quick question: do all devices support subgroup ops? Or is it a device-dependent thing? Ideally, we only want to host a single set of WGSL kernels...
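If it is device-dependent, I assume we would need a runtime check along these lines before deciding which kernels to use; the feature name below is taken from the current subgroups proposal, so treat it as an assumption:

```ts
// Sketch: subgroup support is an optional WebGPU feature, so it has to be
// queried on the adapter and requested explicitly on the device.
async function getDeviceWithSubgroups(): Promise<{ device: GPUDevice; hasSubgroups: boolean }> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");

  const hasSubgroups = adapter.features.has("subgroups");
  const device = await adapter.requestDevice({
    requiredFeatures: hasSubgroups ? (["subgroups"] as GPUFeatureName[]) : [],
  });
  return { device, hasSubgroups };
}
```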
I'll try to support shuffle-based reductions in TVM's WebGPU backend this week and next. One possibility is that we compile two sets of kernels for each model, one for...
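Roughly, the runtime selection between the two kernel sets could look like this sketch; the artifact names are made up purely for illustration:

```ts
// Sketch: ship a baseline variant and a subgroup variant of each model's
// compiled WGSL kernels, and pick one at runtime based on adapter support.
async function pickKernelVariant(modelId: string): Promise<string> {
  const adapter = await navigator.gpu.requestAdapter();
  const hasSubgroups = adapter?.features.has("subgroups") ?? false;
  return hasSubgroups
    ? `${modelId}-webgpu-subgroups.wasm` // kernels using subgroup shuffle reductions
    : `${modelId}-webgpu-baseline.wasm`; // kernels using shared-memory reductions
}
```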
Yes, I hope to get a version by the end of this week if everything goes well.
Hi @beaufortfrancois! I was able to get an initial version done in TVM: https://github.com/apache/tvm/pull/17699 The PR description includes what is done and what is not, along with a dump of the compiled kernel. The...