Results 171 comments of Charlie Ruan

@cometta Hmm, is there a specific reason for this? We do have APIs to delete the model weights from the cache

Hi, out of curiosity, which version of mlc-llm are you using, what is the length of the context, and which model is it? I remember an **older version** of mlc-llm...

Thanks for raising the issue. @harrywhoo is close to fixing this.

Hi folks, sorry for the delay; it is still in progress. In the meantime, to unblock immediately, it might be helpful to check out the commits listed in this PR and...

Hi, thanks for your interest! You can check out this example for how to use RAG w/ WebLLM: https://github.com/mlc-ai/web-llm/tree/main/examples/embeddings We currently support `snowflake-arctic-embed`.

Hi! Yes, the b4 and b32 wasms have different WebGPU kernels, but share the same weights (hence the same HF URLs). See https://github.com/mlc-ai/web-llm/pull/538 for details: > `b32` means the model...

> Do I need to manually truncate my inputs to be size context_length - max_tokens? If you want to make sure you can decode `max_tokens` number of tokens, then yes....
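The arithmetic above can be sketched as follows. This is a minimal illustration, not mlc-llm's actual truncation logic; `truncate_prompt` is a hypothetical helper and the token counts are made up:

```python
def truncate_prompt(prompt_tokens, context_length, max_tokens):
    """Trim the prompt so that max_tokens can still be decoded.

    context_length - max_tokens is the largest prompt we can afford
    while guaranteeing room for max_tokens generated tokens.
    """
    budget = context_length - max_tokens
    if budget <= 0:
        raise ValueError("max_tokens leaves no room for the prompt")
    # Keep the tail of the prompt, usually the most relevant part.
    return prompt_tokens[-budget:]

# Illustrative numbers: a 4096-token context window with 512 tokens
# reserved for generation leaves room for a 3584-token prompt.
tokens = list(range(5000))
truncated = truncate_prompt(tokens, context_length=4096, max_tokens=512)
print(len(truncated))  # 3584
```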

I was able to reproduce it, should be an issue on [tokenizers-cpp/web](https://github.com/mlc-ai/tokenizers-cpp/tree/main/web), where it does not work with certain `tokenizer.json`. A minimal example to reproduce is to run the following...

My initial guess is that the `padding` field in `tokenizer.json` triggers this issue. It is not present in your original weights: https://huggingface.co/julientfai/Qwen2-0.5B-Instruct-q4f16_1-Opilot/blob/main/tokenizer.json But it is present in both: - My `tokenizer.json`:...
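A quick way to check whether a given `tokenizer.json` carries the `padding` field is sketched below. The JSON fragments are made-up stand-ins for real files (which contain many more fields), and `has_padding` is a hypothetical helper, not part of tokenizers-cpp:

```python
import json

# Made-up fragment mimicking a tokenizer.json that sets padding.
tokenizer_with_padding = json.loads("""
{
  "version": "1.0",
  "padding": {"strategy": "BatchLongest", "pad_id": 0, "pad_token": "<pad>"},
  "truncation": null
}
""")

# Made-up fragment where padding is explicitly null (as in many HF exports).
tokenizer_without_padding = json.loads('{"version": "1.0", "padding": null}')

def has_padding(tok):
    # The field can be absent or explicitly null; both count as "no padding".
    return tok.get("padding") is not None

print(has_padding(tokenizer_with_padding))     # True
print(has_padding(tokenizer_without_padding))  # False
```

Diffing the two files this way makes it easy to confirm whether `padding` is the distinguishing field between a working and a failing `tokenizer.json`.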

Confirmed on my end that the issue is fixed with WebLLM npm 0.2.57. For more details, see https://github.com/mlc-ai/tokenizers-cpp/pull/42