bayley
Digging into the source code, this seems to be intended behavior - the "system" message needs to be at position 0 in the list. I'll dig into the code to...
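If the engine really does require the system message at index 0, a client-side workaround is to reorder the message list before sending the request. A minimal sketch, assuming OpenAI-style message dicts (`normalize_messages` is a hypothetical helper, not part of any library):

```python
def normalize_messages(messages):
    """Move the system message (if any) to index 0, since the engine
    appears to accept role == 'system' only at the front of the list.
    Hypothetical helper for illustration only."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Keep at most one system message at the front, preserve the rest in order.
    return system[:1] + rest
```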
Yeah, here's a typical request sent by SillyTavern: ``` { messages: [ { role: 'system', content: "Write Coding Sensei's next reply in a fictional chat between Coding Sensei and User....
So...I was looking into this the other day as well. The text-generation-webui implementation seems to simply discard all but the last system prompt, which is clearly not right: ```python...
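A more faithful approach would be to merge the system prompts rather than drop them. A sketch of the idea (`merge_system_prompts` is a hypothetical helper, and joining with blank lines is an assumption, not anything the libraries prescribe):

```python
def merge_system_prompts(messages):
    """Concatenate all system messages into a single one at the front,
    preserving their order, instead of discarding all but the last.
    Hypothetical helper for illustration only."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    merged = []
    if system_parts:
        # Join with blank lines; the separator choice is an assumption.
        merged.append({"role": "system", "content": "\n\n".join(system_parts)})
    return merged + rest
```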
Looks like the problem is in MLCEngine - this is a minimal reproducer (using the latest nightlies): ```python from mlc_llm import MLCEngine model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" engine = MLCEngine(model) # Run...
I did confirm the script works on an A4000 RunPod instance, so this is definitely a bug related to pre-SM80 GPUs. I'm happy to help fix (chat works and performs...
Thanks. What exactly do I need to rebuild without flashinfer? I tried explicitly disabling flashinfer (and cutlass) during model lib compilation but it didn't help.
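For reference, disabling FlashInfer at the TVM level (rather than only at model-lib compile time) is typically done by editing TVM's `config.cmake` before rebuilding. The flag names and paths below are assumptions based on the options in mlc-llm's bundled TVM; check `cmake/config.cmake` in your checkout for the exact spelling:

```shell
# Sketch, assuming the config.cmake options used by mlc-llm's bundled TVM.
cd tvm_unity && mkdir -p build && cd build   # path is an assumption
cp ../cmake/config.cmake .
echo 'set(USE_FLASHINFER OFF)' >> config.cmake
echo 'set(USE_CUTLASS OFF)' >> config.cmake
cmake .. && make -j"$(nproc)"
```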
I tried that and it didn't help; I can go back and double-check my build settings to make sure, though. I did use the prebuilt mlc-ai wheel; could that...
I see, thanks - I'll give it a try. Does it make sense to provide prebuilt wheels that are built without flashinfer? It seems like Pascal users especially could benefit (used...
Success: I rebuilt TVM from source following the instructions in the docs (I had to install libzstd-dev through apt), and now MLCEngine works.
When FlashInfer is disabled, what prefill algorithm is used? I noticed a pretty long prompt processing time on Llama-70B and was wondering if it internally used memory-efficient attention (xformers/PyTorch SDPA)...
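For context on the question above: memory-efficient attention (the idea behind xformers' kernels and FlashAttention-style prefill) streams over keys/values in chunks with an online softmax, instead of materializing the full score matrix. A minimal plain-Python sketch of the technique for a single query vector (not MLC's actual implementation):

```python
import math

def naive_attention(q, ks, vs):
    """Reference: softmax over all scores at once (O(n) memory in scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(vs[0])
    return [sum(e * v[d] for e, v in zip(exps, vs)) / z for d in range(dim)]

def chunked_attention(q, ks, vs, chunk=2):
    """Memory-efficient variant: stream over key/value chunks, keeping only
    a running max, normalizer, and weighted sum (the online-softmax trick)."""
    dim = len(vs[0])
    m = float("-inf")   # running max of scores seen so far
    z = 0.0             # running softmax normalizer
    acc = [0.0] * dim   # running weighted sum of values
    for i in range(0, len(ks), chunk):
        for k, v in zip(ks[i:i + chunk], vs[i:i + chunk]):
            s = sum(qi * ki for qi, ki in zip(q, k))
            m_new = max(m, s)
            # Rescale previous accumulators when the running max changes.
            scale = math.exp(m - m_new) if m != float("-inf") else 0.0
            w = math.exp(s - m_new)
            z = z * scale + w
            acc = [a * scale + w * vd for a, vd in zip(acc, v)]
            m = m_new
    return [a / z for a in acc]
```

Both functions compute the same output; the chunked version just never holds all the scores at once, which is why it trades speed for memory on long prompts.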