bayley
Digging into the source code, this seems to be intended behavior - the "system" message needs to be at position 0 in the list. I'll dig into the code to...
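If the engine really does require the system message at index 0, a client-side workaround is to reorder the message list before sending the request. A minimal sketch, assuming OpenAI-style message dicts (`normalize_messages` is a hypothetical helper, not part of any library):

```python
def normalize_messages(messages):
    """Move the system message (if any) to index 0, since the engine
    appears to accept role == 'system' only at the front of the list.
    Hypothetical helper for illustration only."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Keep at most one system message at the front, preserve the rest in order.
    return system[:1] + rest
```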
Yeah, here's a typical request sent by SillyTavern: ``` { messages: [ { role: 'system', content: "Write Coding Sensei's next reply in a fictional chat between Coding Sensei and User....
So...I was looking into this the other day as well. The text-generation-webui implementation seems to simply discard all but the last system prompt, which is clearly not right: ```python...
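A more faithful approach would be to merge the system prompts rather than drop them. A sketch of the idea (`merge_system_prompts` is a hypothetical helper, and joining with blank lines is an assumption, not anything the libraries prescribe):

```python
def merge_system_prompts(messages):
    """Concatenate all system messages into a single one at the front,
    preserving their order, instead of discarding all but the last.
    Hypothetical helper for illustration only."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    merged = []
    if system_parts:
        # Join with blank lines; the separator choice is an assumption.
        merged.append({"role": "system", "content": "\n\n".join(system_parts)})
    return merged + rest
```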
Looks like the problem is in MLCEngine - this is a minimal reproducer (using the latest nightlies): ```python from mlc_llm import MLCEngine model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" engine = MLCEngine(model) # Run...
I did confirm the script works on an A4000 RunPod instance, so this is definitely a bug related to pre-SM80 GPUs. I'm happy to help fix (chat works and performs...
Thanks. What exactly do I need to rebuild without flashinfer? I tried explicitly disabling flashinfer (and cutlass) during model lib compilation but it didn't help.
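For reference, disabling FlashInfer at the TVM level (rather than only at model-lib compile time) is typically done by editing TVM's `config.cmake` before rebuilding. The flag names and paths below are assumptions based on the options in mlc-llm's bundled TVM; check `cmake/config.cmake` in your checkout for the exact spelling:

```shell
# Sketch, assuming the config.cmake options used by mlc-llm's bundled TVM.
cd tvm_unity && mkdir -p build && cd build   # path is an assumption
cp ../cmake/config.cmake .
echo 'set(USE_FLASHINFER OFF)' >> config.cmake
echo 'set(USE_CUTLASS OFF)' >> config.cmake
cmake .. && make -j"$(nproc)"
```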
I tried that and it didn't help; I can go back and double-check my build settings to make sure, though. I did use the prebuilt mlc-ai wheel; could that...
I see, thanks - I'll give it a try. Does it make sense to provide prebuilt wheels that are built without flashinfer? It seems like Pascal users especially could benefit (used...
Success: I rebuilt TVM from source following the instructions in the docs (I had to install libzstd-dev through apt), and now MLCEngine works.
When FlashInfer is disabled, what prefill algorithm is used? I noticed a pretty long prompt processing time on Llama-70B and was wondering if it internally used memory-efficient attention (xformers/PyTorch SDPA)...
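For context on the question above: memory-efficient attention (the idea behind xformers' kernels and FlashAttention-style prefill) streams over keys/values in chunks with an online softmax, instead of materializing the full score matrix. A minimal plain-Python sketch of the technique for a single query vector (not MLC's actual implementation):

```python
import math

def naive_attention(q, ks, vs):
    """Reference: softmax over all scores at once (O(n) memory in scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(vs[0])
    return [sum(e * v[d] for e, v in zip(exps, vs)) / z for d in range(dim)]

def chunked_attention(q, ks, vs, chunk=2):
    """Memory-efficient variant: stream over key/value chunks, keeping only
    a running max, normalizer, and weighted sum (the online-softmax trick)."""
    dim = len(vs[0])
    m = float("-inf")   # running max of scores seen so far
    z = 0.0             # running softmax normalizer
    acc = [0.0] * dim   # running weighted sum of values
    for i in range(0, len(ks), chunk):
        for k, v in zip(ks[i:i + chunk], vs[i:i + chunk]):
            s = sum(qi * ki for qi, ki in zip(q, k))
            m_new = max(m, s)
            # Rescale previous accumulators when the running max changes.
            scale = math.exp(m - m_new) if m != float("-inf") else 0.0
            w = math.exp(s - m_new)
            z = z * scale + w
            acc = [a * scale + w * vd for a, vd in zip(acc, v)]
            m = m_new
    return [a / z for a in acc]
```

Both functions compute the same output; the chunked version just never holds all the scores at once, which is why it trades speed for memory on long prompts.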