mlc_llm refactor (no more mlc_chat)
I've been trying to get some version of Mixtral-8x7B-Instruct-v0.1 running on my 64GB AGX Orin box. The first failure was "model type mixtral not supported", even though mixtral appears in the list of models supported in `mlc_chat`. I've almost gotten this working, and I think I've tracked the confusion down to a major refactor in the mlc_llm codebase just 2 weeks ago, wherein `mlc_chat` simply disappeared, to be replaced by `mlc_llm`.
Hi @rgobbel, it does appear in the `mlc_chat` supported-models list in the `dustynv/mlc:r36.2.0` container; however, it sounds like it's not working. I need to update/rebuild the MLC version and probably update the patches for it too. They have been changing to a new model builder (`mlc_llm.build` vs `mlc_chat`).
I ran it through the stages that had been handled by `mlc_llm.build` by hand, and with a little extra massaging (specifying the chat template on the command line, for example), it's working! It definitely has lower latency than Llama-2-70B, and seems to do at least as well w/r/t content.
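For anyone else hitting this, the by-hand sequence looked roughly like the sketch below. The subcommand and flag names follow the MLC docs from around this commit, and the paths, quantization, and `--conv-template` value are from my own setup, so treat them as approximate:

```bash
# Rough sketch of the by-hand build (paths/quantization/template are my setup;
# subcommand and flag names are from the MLC docs of this period -- verify with --help)
MODEL=dist/models/Mixtral-8x7B-Instruct-v0.1     # HF checkpoint directory
OUT=dist/Mixtral-8x7B-Instruct-v0.1-q4f16_1      # output directory

# 1) quantize/convert the HF weights
mlc_chat convert_weight $MODEL --quantization q4f16_1 -o $OUT

# 2) generate mlc-chat-config.json, specifying the chat template explicitly
mlc_chat gen_config $MODEL --quantization q4f16_1 --conv-template mistral_default -o $OUT

# 3) compile the model library for CUDA
mlc_chat compile $OUT/mlc-chat-config.json --device cuda -o $OUT/Mixtral-8x7B-Instruct-v0.1-q4f16_1-cuda.so
```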
Oh that's great @rgobbel! What kind of tokens/sec do you get out of it? On Llama-2-70B I get a max of ~5 tokens/sec on AGX Orin 64GB
I haven't actually tried to do that measurement as yet. How do you recommend doing it in the Jetson containers?
@rgobbel - it would be like this: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc#benchmarks
That is basically just a wrapper around MLC's benchmark. I believe it should work with both `mlc_llm.build` and `mlc_chat` based models (but if not, there is `mlc_chat bench` for the latter).
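Roughly, the invocation from that README is along these lines (from memory, so treat the script path and flag names as approximate and check the linked page for the exact arguments):

```bash
# approximate usage of the benchmark wrapper in jetson-containers
# (script path and flag names are from memory -- see the README linked above for the exact command)
./run.sh $(./autotag mlc) \
  python3 /opt/mlc-llm/benchmark.py \
    --model /data/models/mlc/dist/Llama-2-7b-chat-hf-q4f16_ft/params \
    --prompt /data/prompts/completion_16.json \
    --max-new-tokens 128
```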
Ok, here's what I got:
| model | quantization | input tokens | output tokens | prefill time (s) | prefill rate (tokens/s) | decode time (s) | decode rate (tokens/s) | memory |
|---|---|---|---|---|---|---|---|---|
| Llama-2-7b-chat-hf-q4f16_ft | q4f16_ft | 16 | 128 | 0.33 | 47.87 | 2.69 | 47.63 | 816.75 |
| Llama-2-70b-chat-hf-q4f16_ft | q4f16_ft | 16 | 128 | 3.20 | 5.00 | 25.60 | 5.00 | 899.19 |
| Mixtral-8x7B-Instruct-v0.1-q4f16_1 | q4f16_1 | 16 | 128 | 1.34 | 11.99 | 6.10 | 21.00 | 1042.19 |
Full results including prompts and outputs: llm-benchmarks.zip
Note: because I was hand-coding a few bits that really should be more automated, just to make sure it worked at all, this was compiled without several possibly important optimizations, including CUDA graph execution, flash attention, and separate embedding. The new API makes things a bit confusing, but I'm working on it.
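For reference, the old builder exposed those as explicit flags (`--use-cuda-graph`, `--use-flash-attn-mqa`, `--sep-embed` on `mlc_llm.build`); in the new flow they should be controlled through `mlc_chat compile --opt`, roughly as sketched below (the `O3` preset exists, but the per-feature spec string is my guess for this commit, so verify against `mlc_chat compile --help`):

```bash
# Recompile with optimizations enabled (output dir is from my setup).
# --opt O3 enables the preset bundle; per-feature toggles reportedly use a spec
# string like "cudagraph=1;flashinfer=1;cublas_gemm=1" -- unverified on this commit.
OUT=dist/Mixtral-8x7B-Instruct-v0.1-q4f16_1
mlc_chat compile $OUT/mlc-chat-config.json \
    --device cuda \
    --opt O3 \
    -o $OUT/Mixtral-8x7B-Instruct-v0.1-q4f16_1-cuda.so
```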
Thanks @rgobbel, that 21 tokens/sec for Mixtral 8x7B looks good and is consistent with what I've heard from other people trying it through MLC. I should add it to the LLM benchmarks on Jetson AI Lab. If you get it running faster, let me know.
I would say just to use my local_llm API, because I wrap up a lot of the model builder and API stuff in MLC there (including transparent support for both `mlc_llm.build` and `mlc_chat`); however, I don't yet have support in it for SWA inference (the sliding-window attention that Mistral uses).
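Running a model through local_llm from a jetson-containers checkout looks roughly like this (treat the module entrypoint and flags as approximate and check the local_llm package README for the exact command):

```bash
# approximate local_llm invocation -- entrypoint/flags are approximate,
# see the local_llm package README in jetson-containers for the exact form
./run.sh $(./autotag local_llm) \
  python3 -m local_llm --api=mlc \
    --model meta-llama/Llama-2-7b-chat-hf
```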
Ok, exactly which local_llm image is that (with mixtral support working correctly)?
The default image (`dustynv/local_llm:r36.2.0`) tries to use `mlc_llm.build`, which then errors out, partly because mixtral is not in its list of supported model types. As I recall, simply hand-patching it to include mixtral didn't work either. I got a working model by modifying `packages/llm/local_llm/models/mlc.py` to call `mlc_chat convert_weight`, `mlc_chat gen_config`, and `mlc_chat compile` with arguments that worked for each of those, but I didn't see a higher-level function that called all of those correctly, so some flags were not set as I'd have liked.
I've tried building `tvm` and `mlc_llm` in various ways, but it always seems to run into one roadblock or another. I'm currently wrestling with this (on a bare-metal build of `mlc_llm`):

```
/usr/local/lib/python3.10/dist-packages/tvm/3rdparty/cutlass_fpA_intB_gemm/cutlass_kernels/../weightOnlyBatchedGemv/kernel.h(362): error: identifier "__hfma2" is undefined
    v = __hfma2(*reinterpret_cast<half2*>(weights_f16 + y), *reinterpret_cast<half2*>(in_v + y), v);
        ^
detected during:
    instantiation of "void tensorrt_llm::kernels::weight_only_batched_gemv<QType,WeightOnlyFlag,ActOp,Zero,Bias,NPerBlock,Batch,BlockSize>(const uint8_t *, const half *, const half *, const half *, const half *, half *, int, int, int) [with QType=tensorrt_llm::kernels::WeightOnlyQuantType::Int8b, WeightOnlyFlag=tensorrt_llm::kernels::WeightOnlyPerChannel, ActOp=tensorrt_llm::kernels::GeluActivation, Zero=true, Bias=true, NPerBlock=2, Batch=3, BlockSize=256]" at line 436
    instantiation of "void tensorrt_llm::kernels::WeightOnlyBatchedGemvKernelLauncher<QType, WeightOnlyFlag, ActOp, Zero, Bias, NPerBlock, Batch, BlockSize>::run(const tensorrt_llm::kernels::WeightOnlyParams &, cudaStream_t) [with QType=tensorrt_llm::kernels::WeightOnlyQuantType::Int8b, WeightOnlyFlag=tensorrt_llm::kernels::WeightOnlyPerChannel, ActOp=tensorrt_llm::kernels::GeluActivation, Zero=true, Bias=true, NPerBlock=2, Batch=3, BlockSize=256]" at line 24 of /usr/local/lib/python3.10/dist-packages/tvm/3rdparty/cutlass_fpA_intB_gemm/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int8b.cu
```
For some reason, it has a hard time finding CUDNN includes and libraries. Anyway, I'd much rather have a Docker image that just works.
@rgobbel either use `dustynv/local_llm:r36.2.0` or `dustynv/mlc:r36.2.0` (which both use MLC commit `607dc5a`), but use the MLC commands/libraries directly instead of my local_llm wrappers (as mentioned above, I don't have support for Mistral/Mixtral and SWA inferencing in that yet).
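i.e. start the prebuilt container and run the MLC tooling inside it:

```bash
# start the prebuilt MLC container (autotag picks the right tag for your JetPack)
./run.sh $(./autotag mlc)
# or pin the image explicitly:
./run.sh dustynv/mlc:r36.2.0

# then run the mlc_llm.build / mlc_chat commands inside the container as usual
```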
I've managed to get both Mistral and Mixtral built, and Mixtral works very well, but for some reason the Mistral models don't work with the web chat agent, even though the "chat" command of `mlc_llm`/`mlc_chat` works fine. The Mistral models (but not Mixtral) wind up missing the `embed` function, and as you mentioned there's no separate embedding support for Mistral.

There's another minor issue with Mistral: there's no way to tell it that there is no sliding window. One feature of Mistral-7b-Instruct-v0.2 is the lack of a sliding window, so this needs to be a parameter that can be passed in.
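A possible stopgap (untested, and both the field name and the meaning of -1 are assumptions on my part) would be to hand-edit the generated mlc-chat-config.json after `gen_config`:

```bash
# Assumptions: the field is named sliding_window_size on this commit and -1 means
# "no sliding window" -- check the generated file first, the name varies by MLC version.
CFG=dist/Mistral-7B-Instruct-v0.2-q4f16_1/mlc-chat-config.json
grep -n sliding "$CFG"     # see what this MLC build actually emits
sed -i 's/"sliding_window_size": *[0-9-]*/"sliding_window_size": -1/' "$CFG"
```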
I'd be happy to submit PRs for any of this if I could manage to get a clean build. The farthest I've gotten with the latest version of mlc_llm runs into the problem compiling `tvm/3rdparty/cutlass_fpA_intB_gemm` as part of the mlc_llm build, as mentioned above. If you have any suggestions about this issue, I'd love to hear them.