Support for Mistral Nemo
https://mistral.ai/news/mistral-nemo/
Will Mistral Nemo models be supported in TensorRT-LLM in the near future?
@byshiue Looking forward to any progress
Hello @byshiue
It seems that the Mistral 7B model is already supported: https://github.com/NVIDIA/TensorRT-LLM/blob/5ddb6bf218ed16a2dcf0058f20c59a247e180fd2/examples/llama/README.md?plain=1#L1072
If the model architecture is the same, would that mean that we can also use existing scripts / code for Mistral-Nemo as well? Or would the model architecture difference require new code changes?
We'd be happy to try it out with the existing scripts. Please let us know.
cc: @AdamzNV @ncomly-nvidia as well.
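For reference, here is roughly what we would try first with the existing LLaMA/Mistral scripts, mirroring the Mistral section of that README (the paths and model directory below are placeholders; whether these commands work unchanged for Nemo is exactly the open question):

python3 examples/llama/convert_checkpoint.py \
    --model_dir ./Mistral-Nemo-Instruct-2407 \
    --output_dir ./tllm_ckpt_nemo \
    --dtype float16

trtllm-build --checkpoint_dir ./tllm_ckpt_nemo \
    --output_dir ./trt_engines/nemo/fp16/1-gpu \
    --gemm_plugin float16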
@byshiue @AdamzNV @ncomly-nvidia Can you help with this? Yesterday I tried to convert and build a Mistral Nemo 12B engine directly with the existing Mistral workflow, but an error occurred during the conversion phase. I used the SmoothQuant conversion path. The conversion script and error log are below. CC: @hongjunchoi92
Convert script:
TensorRT-LLM commit: ab49b937 (using this commit for Llama 3 + RoPE scaling)
tensorrtllm_backend commit: 97feb8f
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${model_path} --output_dir ${convert_model_path} --dtype float16 --smoothquant 0.5 --per_token --per_channel --tp_size 1
Error log:
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 461, in <module>
    main()
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 453, in main
    convert_and_save_hf(args)
  File "/code/./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py", line 339, in convert_and_save_hf
    LLaMAForCausalLM.quantize(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 411, in quantize
    convert.quantize(hf_model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1226, in quantize
    hf_model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3838, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4298, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py", line 362, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 5120]) in "weight" (which has shape torch.Size([1280, 5120])), this look incorrect.
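My guess (not confirmed) is that the mismatch comes from Mistral Nemo setting head_dim explicitly in its config: the checkpoint's k/v projections are built with head_dim = 128, while the conversion code seems to derive head_dim as hidden_size / num_attention_heads = 5120 / 32 = 160. A quick sanity check with the config values I'm assuming from Mistral-Nemo-Instruct-2407:

# assumed values from Mistral-Nemo-Instruct-2407 config.json
hidden_size = 5120
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128  # set explicitly in the config, not derived

derived_head_dim = hidden_size // num_attention_heads    # 160, what the converter appears to assume
print(num_key_value_heads * head_dim)                     # 1024 -> shape stored in the checkpoint
print(num_key_value_heads * derived_head_dim)             # 1280 -> shape the converter expects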
Hello everyone!
Same issue here. Any news about the integration of this model? Is it related to the transformers version and this PR? https://github.com/huggingface/transformers/pull/32050
The logs are the following (pp_size and tp_size set to 1):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 465, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 133, in value
    assert v.shape == self.shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (6144, 5120), original: (7680, 5120)
@nv-guomingz Could you please take a look? Thanks
Hi @eleapttn, we've fixed this issue internally, and the corresponding fix will be pushed to the main branch in the coming weekly update.
Hi @QiJune, @nv-guomingz, Thanks a lot for your quick reply. I can't wait to test it!
This is working in 0.12. Good job! Does anyone have advice or documentation that could help optimize engine builds for Mistral Nemo? I am currently experimenting with fp8 quants on an H100 and finding them to run at about 1/3 the speed of a similar quant of Llama 3.1 8B. I expected Nemo to be somewhat slower, but not that much slower.
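For context, the rough fp8 workflow I'm testing looks like this (paths are placeholders and the flags follow the examples/quantization README; treat this as a sketch rather than a tuned recipe):

python3 examples/quantization/quantize.py \
    --model_dir ./Mistral-Nemo-Instruct-2407 \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --output_dir ./tllm_ckpt_nemo_fp8

trtllm-build --checkpoint_dir ./tllm_ckpt_nemo_fp8 \
    --output_dir ./trt_engines/nemo/fp8/1-gpu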
As more and more new models enter the market, we have prepared comprehensive instructions for TRT-LLM developers on adding support for new models of interest. We encourage our community developers to expand the range of supported models, fostering an open ecosystem with rapid iteration.
Please try following these instructions and let us know if you encounter any issues during the adaptation process. We greatly appreciate your dedication.