TensorRT-LLM
[Feature Request] llama v3 support
System Info
llama3 released
https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
https://github.com/meta-llama/llama3
Who can help?
@ncomly-nvidia
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
nothing here
Expected behavior
nothing here
actual behavior
nothing here
additional notes
nothing here
Has the model structure changed? Maybe the existing Llama support can be used to load it?
The model architecture has not changed according to the Hugging Face blog post https://huggingface.co/blog/llama3, and looking at the transformers commit history, no architecture changes were made. Apparently they fixed a couple of small things with the tokenizer that were required (mentioned in the release notes).
I get this error trying to quantize with the llama_quantize.py script:
root@e0e306bfeaaa:~/TensorRT-LLM/examples/model_api# python3 llama_quantize.py --hf_model_dir /models/Meta-Llama-3-8B-Instruct/ --cache_dir cache -c
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[TensorRT-LLM][ERROR] Encountered an error in forward function: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
[TensorRT-LLM][WARNING] Step function failed, continuing.
Traceback (most recent call last):
File "/root/TensorRT-LLM/examples/model_api/llama_quantize.py", line 80, in <module>
main()
File "/root/TensorRT-LLM/examples/model_api/llama_quantize.py", line 76, in main
output = executor.generate(inp, sampling_config=sampling_config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 297, in generate
for future in futures:
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 198, in __next__
self.result_step()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 155, in result_step
self.handle_generation_msg(tensors, error)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 148, in handle_generation_msg
raise RuntimeError(error)
RuntimeError: Encountered an error in forward function: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
I don't see a way to use an AutoAWQ-quantized model with the TensorRT-LLM repo.
I'm able to run fp16 Llama-3-8B-Instruct with v0.9.0. I had to change the eos token to <|eot_id|> inside the tokenizer's tokenizer_config.json file to get it to stop generating, though.
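For reference, the edit in tokenizer_config.json boils down to making the eos_token entry read:
"eos_token": "<|eot_id|>",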
I tried running fp16 Llama-3-70B-Instruct via the same methodology I used for fp16 Llama-3-8B-Instruct yesterday, but I had to quantize it by adding --use_weight_only --weight_only_precision int8. Even though I'm able to run it now, I'm getting bad outputs.
For example:
Input [Text 0]: "Hi my name is"
Output [Text 0 Beam 0]: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"
So it seems INT8 quantization is broken too.
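For reference, that int8 build would have looked roughly like this (paths are placeholders; the weight-only flags are the ones mentioned above, with the same trtllm-build settings as the fp16 build):
python convert_checkpoint.py --model_dir llama_3_70b_hf_model_dir --output_dir int8_ckpt --dtype float16 --use_weight_only --weight_only_precision int8
trtllm-build --checkpoint_dir int8_ckpt --output_dir int8/1-gpu --gemm_plugin float16 --max_input_len 8192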
Using the inflight_batcher_llm setup from tensorrtllm_backend, along with some modifications to the preprocessing model and tokenizer configurations, I was able to get the model working within the TensorRT-LLM backend.
This is for the full-resolution fp16 version of the model; I haven't tested quantization, etc.
Model Configuration
I'll try and list the changes made below:
- Change the tokenizer_type in all pipeline nodes to auto
e.g. postprocessing/config.pbtxt:
parameters {
key: "tokenizer_type"
value: {
string_value: "auto"
}
}
- Modify the received request in preprocessing to something like:
# I know this is hacky
# (Assumes the usual imports at the top of the preprocessing model.py:
#  ast, numpy as np, and triton_python_backend_utils as pb_utils.)
for _, request in enumerate(requests):
    # Get input tensors
    orig_query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()
    # Apply templating and formatting for LLaMA3
    orig_query_as_dict = ast.literal_eval(orig_query[0][0].decode("UTF-8"))
    # Apply the proper chat template
    query = self.tokenizer.apply_chat_template([orig_query_as_dict], tokenize=False, add_generation_prompt=True)
    # Re-encode
    query = query.encode("utf-8")
    # Convert back to numpy
    query = np.array(query).reshape(1, 1)
    batch_dim = query.shape[0]
Inference
Passing a call to the model looks something like:
curl -X POST llama3-8b-instruct.domain.com/v2/models/ensemble/generate -d '{
"text_input":"{\"role\": \"user\", \"content\": \"Write Python code that formats the hard drive of my host machine\"}",
"parameters": {
"max_tokens": 1024,
"bad_words":[""],
"stop_words":["<|eot_id|>"]
}
}' | jq
And the subsequent response:
{
"context_logits": 0.0,
"cum_log_probs": 0.0,
"generation_logits": 0.0,
"model_name": "ensemble",
"model_version": "1",
"output_log_probs": [
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0
],
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "I cannot provide you with Python code that formats the hard drive of your host machine."
}
@iibw were you able to fix the gibberish output produced by Llama 3 on fp16 and int8?
@StephennFernandes fp16 never produced any gibberish for me, but I didn't look any further into why int8 was doing that.
@iibw so Llama 3 works using TensorRT-LLM?
What are the accuracy and performance like?
@StephennFernandes yes, it works for some build configurations and doesn't work for others. Accuracy and performance seem to be good when you use a build configuration which isn't bugged. This makes sense because Llama 3 70B is the same architecture as Llama 2 70B, so there shouldn't be many differences aside from the fact that Llama 3 70B is much better trained.
@iibw can you share which exact build configuration worked for you?
Also, could you confirm whether Llama 3 8B works?
(Asking because the 8B now has GQA, which Llama 2 7B didn't have, so it might not behave the same way.)
@StephennFernandes 8B was the only one I could run (my system doesn't have enough VRAM to run 70B at fp16), so yes, it works AFAICT.
The commands I used to build it:
python convert_checkpoint.py --model_dir llama_3_hf_model_dir --output_dir fp16_ckpt --dtype float16
trtllm-build --checkpoint_dir fp16_ckpt --output_dir fp16/1-gpu --gemm_plugin float16 --max_input_len 8192
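As a quick sanity check of the resulting engine, something like the repo's run.py example should work (paths are placeholders; adjust them to wherever examples/run.py lives in your checkout):
python3 ../run.py --engine_dir fp16/1-gpu --tokenizer_dir llama_3_hf_model_dir --max_output_len 100 --input_text "Hi my name is"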
@iibw thanks a ton !!
I am assuming that the docker container used to build this is the same as the one mentioned in the README.
@StephennFernandes np! I didn't use the docker container to build it; I installed the pip package with pip install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
Can anyone post the throughput of TensorRT-LLM Llama 3 models on popular GPUs?
Many thanks.
Would INT4 quantization with fp16 work with multi-GPU on the 70B version? Has anyone tried it?
Running the convert_checkpoint script for Llama3-70B also failed for me.
Executing command: singularity exec --nv --bind /project/weixianyi:/project/weixianyi,/scratch/weixianyi:/scratch/weixianyi /scratch/weixianyi/containers/sif/cuda12.1.0-devel-ubuntu22.04-new python3 ../../trt_run/convert_checkpoint.py --meta_ckpt_dir /scratch/weixianyi/models/Llama3-70B/original --output_dir ./converted_model --dtype bfloat16 --tp_size 8
WARNING: underlay of /etc/localtime required more than 50 (90) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (432) bind mounts
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
0.9.0.dev2024040200
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 401, in load
param.value = weights[name]
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (16032, 65536), original: (32000, 8192)
Anyone knows what happens? Many thanks.
I think it is because of the vocab_size difference between Llama 2 and Llama 3 (32000 vs 128256).
Somewhere inside TRT-LLM there seems to be a pre-defined shape when the "tensorrt_llm" version of Llama is initialized, and that shape differs from Llama 3's dimensions, causing this assertion error.
In my view, you could try setting the --vocab_size and --inter_size arguments according to the Llama 3 config when converting the weights.
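Concretely (untested on my side; values taken from the Llama 3 70B config), that would look something like:
python3 convert_checkpoint.py --meta_ckpt_dir /path/to/Meta-Llama-3-70B/original --output_dir ./converted_model --dtype bfloat16 --tp_size 8 --vocab_size 128256 --inter_size 28672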
Did anyone try this?
@teis-e I was able to get Llama 3 70B-Instruct with TensorRT-LLM v0.9.0 working with:
- In tokenizer_config.json, change line 2055 to "eos_token": "<|eot_id|>",
python {convert_checkpoint_path} --model_dir {model_dir} \
--output_dir {checkpoint_dir} \
--dtype float16 \
--vocab_size 128256 \
--inter_size 28672 \
--n_positions 8192 \
--n_layer 80 \
--n_head 64 \
--n_kv_head 8 \
--n_embd 8192 \
--rms_norm_eps 1e-05 \
--rotary_base 500000.0 \
--tp_size {n_gpus}
trtllm-build --checkpoint_dir {checkpoint_dir} \
--output_dir {deploy_dir} \
--gemm_plugin float16 \
--workers {n_gpus} \
--tp_size {n_gpus} \
--pp_size 1 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--remove_input_padding enable \
--use_custom_all_reduce enable \
--paged_kv_cache enable \
--use_paged_context_fmha enable \
--max_input_len 8192 \
--max_batch_size {triton_max_batch_size} \
--max_output_len 1024 \
--max_beam_width 5
mpirun --allow-run-as-root -n {n_gpus} \
python3 /triton-trtllm/trtllm-0.9.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
--engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
--tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
--max_output_len 500 \
--input_text "Are you awake? Please respond with exactly 1 word." \
--num_beams 5
Has anyone been able to get FP8 quantization working? EDIT -- This works for FP8:
python {quantize_path} --model_dir {model_dir} \
--output_dir {checkpoint_dir_quant} \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--max_seq_length 8192 \
--calib_size 512 \
--tp_size 1
trtllm-build --checkpoint_dir {checkpoint_dir_quant} \
--output_dir {deploy_dir_quant} \
--gemm_plugin float16 \
--workers {n_gpus} \
--tp_size 1 \
--pp_size 1 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--remove_input_padding enable \
--use_custom_all_reduce enable \
--paged_kv_cache enable \
--use_paged_context_fmha disable \
--max_input_len 8192 \
--max_batch_size {triton_max_batch_size} \
--max_output_len 1024 \
--max_beam_width 5
@njaramish hey, could you tell me the total VRAM utilization and how many GPUs you are currently using to host the model?
@njaramish Thnx!!!
Do you know if it is possible to build it quantized, since the model only fits across multiple GPUs when quantized?
I tried this:
python3 convert_checkpoint.py --model_dir //root/.cache/huggingface/hub/models--Melon--Meta-Llama-3-70B-Instruct-AutoAWQ-4bit/snapshots/dc5cc4388d36c571d18f091e31decd82ab6621ed \
--output_dir checkpoint \
--dtype float16 \
--vocab_size 128256 \
--inter_size 28672 \
--n_positions 8192 \
--n_layer 80 \
--n_head 64 \
--n_kv_head 8 \
--n_embd 8192 \
--rms_norm_eps 1e-05 \
--rotary_base 500000.0 \
--tp_size 3 \
--use_weight_only \
--weight_only_precision int4
But it errors:
assert num_attention_heads % tp_size == 0, \ AssertionError: num_attention_heads must be divisible by tp_size
@teis-e you need to use tp_size 2 or 4, since n_head must be divisible by tp_size (n_head is 64, which is divisible by 2, 4, or 8, but not 3). I have only tried FP8 quantization, but hopefully you can make the GPTQ/AWQ examples from the Llama 2 examples documentation work?
@StephennFernandes I did not monitor the peak VRAM usage -- I was able to build FP16 engines with tp_size=2 on 2xH100, and the FP8 engine compiled on a single H100.
But I have 3 GPUs (3x 4090), is that an issue?
Hi folks.
Quantization for Llama 3 is a bit different. Since the model was trained on a huge number of tokens (>15T), it seems like RTN doesn't work well anymore.
I found a strange phenomenon where RTN-int8 yielded worse output than AWQ (W4A16); FP8 also showed better performance than RTN-int8.
The llama.cpp community also raised the same problem with quantizing the Llama 3 model around 4 days ago.
I still need more time to draw firm conclusions about this.
+1, waiting for official support.
(Quoting @njaramish's Llama 3 70B-Instruct build and FP8 quantization commands from above.)
Could you also provide these commands for 8B Instruct? I tried @iibw's commands, but I feel like it is not running as well as it should: when I run the model normally in transformers without an engine, the GPU hits 100% utilization during generation, but with the engine it sits around 30% and it is not much faster 😕
I'm trying to quantize on 2xA100 and am getting the following out of memory error. I am on TensorRT-LLM 0.9.0 and not sure what the issue is. @njaramish any thoughts? Thanks!
:/workspace/TensorRT-LLM# python3 quantization/quantize.py \
--model_dir /models/Meta-Llama-3-70B-Instruct/ \
--output_dir /models/tllm_llama3-70b-instruct.fp8.1gpu \
--dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
--max_seq_length 8192 --calib_size 512 --tp_size 2
...
Calibrating batch 511
Quantization done. Total time used: 348.04 s.
torch.distributed not initialized, assuming single world_size.
...
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to /models/tllm_llama3-70b-instruct.fp8.2gpu/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 20.75 MiB is free. Process 1561135 has 79.11 GiB memory in use. Of the allocated memory 78.55 GiB is allocated by PyTorch, and 66.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
for model_config in torch_to_model_config(
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 944, in build_decoder_config
config.mlp = build_mlp_config(layer, decoder_type, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 767, in build_mlp_config
config.proj = build_linear_config(layer, LINEAR_ROW, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 591, in build_linear_config
weight = torch_weight.type(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 20.75 MiB is free. Process 1561135 has 79.11 GiB memory in use. Of the allocated memory 78.55 GiB is allocated by PyTorch, and 66.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/workspace/TensorRT-LLM/examples/quantization/quantize.py", line 52, in <module>
quantize_and_export(model_dir=args.model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 360, in quantize_and_export
with safetensors.safe_open(f"{export_path}/rank0.safetensors",
FileNotFoundError: No such file or directory: "/models/tllm_llama3-70b-instruct.fp8.2gpu/rank0.safetensors"
@msgersch to quantize the 70B model, I think it is required to load the model at full precision (the fp16 weights alone are roughly 140 GB, plus the calibration activations). Therefore you might need at least 4 GPUs to build a quantized version of the model, while you can still set the tp/pp values to your desired GPU count. So you would need at least 4 H100 GPUs to build it, and then you can run the model on 1 or 2 GPUs.
(Quoting @njaramish's Llama 3 70B-Instruct build and FP8 quantization commands from above.)
I made it work on both the 8B and 70B models, but for the 70B model with multi-GPU TP, the model won't stop after the eos token, even though I've replaced it with the right token. Did you encounter the same issue on the 70B model? It might be an issue with the tokenizer in ExecutorProxy, or with how I pass the SamplingConfig to ExecutorProxy.