TensorRT-LLM
[Feature Request] llama v3 support
System Info
llama3 released
https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
https://github.com/meta-llama/llama3
Who can help?
@ncomly-nvidia
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
nothing here
Expected behavior
nothing here
actual behavior
nothing here
additional notes
nothing here
Has the model structure changed? Maybe the existing Llama support can be used to load it?
The model architecture has not changed according to the Hugging Face blog post https://huggingface.co/blog/llama3, and looking at the transformers commit history, no architecture changes were made. Apparently they fixed a couple of small things with the tokenizer that were required (mentioned in the release notes).
I get this error trying to quantize with the llama_quantize.py script:
root@e0e306bfeaaa:~/TensorRT-LLM/examples/model_api# python3 llama_quantize.py --hf_model_dir /models/Meta-Llama-3-8B-Instruct/ --cache_dir cache -c
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
[TensorRT-LLM][ERROR] Encountered an error in forward function: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
[TensorRT-LLM][WARNING] Step function failed, continuing.
Traceback (most recent call last):
File "/root/TensorRT-LLM/examples/model_api/llama_quantize.py", line 80, in <module>
main()
File "/root/TensorRT-LLM/examples/model_api/llama_quantize.py", line 76, in main
output = executor.generate(inp, sampling_config=sampling_config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 297, in generate
for future in futures:
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 198, in __next__
self.result_step()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 155, in result_step
self.handle_generation_msg(tensors, error)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/executor.py", line 148, in handle_generation_msg
raise RuntimeError(error)
RuntimeError: Encountered an error in forward function: vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
I don't see a way to use an AutoAWQ-quantized model with the TensorRT-LLM repo.
I'm able to run fp16 Llama-3-8B-Instruct with v0.9.0. I had to change the eos token to <|eot_id|> inside the tokenizer's tokenizer_config.json file to get it to stop generating, though.
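For reference, the edit in tokenizer_config.json boils down to making the eos_token entry read:
"eos_token": "<|eot_id|>",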
I tried running fp16 Llama-3-70B-Instruct via the same methodology I used for fp16 Llama-3-8B-Instruct yesterday, but I had to quantize it by adding --use_weight_only --weight_only_precision int8. Even though I'm able to run it now, I'm getting bad outputs.
For example:
Input [Text 0]: "Hi my name is"
Output [Text 0 Beam 0]: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"
So it seems INT8 quantization is broken too.
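For reference, that int8 build would have looked roughly like this (paths are placeholders; the weight-only flags are the ones mentioned above, with the same trtllm-build settings as the fp16 build):
python convert_checkpoint.py --model_dir llama_3_70b_hf_model_dir --output_dir int8_ckpt --dtype float16 --use_weight_only --weight_only_precision int8
trtllm-build --checkpoint_dir int8_ckpt --output_dir int8/1-gpu --gemm_plugin float16 --max_input_len 8192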
Using the inflight_batcher_llm setup from tensorrtllm_backend, along with some modifications to the preprocessing model and tokenizer configurations, I was able to get the model working within the TensorRT-LLM backend.
This is for the full-resolution fp16 version of the model; I haven't tested quantization, etc.
Model Configuration
I'll try and list the changes made below:
- Change the tokenizer_type in all pipeline nodes to auto
e.g. postprocessing/config.pbtxt:
parameters {
key: "tokenizer_type"
value: {
string_value: "auto"
}
}
- Modify the received request in preprocessing to something like:
# I know this is hacky
# (Assumes the usual imports at the top of the preprocessing model.py:
#  ast, numpy as np, and triton_python_backend_utils as pb_utils.)
for _, request in enumerate(requests):
    # Get input tensors
    orig_query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()
    # Apply templating and formatting for LLaMA3
    orig_query_as_dict = ast.literal_eval(orig_query[0][0].decode("UTF-8"))
    # Apply the proper chat template
    query = self.tokenizer.apply_chat_template([orig_query_as_dict], tokenize=False, add_generation_prompt=True)
    # Re-encode
    query = query.encode("utf-8")
    # Convert back to numpy
    query = np.array(query).reshape(1, 1)
    batch_dim = query.shape[0]
Inference
Passing a call to the model looks something like:
curl -X POST llama3-8b-instruct.domain.com/v2/models/ensemble/generate -d '{
"text_input":"{\"role\": \"user\", \"content\": \"Write Python code that formats the hard drive of my host machine\"}",
"parameters": {
"max_tokens": 1024,
"bad_words":[""],
"stop_words":["<|eot_id|>"]
}
}' | jq
And the subsequent response:
{
"context_logits": 0.0,
"cum_log_probs": 0.0,
"generation_logits": 0.0,
"model_name": "ensemble",
"model_version": "1",
"output_log_probs": [
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0
],
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": "I cannot provide you with Python code that formats the hard drive of your host machine."
}
@iibw were you able to fix the gibberish output produced by Llama 3 on fp16 and int8?
@StephennFernandes fp16 never produced any gibberish for me, but I didn't look any further into why int8 was doing that.
@iibw so Llama 3 works using TensorRT-LLM?
What are the accuracy and performance like?
@StephennFernandes yes, it works for some build configurations and doesn't work for others. Accuracy and performance seem to be good when you use a build configuration which isn't bugged. This makes sense because Llama 3 70B is the same architecture as Llama 2 70B, so there shouldn't be many differences aside from the fact that Llama 3 70B is much better trained.
@iibw can you share which exact build configuration worked for you?
Also, could you confirm whether Llama 3 8B works?
(Asking because the 8B now has GQA, which Llama 2 7B didn't have, so it might not behave the same way.)
@StephennFernandes 8B was the only one I could run (my system doesn't have enough VRAM to run 70B at fp16), so yes, it works AFAICT.
The commands I used to build it:
python convert_checkpoint.py --model_dir llama_3_hf_model_dir --output_dir fp16_ckpt --dtype float16
trtllm-build --checkpoint_dir fp16_ckpt --output_dir fp16/1-gpu --gemm_plugin float16 --max_input_len 8192
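As a quick sanity check of the resulting engine, something like the repo's run.py example should work (paths are placeholders; adjust them to wherever examples/run.py lives in your checkout):
python3 ../run.py --engine_dir fp16/1-gpu --tokenizer_dir llama_3_hf_model_dir --max_output_len 100 --input_text "Hi my name is"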
@iibw thanks a ton !!
I am assuming that the docker container used to build this is the same as the one mentioned in the README.
@StephennFernandes np! I didn't use the docker container to build it; I installed the pip package with pip install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
Can anyone post the throughput of TensorRT-LLM Llama 3 models on popular GPUs?
Many thanks.
Would INT4 quantization with fp16 work with multi-GPU on the 70B version? Has anyone tried it?
Running the convert_checkpoint script for Llama3-70B also failed for me.
Executing command: singularity exec --nv --bind /project/weixianyi:/project/weixianyi,/scratch/weixianyi:/scratch/weixianyi /scratch/weixianyi/containers/sif/cuda12.1.0-devel-ubuntu22.04-new python3 ../../trt_run/convert_checkpoint.py --meta_ckpt_dir /scratch/weixianyi/models/Llama3-70B/original --output_dir ./converted_model --dtype bfloat16 --tp_size 8
WARNING: underlay of /etc/localtime required more than 50 (90) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (432) bind mounts
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
0.9.0.dev2024040200
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 401, in load
param.value = weights[name]
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (16032, 65536), original: (32000, 8192)
Anyone knows what happens? Many thanks.
I think it is because of the vocab_size difference between Llama 2 and Llama 3 (32000 vs 128256).
Somewhere inside TRT-LLM there seems to be a pre-defined shape when the "tensorrt_llm" version of Llama is initialized, and that shape differs from Llama 3's dimensions, causing this assertion error.
In my view, you could try setting the --vocab_size and --inter_size arguments according to the Llama 3 config when converting the weights.
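Concretely (untested on my side; values taken from the Llama 3 70B config), that would look something like:
python3 convert_checkpoint.py --meta_ckpt_dir /path/to/Meta-Llama-3-70B/original --output_dir ./converted_model --dtype bfloat16 --tp_size 8 --vocab_size 128256 --inter_size 28672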
Did anyone try this?
@teis-e I was able to get Llama 3 70B-Instruct with TensorRT-LLM v0.9.0 working with:
- In tokenizer_config.json, change line 2055 to "eos_token": "<|eot_id|>",
python {convert_checkpoint_path} --model_dir {model_dir} \
--output_dir {checkpoint_dir} \
--dtype float16 \
--vocab_size 128256 \
--inter_size 28672 \
--n_positions 8192 \
--n_layer 80 \
--n_head 64 \
--n_kv_head 8 \
--n_embd 8192 \
--rms_norm_eps 1e-05 \
--rotary_base 500000.0 \
--tp_size {n_gpus}
trtllm-build --checkpoint_dir {checkpoint_dir} \
--output_dir {deploy_dir} \
--gemm_plugin float16 \
--workers {n_gpus} \
--tp_size {n_gpus} \
--pp_size 1 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--remove_input_padding enable \
--use_custom_all_reduce enable \
--paged_kv_cache enable \
--use_paged_context_fmha enable \
--max_input_len 8192 \
--max_batch_size {triton_max_batch_size} \
--max_output_len 1024 \
--max_beam_width 5
mpirun --allow-run-as-root -n {n_gpus} \
python3 /triton-trtllm/trtllm-0.9.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
--engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
--tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
--max_output_len 500 \
--input_text "Are you awake? Please respond with exactly 1 word." \
--num_beams 5
Has anyone been able to get FP8 quantization working? EDIT -- This works for FP8:
python {quantize_path} --model_dir {model_dir} \
--output_dir {checkpoint_dir_quant} \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--max_seq_length 8192 \
--calib_size 512 \
--tp_size 1
trtllm-build --checkpoint_dir {checkpoint_dir_quant} \
--output_dir {deploy_dir_quant} \
--gemm_plugin float16 \
--workers {n_gpus} \
--tp_size 1 \
--pp_size 1 \
--gpt_attention_plugin float16 \
--context_fmha enable \
--remove_input_padding enable \
--use_custom_all_reduce enable \
--paged_kv_cache enable \
--use_paged_context_fmha disable \
--max_input_len 8192 \
--max_batch_size {triton_max_batch_size} \
--max_output_len 1024 \
--max_beam_width 5
@njaramish hey, could you tell me the total VRAM utilization and how many GPUs you are currently using to host the model?
@njaramish Thnx!!!
Do you know if it is possible to build it quantized, since the model only fits across multiple GPUs when quantized?
I tried this:
python3 convert_checkpoint.py --model_dir //root/.cache/huggingface/hub/models--Melon--Meta-Llama-3-70B-Instruct-AutoAWQ-4bit/snapshots/dc5cc4388d36c571d18f091e31decd82ab6621ed \
--output_dir checkpoint \
--dtype float16 \
--vocab_size 128256 \
--inter_size 28672 \
--n_positions 8192 \
--n_layer 80 \
--n_head 64 \
--n_kv_head 8 \
--n_embd 8192 \
--rms_norm_eps 1e-05 \
--rotary_base 500000.0 \
--tp_size 3 \
--use_weight_only \
--weight_only_precision int4
But it errors:
assert num_attention_heads % tp_size == 0, \ AssertionError: num_attention_heads must be divisible by tp_size
@teis-e you need to use tp_size 2 or 4, since n_head must be divisible by tp_size (n_head is 64, which is divisible by 2, 4, or 8, but not 3). I have only tried FP8 quantization, but hopefully you can make the GPTQ/AWQ examples from the Llama 2 examples documentation work?
@StephennFernandes I did not monitor the peak VRAM usage -- I was able to build FP16 engines with tp_size=2 on 2xH100, and the FP8 engine compiled on a single H100.
But I have 3 GPUs (3x 4090), is that an issue?
Hi folks.
Quantization for Llama 3 is a bit different. Since the model was trained on a huge number of tokens (>15T), it seems like RTN doesn't work well anymore.
I found a strange phenomenon where RTN-int8 yielded worse output than AWQ (W4A16); FP8 also showed better performance than RTN-int8.
The llama.cpp community also raised the same problem with quantizing the Llama 3 model around 4 days ago.
I still need more time to draw firm conclusions about this.
+1, waiting for official support.
(Quoting @njaramish's Llama 3 70B-Instruct build and FP8 quantization commands from above.)
Could you also provide these commands for 8B Instruct? I tried @iibw's commands, but I feel like it is not running as well as it should: when I run the model normally in transformers without an engine, the GPU hits 100% utilization during generation, but with the engine it sits around 30% and it is not much faster 😕
I'm trying to quantize on 2xA100 and am getting the following out of memory error. I am on TensorRT-LLM 0.9.0 and not sure what the issue is. @njaramish any thoughts? Thanks!
:/workspace/TensorRT-LLM# python3 quantization/quantize.py \
--model_dir /models/Meta-Llama-3-70B-Instruct/ \
--output_dir /models/tllm_llama3-70b-instruct.fp8.1gpu \
--dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
--max_seq_length 8192 --calib_size 512 --tp_size 2
...
Calibrating batch 511
Quantization done. Total time used: 348.04 s.
torch.distributed not initialized, assuming single world_size.
...
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to /models/tllm_llama3-70b-instruct.fp8.2gpu/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 20.75 MiB is free. Process 1561135 has 79.11 GiB memory in use. Of the allocated memory 78.55 GiB is allocated by PyTorch, and 66.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
for model_config in torch_to_model_config(
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 944, in build_decoder_config
config.mlp = build_mlp_config(layer, decoder_type, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 767, in build_mlp_config
config.proj = build_linear_config(layer, LINEAR_ROW, dtype)
File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 591, in build_linear_config
weight = torch_weight.type(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 20.75 MiB is free. Process 1561135 has 79.11 GiB memory in use. Of the allocated memory 78.55 GiB is allocated by PyTorch, and 66.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/workspace/TensorRT-LLM/examples/quantization/quantize.py", line 52, in <module>
quantize_and_export(model_dir=args.model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 360, in quantize_and_export
with safetensors.safe_open(f"{export_path}/rank0.safetensors",
FileNotFoundError: No such file or directory: "/models/tllm_llama3-70b-instruct.fp8.2gpu/rank0.safetensors"
@msgersch to quantize the 70B model, I think it is required to load the model at full precision (the fp16 weights alone are roughly 140 GB, plus the calibration activations). Therefore you might need at least 4 GPUs to build a quantized version of the model, while you can still set the tp/pp values to your desired GPU count. So you would need at least 4 H100 GPUs to build it, and then you can run the model on 1 or 2 GPUs.
(Quoting @njaramish's Llama 3 70B-Instruct build and FP8 quantization commands from above.)
I made it work on both the 8B and 70B models, but for the 70B model with multi-GPU TP, the model won't stop after the eos token, even though I've replaced it with the right token. Did you encounter the same issue on the 70B model? It might be an issue with the tokenizer in ExecutorProxy, or with how I pass the SamplingConfig to ExecutorProxy.