Model conversion process failed when deploying Mixtral 8x22B AWQ with djl-tensorrtllm to SageMaker
Description
The model conversion process fails with djl-tensorrtllm using the image configuration and serving.properties below:
image_uri = image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=sess.boto_session.region_name,
    version="0.28.0"
)
%%writefile serving.properties
engine=MPI
option.model_id=MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ
option.tensor_parallel_degree=4
option.quantize=awq
option.max_num_tokens=8192
option.max_rolling_batch_size=8
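For reference, the configuration above was packaged and deployed along the lines of the standard LMI flow. A minimal sketch (the artifact name, S3 prefix, and instance type here are placeholders, not the exact values used):

import tarfile
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Bundle serving.properties into the model artifact the LMI container expects.
with tarfile.open("mymodel.tar.gz", "w:gz") as tar:
    tar.add("serving.properties", arcname="serving.properties")

code_artifact = sess.upload_data("mymodel.tar.gz", sess.default_bucket(), "lmi-mixtral-awq")

# image_uri comes from the image_uris.retrieve() call above
model = sagemaker.Model(image_uri=image_uri, model_data=code_artifact, role=role)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",  # placeholder; needs >= 4 GPUs for tensor_parallel_degree=4
    container_startup_health_check_timeout=1800,
)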
Expected Behavior
The model should be converted to TensorRT-LLM artifacts and the endpoint should deploy successfully.
Error Message
| 1721194930489 | [INFO ] LmiUtils - Detected mpi_mode: true, rolling_batch: trtllm, tensor_parallel_degree 4, for modelType: mixtral |
| 1721194930489 | [INFO ] ModelInfo - M-0001: Apply per model settings: job_queue_size: 1000 max_dynamic_batch_size: 1 max_batch_delay: 100 max_idle_time: 60 load_on_devices: * engine: MPI mpi_mode: true option.entryPoint: null option.tensor_parallel_degree: 4 option.max_rolling_batch_size: 8 option.quantize: awq option.mpi_mode: true option.max_num_tokens: 8192 option.model_id: MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ option.rolling_batch: trtllm |
| 1721194933027 | [INFO ] LmiUtils - Converting model to TensorRT-LLM artifacts |
| 1721194933027 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] PyTorch version 2.2.1 available. |
| 1721194933493 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] JAX version 0.4.30 available. |
| 1721194933493 | [INFO ] LmiUtils - convert_py: [TensorRT-LLM] TensorRT-LLM version: 0.9.0 |
| 1721194933493 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Received kwargs for tensorrt_llm_toolkit.create_model_repo: dict_items([('engine', 'MPI'), ('model_id', 'MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ'), ('tensor_parallel_degree', 4), ('quantize', 'awq'), ('max_num_tokens', '8192'), ('max_rolling_batch_size', '8'), ('trt_llm_model_repo', '/tmp/.djl.ai/trtllm/c1e40db56ea23fb1ec359dff353cdb9a752a827c')]) |
| 1721194933493 | [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True. |
| 1721194933743 | [INFO ] LmiUtils - convert_py: warnings.warn( |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Selecting ModelBuilder |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Configuring model (will download if not available locally): MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Using llama scripts for model type: mixtral |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Compiling HuggingFace model into TensorRT engine... |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Updating TRT config... |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][WARNING] The following overrides are final. Some of them are specifically set by LMI to provide the best compilation experience. |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][WARNING] Model Config Override: qformat=int4_awq |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][WARNING] Model Config Override: calib_size=512 |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][WARNING] Model Config Override: kv_cache_dtype=int8 |
| 1721194933743 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Quantizing HF checkpoint to TRT checkpoint... |
| 1721194938596 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][INFO] Running command: python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py --model_dir MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --qformat int4_awq --kv_cache_dtype int8 --calib_size 512 --batch_size 32 --tp_size 4 --awq_block_size 64 |
| 1721194939003 | [INFO ] LmiUtils - convert_py: [LMI TRTLLM Toolkit][152][ERROR] Exit code: 1 for command: python3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py --model_dir MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ --dtype float16 --output_dir /tmp/trtllm_llama_ckpt/ --qformat int4_awq --kv_cache_dtype int8 --calib_size 512 --batch_size 32 --tp_size 4 --awq_block_size 64 |
| 1721194939003 | [INFO ] LmiUtils - convert_py: [TensorRT-LLM] TensorRT-LLM version: 0.9.0 |
| 1721194939003 | [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True. |
| 1721194939003 | [INFO ] LmiUtils - convert_py: warnings.warn( |
| 1721194939003 | [INFO ] LmiUtils - convert_py: Initializing model from MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-AWQ |
| 1721194939003 | [INFO ] LmiUtils - convert_py: Traceback (most recent call last): |
| 1721194939003 | [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py", line 52, in <module> |
| 1721194939003 | [INFO ] LmiUtils - convert_py: autoawq does not support module quantization skipping, please upgrade autoawq package to at least 0.1.8. |
| 1721194939003 | [INFO ] LmiUtils - convert_py: Traceback (most recent call last): |
| 1721194939003 | [INFO ] LmiUtils - convert_py: File "/opt/djl/partition/trt_llm_partition.py", line 69, in <module> |
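The quantize script's own message asks for autoawq >= 0.1.8, so the version preinstalled in the container appears to be older. One possible workaround (untested, and it assumes the LMI container installs a per-model requirements.txt before the conversion step runs) would be to ship one next to serving.properties:

%%writefile requirements.txt
autoawq>=0.1.8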
How to Reproduce?
See the image configuration and serving.properties in the Description above; deploying them with the djl-tensorrtllm 0.28.0 container on SageMaker reproduces the failure.
Steps to reproduce
Retrieve the djl-tensorrtllm 0.28.0 image, package the serving.properties above, and deploy to a SageMaker endpoint; the conversion fails during endpoint startup.
What have you tried to solve it?
Environment Info
Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:
PASTE OUTPUT HERE
@ydm-amazon Please take a look.
It seems that djl-tensorrtllm cannot convert an already-quantized model, though I am not sure that is the issue. So I tried the unquantized mistralai/Mixtral-8x7B-Instruct-v0.1, and the conversion failed again with the message below:
model = sagemaker.Model(
    image_uri=image_uri,
    role=role,
    # specify all environment variable configs in this map
    env={
        "HF_MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_MAX_NUM_TOKENS": "8192",
        "OPTION_QUANTIZE": "awq",
        "HF_TOKEN": "hf_xNBRqleBjkvQPxxxxxxxxxxxxxxx",
    },
)
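The endpoint itself was created with a standard deploy call; here is a minimal sketch (instance type and endpoint name are placeholders, not the exact values used — with TENSOR_PARALLEL_DEGREE=max the model shards across every GPU on the instance):

# `model` is the sagemaker.Model defined above
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",  # placeholder multi-GPU instance
    endpoint_name="mixtral-8x7b-awq",  # placeholder name
    container_startup_health_check_timeout=1800,
)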
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 71%|███████ | 203409/287113 [00:02<00:01, 77139.42 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 74%|████████ | 211409/287113 [00:02<00:00, 76954.50 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 76%|████████ | 219409/287113 [00:03<00:00, 76684.58 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 80%|████████ | 228409/287113 [00:03<00:00, 76042.56 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 82%|█████████ | 236409/287113 [00:03<00:00, 74224.87 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 85%|█████████ | 244409/287113 [00:03<00:00, 72637.33 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 88%|█████████ | 252409/287113 [00:03<00:00, 71742.55 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 91%|█████████ | 260409/287113 [00:03<00:00, 70389.71 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 93%|██████████| 268409/287113 [00:03<00:00, 70755.62 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 96%|██████████| 276409/287113 [00:03<00:00, 71027.88 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 99%|██████████| 284409/287113 [00:03<00:00, 69815.91 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating train split: 100%|██████████| 287113/287113 [00:03<00:00, 72160.21 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating validation split: 0%| | 0/13368 [00:00<?, ? examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating validation split: 67%|███████ | 9000/13368 [00:00<00:00, 76879.14 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating validation split: 100%|██████████| 13368/13368 [00:00<00:00, 73118.40 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating test split: 0%| | 0/11490 [00:00<?, ? examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating test split: 78%|████████ | 9000/11490 [00:00<00:00, 75469.55 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Generating test split: 100%|██████████| 11490/11490 [00:00<00:00, 72635.15 examples/s] |
| 1721291393048 | [INFO ] LmiUtils - convert_py: {'quant_cfg': {'weight_quantizer': {'num_bits': 4, 'block_sizes': {-1: 64}, 'enable': True}, 'input_quantizer': {'enable': False}, 'lm_head': {'enable': False}, 'output_layer': {'enable': False}, 'default': {'enable': False}, '.query_key_value.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '.Wqkv.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '.W_pack.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '.c_attn.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '.k_proj.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}, '.v_proj.output_quantizer': {'num_bits': 8, 'axis': None, 'enable': True}}, 'algorithm': {'method': 'awq_lite', 'alpha_step': 0.1}} |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Starting quantization... |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Replaced 2787 modules to quantized modules |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Caching activation statistics for awq_lite... |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Calibrating batch 0 |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Loading extension ammo_cuda_ext... |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Loading extension ammo_cuda_ext_fp8... |
| 1721291393048 | [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). |
| 1721291393048 | [INFO ] LmiUtils - convert_py: self.register_buffer("pre_quant_scale", torch.tensor(value)) |
| 1721291393048 | [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). |
| 1721291393048 | [INFO ] LmiUtils - convert_py: value = torch.tensor(value, device=self._pre_quant_scale.device) |
| 1721291393048 | [INFO ] LmiUtils - convert_py: /usr/local/lib/python3.10/dist-packages/numpy/lib/format.py:362: UserWarning: metadata on a dtype is not saved to an npy/npz. Use another format (such as pickle) to store it. |
| 1721291393048 | [INFO ] LmiUtils - convert_py: d['descr'] = dtype_to_descr(array.dtype) |
| 1721291393048 | [INFO ] LmiUtils - convert_py: Searching awq_lite parameters... |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Calibrating batch 0 |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Calibrating batch 0 |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Quantization done. Total time used: 257.33 s. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Unknown model type MixtralForCausalLM. Continue exporting... |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: export_npz is going to be deprecated soon and replaced by safetensors. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: current rank: 0, tp rank: 0, pp rank: 0 |
| 1721291393049 | [INFO ] LmiUtils - convert_py: torch.distributed not initialized, assuming single world_size. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Warning: this is an old NPZ format and will be deprecated soon. |
| 1721291393049 | [INFO ] LmiUtils - convert_py: Traceback (most recent call last): |
| 1721291393049 | [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py", line 52, in <module> |
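Since CloudWatch cuts the traceback off at the same quantize.py line both times, here is a sketch for capturing the full error by re-running the toolkit's command directly inside the container (the arguments mirror the command logged for the first attempt; model_dir and output_dir would need adjusting):

import subprocess

# Re-run the quantization step that LmiUtils launches, capturing stderr
# directly so the traceback is not truncated by the log pipeline.
cmd = [
    "python3",
    "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/quantization/quantize.py",
    "--model_dir", "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "--dtype", "float16",
    "--output_dir", "/tmp/trtllm_ckpt/",  # placeholder output path
    "--qformat", "int4_awq",
    "--kv_cache_dtype", "int8",
    "--calib_size", "512",
    "--batch_size", "32",
    "--tp_size", "4",
    "--awq_block_size", "64",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stderr)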
Thanks for the detailed information; I will look into it more today!