Hi DevTeam,
Could you give me a hand with this issue? Thanks so much!
After installing the whl package successfully by following this guide:
cd rtp-llm
For a cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
Use the corresponding whl from the release version. Here's an example for the cuda11 version 0.1.0; for the cuda12 whl package, please check the release page.
pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
Start the HTTP service:
cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
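For reference, a quick environment sanity check before launching the server (a minimal sketch, assuming torch is installed from the requirements file above; the qwen-7b fallback path just mirrors the one used below and is not required by rtp-llm):
import os
import torch

# Report the torch build and the CUDA toolkit version it was compiled against.
print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available(), "GPUs:", torch.cuda.device_count())

# Confirm the tokenizer/checkpoint directories passed via env vars actually exist.
for env_var in ("TOKENIZER_PATH", "CHECKPOINT_PATH"):
    path = os.environ.get(env_var, "/opt/HF-MODEL/huggingface-model/qwen-7b")
    print(env_var, "->", path, "exists:", os.path.isdir(path))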
Issues
It generated the following error; could you give some suggestions?
(rtp-llm) h@acc:/opt/HF-MODEL$ TOKENIZER_PATH=/opt/HF-MODEL/huggingface-model/qwen-7b CHECKPOINT_PATH=/opt/HF-MODEL/huggingface-model/qwen-7b MODEL_TYPE=qwen FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
[process-385289][root][05/10/2024 15:11:35][init.py:():14][INFO] init logger end
[process-385289][root][05/10/2024 15:11:37][init.py:():28][INFO] no internal_source found
[process-385289][root][05/10/2024 15:11:37][hippo_helper.py:HippoHelper():13][INFO] get container_ip from socket:127.0.1.1
[process-385289][root][05/10/2024 15:11:37][report_worker.py:init():31][INFO] kmonitor report default tags: {}
[process-385289][root][05/10/2024 15:11:37][report_worker.py:init():44][INFO] test mode, kmonitor metrics not reported.
[process-385289][root][05/10/2024 15:11:37][gpu_util.py:init():30][INFO] detected [4] gpus
[process-385289][root][05/10/2024 15:11:38][init.py:():9][INFO] no internal_source found
[process-385289][root][05/10/2024 15:11:38][start_server.py:local_rank_start():30][INFO] start local WorkerInfo: [ip=127.0.1.1 server_port=8088 gang_hb_port=8089 name= info=None ], ParallelInfo:[ tp_size=1 pp_size=1 world_size=1 world_rank=0 local_world_size=1 ]
[process-385289][root][05/10/2024 15:11:38][inference_server.py:_init_controller():87][INFO] CONCURRENCY_LIMIT to 32
[process-385289][root][05/10/2024 15:11:38][gang_server.py:start():173][INFO] world_size==1, do not start gang_server
[process-385289][root][05/10/2024 15:11:38][util.py:copy_gemm_config():131][INFO] not found gemm_config in HIPPO_APP_INST_ROOT, not copy
[process-385289][root][05/10/2024 15:11:38][inference_worker.py:init():51][INFO] starting InferenceWorker
[process-385289][root][05/10/2024 15:11:38][model_factory.py:create_normal_model_config():116][INFO] load model from tokenizer_path: /opt/HF-MODEL/huggingface-model/qwen-7b, ckpt_path: /opt/HF-MODEL/huggingface-model/qwen-7b, lora_infos: {}, ptuning_path: None
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():303][INFO] max_seq_len: 8192
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_config_with_sparse_config():172][INFO] read sparse config from: /opt/HF-MODEL/huggingface-model/qwen-7b/config.json
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:check():64][INFO] sparse config layer_num must not be empty
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_ptuning_config():260][INFO] use ptuning from model_config set by env, None
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_ptuning_config():267][INFO] load ptuing config from /opt/HF-MODEL/huggingface-model/qwen-7b/config.json
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_ptuning_config():274][INFO] read ptuning config, pre_seq_len:0, prefix_projection:False
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():313][INFO] seq_size_per_block: 8
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():315][INFO] max_generate_batch_size: 128
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():317][INFO] max_context_batch_size: 1
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():319][INFO] reserve_runtime_mem_mb: 1024
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():321][INFO] kv_cache_mem_mb: -1
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():323][INFO] pre_allocate_op_mem: True
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():325][INFO] int8_kv_cache: False
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():329][INFO] tp_split_emb_and_lm_head: True
[process-385289][root][05/10/2024 15:11:38][model_weights_loader.py:estimate_load_parallel_num():610][INFO] free_mem: 23.26 model_mem: 14.38, load weights by 2 process
[process-385289][root][05/10/2024 15:11:38][model_weights_loader.py:init():87][INFO] merge lora is enable ? : False
[process-385438][root][05/10/2024 15:11:38][init.py:():14][INFO] init logger end
[process-385437][root][05/10/2024 15:11:38][init.py:():14][INFO] init logger end
[process-385437][root][05/10/2024 15:11:40][init.py:():28][INFO] no internal_source found
[process-385438][root][05/10/2024 15:11:40][init.py:():28][INFO] no internal_source found
[process-385437][root][05/10/2024 15:11:40][hippo_helper.py:HippoHelper():13][INFO] get container_ip from socket:127.0.1.1
[process-385437][root][05/10/2024 15:11:40][report_worker.py:init():31][INFO] kmonitor report default tags: {}
[process-385437][root][05/10/2024 15:11:40][report_worker.py:init():44][INFO] test mode, kmonitor metrics not reported.
[process-385438][root][05/10/2024 15:11:40][hippo_helper.py:HippoHelper():13][INFO] get container_ip from socket:127.0.1.1
[process-385438][root][05/10/2024 15:11:40][report_worker.py:init():31][INFO] kmonitor report default tags: {}
[process-385438][root][05/10/2024 15:11:40][report_worker.py:init():44][INFO] test mode, kmonitor metrics not reported.
[process-385438][root][05/10/2024 15:11:40][gpu_util.py:init():30][INFO] detected [4] gpus
[process-385437][root][05/10/2024 15:11:40][gpu_util.py:init():30][INFO] detected [4] gpus
[process-385438][root][05/10/2024 15:11:41][init.py:():9][INFO] no internal_source found
[process-385437][root][05/10/2024 15:11:41][init.py:():9][INFO] no internal_source found
[process-385289][root][05/10/2024 15:11:47][gpt.py:_load_weights():172][INFO] load weights time: 8.23 s
load final_layernorm.gamma to torch.Size([4096])
load final_layernorm.beta to torch.Size([4096])
+------------------------------------------+
| MODEL CONFIG |
+-----------------------+------------------+
| Options | Values |
+-----------------------+------------------+
| model_type | QWen |
| act_type | WEIGHT_TYPE.FP16 |
| weight_type | WEIGHT_TYPE.FP16 |
| max_seq_len | 8192 |
| use_sparse_head | False |
| use_multi_task_prompt | None |
| use_medusa | False |
| lora_infos | {} |
+-----------------------+------------------+
[process-385289][root][05/10/2024 15:11:47][async_model.py:init():28][INFO] first mem info: used:16259481600 free: 9510322176
[process-385289][root][05/10/2024 15:11:47][engine_creator.py:create_engine():46][INFO] executor_type: ExecutorType.Normal
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO][RANK 0][139646433424000][24-05-10 15:11:47] MMHA multi_block_mode is enabled
Segmentation fault (core dumped)
When running the example test, it generated the following error:
(rtp-llm) h@acc:/opt/HF-MODEL/rtp-llm$ python example/test.py
Fetching 24 files: 100%|██████████| 24/24 [00:00<00:00, 26051.58it/s]
load final_layernorm.gamma to torch.Size([2048])
load final_layernorm.beta to torch.Size([2048])
+------------------------------------------+
| MODEL CONFIG |
+-----------------------+------------------+
| Options | Values |
+-----------------------+------------------+
| model_type | QWen |
| act_type | WEIGHT_TYPE.FP16 |
| weight_type | WEIGHT_TYPE.FP16 |
| max_seq_len | 8192 |
| use_sparse_head | False |
| use_multi_task_prompt | None |
| use_medusa | False |
| lora_infos | None |
+-----------------------+------------------+
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO][RANK 0][140690512618112][24-05-10 14:59:40] MMHA multi_block_mode is enabled
Segmentation fault (core dumped)
This issue is caused by installing the cuda118 whl package in a cuda12 environment; please refer to the documentation and use the cuda12 whl package instead.
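For anyone hitting the same segfault, a quick way to confirm the mismatch (a minimal sketch; it assumes the wheel is installed under the distribution name maga_transformer, as in the whl filename above, and that torch reports the CUDA version it was built against):
import torch
from importlib.metadata import version

torch_cuda = torch.version.cuda          # e.g. "12.1" in a cuda12 environment
wheel = version("maga_transformer")      # e.g. "0.1.9+cuda118" from the whl above

print("torch built for CUDA:", torch_cuda)
print("installed maga_transformer wheel:", wheel)
if torch_cuda and torch_cuda.startswith("12.") and "cuda118" in wheel:
    print("Mismatch: cuda118 wheel in a cuda12 environment; install the cuda12 wheel instead.")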