
rtp-llm example test issue

Open haic0 opened this issue 1 year ago • 1 comment

Hi DevTeam, could you give me a hand checking this issue? Thanks so much!

After installing the whl package successfully, I followed this guide:

cd rtp-llm

For a cuda12 environment, please use requirements_torch_gpu_cuda12.txt

pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
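Before picking a requirements file, a quick way to check which CUDA version the machine actually provides (standard NVIDIA tooling, nothing rtp-llm-specific):

# driver-side CUDA version is shown in the nvidia-smi header line
nvidia-smi
# toolkit version, if the CUDA toolkit is installed
nvcc --version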

Use the corresponding whl from the release version; here's an example for the cuda11 build of version 0.1.0. For the cuda12 whl package, please check the release page.

pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
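The wheel's local version tag (cuda118 here) should match the CUDA build of the installed torch; a quick sanity check, using only stock torch/pip commands:

# prints the CUDA version torch was built against, e.g. 11.8 or 12.1
python3 -c "import torch; print(torch.version.cuda)"
# shows the installed maga_transformer version, including its +cudaXXX tag
pip3 show maga_transformer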

Start the HTTP service:

cd ../ TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
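Once the server is up (it listens on server_port=8088 per the log below), a minimal smoke test, assuming the JSON request schema with prompt and generate_config fields from the rtp-llm docs:

curl -XPOST http://localhost:8088 -d '{"prompt": "hello", "generate_config": {"max_new_tokens": 64}}'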

Issues

It generated the following output; could you give some suggestions?

(rtp-llm) h@acc:/opt/HF-MODEL$ TOKENIZER_PATH=/opt/HF-MODEL/huggingface-model/qwen-7b CHECKPOINT_PATH=/opt/HF-MODEL/huggingface-model/qwen-7b MODEL_TYPE=qwen FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
[process-385289][root][05/10/2024 15:11:35][init.py:():14][INFO] init logger end
[process-385289][root][05/10/2024 15:11:37][init.py:():28][INFO] no internal_source found
[process-385289][root][05/10/2024 15:11:37][hippo_helper.py:HippoHelper():13][INFO] get container_ip from socket:127.0.1.1
[process-385289][root][05/10/2024 15:11:37][report_worker.py:init():31][INFO] kmonitor report default tags: {}
[process-385289][root][05/10/2024 15:11:37][report_worker.py:init():44][INFO] test mode, kmonitor metrics not reported.
[process-385289][root][05/10/2024 15:11:37][gpu_util.py:init():30][INFO] detected [4] gpus
[process-385289][root][05/10/2024 15:11:38][init.py:():9][INFO] no internal_source found
[process-385289][root][05/10/2024 15:11:38][start_server.py:local_rank_start():30][INFO] start local WorkerInfo: [ip=127.0.1.1 server_port=8088 gang_hb_port=8089 name= info=None ], ParallelInfo:[ tp_size=1 pp_size=1 world_size=1 world_rank=0 local_world_size=1 ]
[process-385289][root][05/10/2024 15:11:38][inference_server.py:_init_controller():87][INFO] CONCURRENCY_LIMIT to 32
[process-385289][root][05/10/2024 15:11:38][gang_server.py:start():173][INFO] world_size==1, do not start gang_server
[process-385289][root][05/10/2024 15:11:38][util.py:copy_gemm_config():131][INFO] not found gemm_config in HIPPO_APP_INST_ROOT, not copy
[process-385289][root][05/10/2024 15:11:38][inference_worker.py:init():51][INFO] starting InferenceWorker
[process-385289][root][05/10/2024 15:11:38][model_factory.py:create_normal_model_config():116][INFO] load model from tokenizer_path: /opt/HF-MODEL/huggingface-model/qwen-7b, ckpt_path: /opt/HF-MODEL/huggingface-model/qwen-7b, lora_infos: {}, ptuning_path: None
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():303][INFO] max_seq_len: 8192
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_config_with_sparse_config():172][INFO] read sparse config from: /opt/HF-MODEL/huggingface-model/qwen-7b/config.json
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:check():64][INFO] sparse config layer_num must not be empty
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_ptuning_config():260][INFO] use ptuning from model_config set by env, None
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_ptuning_config():267][INFO] load ptuing config from /opt/HF-MODEL/huggingface-model/qwen-7b/config.json
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_ptuning_config():274][INFO] read ptuning config, pre_seq_len:0, prefix_projection:False
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():313][INFO] seq_size_per_block: 8
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():315][INFO] max_generate_batch_size: 128
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():317][INFO] max_context_batch_size: 1
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():319][INFO] reserve_runtime_mem_mb: 1024
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():321][INFO] kv_cache_mem_mb: -1
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():323][INFO] pre_allocate_op_mem: True
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():325][INFO] int8_kv_cache: False
[process-385289][root][05/10/2024 15:11:38][gpt_init_model_parameters.py:update_common():329][INFO] tp_split_emb_and_lm_head: True
[process-385289][root][05/10/2024 15:11:38][model_weights_loader.py:estimate_load_parallel_num():610][INFO] free_mem: 23.26 model_mem: 14.38, load weights by 2 process
[process-385289][root][05/10/2024 15:11:38][model_weights_loader.py:init():87][INFO] merge lora is enable ? : False
[process-385438][root][05/10/2024 15:11:38][init.py:():14][INFO] init logger end
[process-385437][root][05/10/2024 15:11:38][init.py:():14][INFO] init logger end
[process-385437][root][05/10/2024 15:11:40][init.py:():28][INFO] no internal_source found
[process-385438][root][05/10/2024 15:11:40][init.py:():28][INFO] no internal_source found
[process-385437][root][05/10/2024 15:11:40][hippo_helper.py:HippoHelper():13][INFO] get container_ip from socket:127.0.1.1
[process-385437][root][05/10/2024 15:11:40][report_worker.py:init():31][INFO] kmonitor report default tags: {}
[process-385437][root][05/10/2024 15:11:40][report_worker.py:init():44][INFO] test mode, kmonitor metrics not reported.
[process-385438][root][05/10/2024 15:11:40][hippo_helper.py:HippoHelper():13][INFO] get container_ip from socket:127.0.1.1
[process-385438][root][05/10/2024 15:11:40][report_worker.py:init():31][INFO] kmonitor report default tags: {}
[process-385438][root][05/10/2024 15:11:40][report_worker.py:init():44][INFO] test mode, kmonitor metrics not reported.
[process-385438][root][05/10/2024 15:11:40][gpu_util.py:init():30][INFO] detected [4] gpus
[process-385437][root][05/10/2024 15:11:40][gpu_util.py:init():30][INFO] detected [4] gpus
[process-385438][root][05/10/2024 15:11:41][init.py:():9][INFO] no internal_source found
[process-385437][root][05/10/2024 15:11:41][init.py:():9][INFO] no internal_source found
[process-385289][root][05/10/2024 15:11:47][gpt.py:_load_weights():172][INFO] load weights time: 8.23 s
load final_layernorm.gamma to torch.Size([4096])
load final_layernorm.beta to torch.Size([4096])
+------------------------------------------+
|               MODEL CONFIG               |
+-----------------------+------------------+
| Options               | Values           |
+-----------------------+------------------+
| model_type            | QWen             |
| act_type              | WEIGHT_TYPE.FP16 |
| weight_type           | WEIGHT_TYPE.FP16 |
| max_seq_len           | 8192             |
| use_sparse_head       | False            |
| use_multi_task_prompt | None             |
| use_medusa            | False            |
| lora_infos            | {}               |
+-----------------------+------------------+
[process-385289][root][05/10/2024 15:11:47][async_model.py:init():28][INFO] first mem info: used:16259481600 free: 9510322176
[process-385289][root][05/10/2024 15:11:47][engine_creator.py:create_engine():46][INFO] executor_type: ExecutorType.Normal
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO][RANK 0][139646433424000][24-05-10 15:11:47] MMHA multi_block_mode is enabled
Segmentation fault (core dumped)

When running the example test, it produced the following error:

(rtp-llm) h@acc:/opt/HF-MODEL/rtp-llm$ python example/test.py
Fetching 24 files: 100%|██████████| 24/24 [00:00<00:00, 26051.58it/s]
load final_layernorm.gamma to torch.Size([2048])
load final_layernorm.beta to torch.Size([2048])
+------------------------------------------+
|               MODEL CONFIG               |
+-----------------------+------------------+
| Options               | Values           |
+-----------------------+------------------+
| model_type            | QWen             |
| act_type              | WEIGHT_TYPE.FP16 |
| weight_type           | WEIGHT_TYPE.FP16 |
| max_seq_len           | 8192             |
| use_sparse_head       | False            |
| use_multi_task_prompt | None             |
| use_medusa            | False            |
| lora_infos            | None             |
+-----------------------+------------------+

[WARNING] gemm_config.in is not found; using default GEMM algo

[FT][INFO][RANK 0][140690512618112][24-05-10 14:59:40] MMHA multi_block_mode is enabled
Segmentation fault (core dumped)
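A generic way to get more detail on where the segfault comes from (standard Python/gdb tooling, not rtp-llm-specific):

# dump the Python-level stack when the process receives a fatal signal
python3 -X faulthandler example/test.py
# or capture a native backtrace
gdb -ex run -ex bt --args python3 example/test.py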

haic0 · May 10 '24 07:05

This happens because the cuda118 whl package was installed in a cuda12 environment; please refer to the documentation and use the cuda12 whl package instead.
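A minimal fix sketch along those lines; the cuda12 wheel filename below is illustrative, take the exact name from the release page:

pip3 uninstall -y maga_transformer
pip3 install -r ./open_source/deps/requirements_torch_gpu_cuda12.txt
pip3 install maga_transformer-0.1.9+cuda121-cp310-cp310-manylinux1_x86_64.whl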

dongjiyingdjy · May 10 '24 09:05