[BUG] Running MiniChat-2-3B on Win11 WSL2 fails
Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
Is there an existing answer for this in FAQ?
- [X] I have searched FAQ
Current Behavior
The GPU has 12 GB of VRAM. I used the following command: `bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat`. It fails at startup with the output below:
Console output:
qanything-container-local | The LLM service is starting up; this may take a while... you have time to make a coffee :)
qanything-container-local | % Total % Received % Xferd Average Speed Time Time Time Current
qanything-container-local | Dload Upload Total Spent Left Speed
100 13 100 13 0 0 14099 0 --:--:-- --:--:-- --:--:-- 13000
qanything-container-local | The llm service is starting up, it can be long... you have time to make a coffee :)
qanything-container-local | The LLM service is starting up; this may take a while... you have time to make a coffee :)
qanything-container-local | Starting the LLM service timed out; automatically checking /workspace/qanything_local/logs/debug_logs/fastchat_logs/fschat_model_worker_7801.log for errors...
qanything-container-local | No clear error message was detected in /workspace/qanything_local/logs/debug_logs/fastchat_logs/fschat_model_worker_7801.log. Please inspect /workspace/qanything_local/logs/debug_logs/fastchat_logs/fschat_model_worker_7801.log manually for more information.
fschat_model_worker_7801.log:
2024-04-09 17:04:57 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=7801, worker_address='http://0.0.0.0:7801', controller_address='http://0.0.0.0:7800', model_path='/model_repos/CustomLLM/MiniChat-2-3B', revision='main', device='cuda', gpus='0', num_gpus=1, max_gpu_memory=None, dtype='bfloat16', load_8bit=True, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, enable_exllama=False, exllama_max_seq_len=4096, exllama_gpu_split=None, exllama_cache_8bit=False, enable_xft=False, xft_max_seq_len=4096, xft_dtype=None, model_names=None, conv_template='minichat', embed_in_truncate=False, limit_worker_concurrency=5, stream_interval=2, no_register=False, seed=None, debug=False, ssl=False)
2024-04-09 17:04:57 | INFO | model_worker | Loading the model ['MiniChat-2-3B'] on worker e44e3aad ...
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-04-09 17:04:58 | ERROR | stderr |
0%| | 0/1 [00:00<?, ?it/s]
Expected Behavior
No response
Environment
- OS: Windows 11 WSL2
- NVIDIA Driver: 537.70
- CUDA:
- docker: Docker Desktop 4.28.0 (139021)
- docker-compose:
- NVIDIA GPU: RTX 3060
- NVIDIA GPU Memory: 12G
QAnything logs
(Identical to the console output and fschat_model_worker_7801.log contents pasted under "Current Behavior" above.)
Steps To Reproduce
No response
Anything else?
No response
I am hitting the same problem with the same configuration, and the log errors are identical. Is there any solution?
The same error also appears in a Linux environment:
0%| | 0/1 [00:00<?, ?it/s]
1 10:58:03 | ERROR | stderr |
According to this issue, the problem can be resolved by adding more RAM or swap: https://github.com/oobabooga/text-generation-webui/issues/2509
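For the WSL2 setups in this thread, the usual way to raise memory and swap limits is a `.wslconfig` file in the Windows user profile. A minimal sketch follows; the sizes are illustrative assumptions, not verified requirements for MiniChat-2-3B:

```ini
; %UserProfile%\.wslconfig -- apply by running `wsl --shutdown`, then restart WSL
[wsl2]
memory=16GB   ; RAM available to the WSL2 VM (default is about half of host RAM)
swap=32GB     ; larger swap lets model loading spill over instead of being OOM-killed
```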
It feels like it is trying to download something over the network, but the network is unreachable.
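To rule the network theory in or out, one quick check is whether Hugging Face is reachable from inside the container. This is only a sketch: the container name is taken from the logs above, and `HF_ENDPOINT` (honored by huggingface_hub) matters only if the worker actually attempts a download rather than loading the local files under /model_repos/CustomLLM/MiniChat-2-3B:

```bash
# Check connectivity to huggingface.co from inside the QAnything container
docker exec qanything-container-local curl -sI https://huggingface.co | head -n 1

# If huggingface.co is unreachable, a mirror can be tried by exporting
# HF_ENDPOINT before starting the service (assumption: the worker goes
# through huggingface_hub, which honors this variable)
export HF_ENDPOINT=https://hf-mirror.com
```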
- OS: Windows 11 WSL2
- NVIDIA Driver: 537.70
- CUDA:
- docker: Docker version 25.0.2
- docker-compose:
- NVIDIA GPU: RTX 4080
- NVIDIA GPU Memory: 16G

With this configuration, the log errors are identical to the OP's. Is there any solution?
+1, no idea what the problem is.
The worker ran out of memory and was killed by the system (OOM). Check memory usage while it is running.
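A minimal way to confirm the OOM theory, assuming you can open a shell in WSL2 (or on the Linux host) while the model loads:

```bash
# Watch RAM and swap usage in real time while run.sh loads the model
watch -n 1 free -h

# After the worker dies, check whether the kernel OOM-killer terminated it
sudo dmesg | grep -iE 'killed process|out of memory'
```

If `dmesg` shows the worker process being killed, increasing RAM or swap (e.g. via `.wslconfig` as sketched above) is the likely fix.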