leiwen83
For multi-node parallel runs, does DeepSpeed currently use pdsh or MPI as the launcher? Have you run into instability caused by network problems?
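For reference, a minimal sketch of how the launcher can be selected explicitly; the hostnames and the `train.py` entry point here are placeholders, not anything from this thread:

```bash
# Hostfile listing worker nodes and GPU slots (node1/node2 are placeholders).
cat > hostfile <<EOF
node1 slots=8
node2 slots=8
EOF

# DeepSpeed defaults to pdsh for multi-node launches; --launcher switches to
# an MPI-based launcher instead. train.py is a placeholder entry point.
deepspeed --hostfile hostfile --launcher pdsh train.py
deepspeed --hostfile hostfile --launcher openmpi train.py
```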
The tests/ folder also suffers from this vllm_ops-not-defined issue. I created this PR for pytest, https://github.com/vllm-project/vllm/pull/4231, which forces pytest to pick up the module from the installed location.
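For reference, a quick way to check which copy of vllm is being imported, plus one common workaround for the shadowing problem; this is only a sketch and not necessarily what the PR does, and the paths are placeholders:

```bash
# If this prints a path inside the source checkout instead of site-packages,
# the in-tree vllm package is shadowing the installed one, and the compiled
# vllm_ops extension will not be found.
python3 -c "import vllm; print(vllm.__file__)"

# One workaround: run pytest from outside the repo root so the current
# directory does not shadow the installed package (paths are placeholders).
cd /tmp && python3 -m pytest /path/to/vllm/tests
```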
@cadedaniel I submitted a rebased PR, which keeps the concat logic as before. num_spec is made to aggregate the "k" numbers.
Tried the qwen2 multi-GPU support patch https://github.com/mlc-ai/mlc-llm/pull/1985 with the latest code (https://github.com/mlc-ai/mlc-llm/commit/ae97b8d3763cd9ef9179140027d206622d185d21), but got the error below when compiling the model.

```
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py",...
```
After reverting to a0484bd53854a508283be47d62b704b2c737259d, with https://github.com/mlc-ai/mlc-llm/pull/1985 applied, I still get CUDA OOM for the 72B int4 model on 4 GPUs.
@tlopex any idea?
> @leiwen83 Are you running without quantization? The 72B model doesn't seem to fit on four 3090s, each with 24GB vRAM.

I am running with quantization, converted from the HF model. Here is...
I found there is a setting named "MLC_INTERNAL_PRESHARD_NUM". Do I need to set this when converting the model, before serving?
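For reference, a minimal sketch of how I would try it, under the assumption that MLC_INTERNAL_PRESHARD_NUM is read at weight-conversion time and should match the tensor-parallel degree (4 GPUs here); the model and output paths simply mirror the convert_weight command quoted in these comments:

```bash
# Assumption: pre-shard the converted weights for 4 GPUs; whether this env var
# is honored by convert_weight is exactly the question being asked above.
MLC_INTERNAL_PRESHARD_NUM=4 python3 -m mlc_llm convert_weight \
    --quantization q4f16_1 Qwen1.5-72B-Chat --output Qwen1.5-72B-Chat_tvm
```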
Still getting an error...

```
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line...
```
After pulling the latest commit, converting the model reports an error.

```
# python3 -m mlc_llm convert_weight --quantization q4f16_1 Qwen1.5-72B-Chat --output Qwen1.5-72B-Chat_tvm
[2024-03-28 14:47:03] INFO auto_config.py:115: Found model configuration:...
```