DeepSpeed-MII
How to load my local model
```python
import mii

mii_configs = {"tensor_parallel": 2, "dtype": "fp16", "skip_model_check": True}

mii.deploy(task="text-generation",
           model="/home/chenweisheng/final_models/model-selection-merged/vicuna_13b",
           deployment_name="vicuna_13b_deployment",
           mii_config=mii_configs)
```
I tried this, but it didn't work. The only output I see is: [2023-09-15 09:02:17,889] [INFO] [server.py:110:_wait_until_server_is_live] waiting for server to start...
Hi @UncleFB I just tested this locally (with a different model) and it works for me. Can you verify that the path you are providing points to a directory with a HuggingFace-style checkpoint? For example, my directory contains the following:
config.json model.safetensors special_tokens_map.json tokenizer_config.json tokenizer.json
It took seven minutes before the model even started loading. And whether I set tensor_parallel to 2 or 4, I get an OOM error. Isn't the model supposed to be split across multiple GPUs? My model is vicuna_13b and each of my GPUs has 24 GB of memory.
Is it possible to load a fine-tuned LLaMA model (HuggingFace) using this?
@UncleFB How much GPU memory do you have? You may need to enable load_with_sys_mem: https://github.com/microsoft/DeepSpeed-MII/blob/0182fa565d3fa30f186162c48ae68bac4d2866ef/mii/config.py#L45
The reason for this is that the current implementation of DeepSpeed-Inference requires loading a full copy of the model for each process before the model is split across multiple GPUs. We can avoid these OOM issues by using system memory instead. Please let me know if this solved your problem!
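For illustration, here is a minimal sketch of the deployment from the top of this thread with load_with_sys_mem enabled (model path, deployment name, and other settings are taken from the original post):

```python
import mii

# Sketch: same deployment as above, but load the checkpoint into system (CPU)
# memory first, so each process does not need a full GPU copy of the model
# before it is split across GPUs.
mii_configs = {
    "tensor_parallel": 2,
    "dtype": "fp16",
    "skip_model_check": True,
    "load_with_sys_mem": True,
}

mii.deploy(task="text-generation",
           model="/home/chenweisheng/final_models/model-selection-merged/vicuna_13b",
           deployment_name="vicuna_13b_deployment",
           mii_config=mii_configs)
```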
> Is it possible to load a fine-tuned LLaMA model (HuggingFace) using this?
@fr-ashikaumagiliya yes, this should be possible. Under the hood, we are using transformers.pipeline to load the model and tokenizer. So if you are able to load the model with transformers.pipeline(task="text-generation", model="/path/to/your/model"), then it should work!
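As a quick sanity check, something like the following sketch (with the placeholder path replaced by your own checkpoint directory) should load the fine-tuned model if MII can:

```python
from transformers import pipeline

# Sketch: if this loads your fine-tuned checkpoint, MII should be able to
# load it too, since it uses transformers.pipeline under the hood.
pipe = pipeline(task="text-generation", model="/path/to/your/model")
print(pipe("Hello, my name is", max_new_tokens=20))
```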
@mrwyattii Thank you for your help. We have eight 24 GB GPUs, but it seems that I cannot specify which GPUs to use by setting CUDA_VISIBLE_DEVICES.
Please note that the CUDA_VISIBLE_DEVICES environment variable does not work with DeepSpeed. Therefore you must provide the GPU indices via deploy_rank in your mii_config. For example, if you want to use GPUs 4, 5, 6, 7: mii_config = {"deploy_rank": [4, 5, 6, 7]} (see https://github.com/microsoft/DeepSpeed-MII/blob/0182fa565d3fa30f186162c48ae68bac4d2866ef/mii/config.py#L48).
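Putting that together with the deployment from the original post, a sketch pinning the server to GPUs 4-7 might look like this (the tensor_parallel value here is an assumption chosen to match the number of ranks):

```python
import mii

# Sketch: select GPUs 4-7 via deploy_rank instead of CUDA_VISIBLE_DEVICES.
mii_configs = {
    "tensor_parallel": 4,
    "dtype": "fp16",
    "skip_model_check": True,
    "deploy_rank": [4, 5, 6, 7],
}

mii.deploy(task="text-generation",
           model="/home/chenweisheng/final_models/model-selection-merged/vicuna_13b",
           deployment_name="vicuna_13b_deployment",
           mii_config=mii_configs)
```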
Thanks again. Another question: why does it take so long before my local model starts loading? I keep seeing the "waiting for server to start" log, and the model doesn't start loading until six or seven minutes later.
I tried to set deploy_rank, but it doesn't seem to work.
I also find that deploy_rank does not seem to work.
I can't load my local model; it always asks me to add the Hugging Face token.
My code:
```python
import mii

model_path = "/home/ubuntu/models/meta-llama_Llama-2-13b-chat-hf"

mii_configs = {
    "tensor_parallel": 4,
    "dtype": "fp16",
    "enable_restful_api": True,
    "trust_remote_code": True,
    "max_tokens": 4096,
    "hf_auth_token": None,
}

mii.deploy(
    task="text-generation",
    model="meta-llama/Llama-2-13b-chat-hf",
    deployment_name="llama-2-13b-chat-hf-deployment",
    deployment_type="local",
    model_path=model_path,
    mii_config=mii_configs,
)
```
The exception:

```
File "/home/ubuntu/deepspeed/venv/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 751, in pipeline
huggingface_hub.utils._errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-653ceea0-268cb46920c0b60a096bbefb;671dae50-f9f4-453e-8789-cacc20f44080)
Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/resolve/main/config.json.
Repo model meta-llama/Llama-2-13b-chat-hf is gated. You must be authenticated to access it.
```
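For comparison, the working snippet at the top of this thread passes the local checkpoint directory directly as model (with skip_model_check enabled) rather than the gated Hub ID, so nothing is fetched from the Hub. A hedged sketch of that approach for this setup:

```python
import mii

# Sketch (not a confirmed fix): point `model` at the local checkpoint
# directory instead of the gated meta-llama Hub ID and skip the model-name
# check, following the first post in this thread.
model_path = "/home/ubuntu/models/meta-llama_Llama-2-13b-chat-hf"

mii_configs = {
    "tensor_parallel": 4,
    "dtype": "fp16",
    "enable_restful_api": True,
    "trust_remote_code": True,
    "max_tokens": 4096,
    "skip_model_check": True,
}

mii.deploy(task="text-generation",
           model=model_path,
           deployment_name="llama-2-13b-chat-hf-deployment",
           deployment_type="local",
           mii_config=mii_configs)
```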
I have the same situation: setting mii_config = {"deploy_rank": [4, 5, 6, 7]} does not work.