DeepSpeed-MII
How to load my local model
```python
import mii

mii_configs = {"tensor_parallel": 2, "dtype": "fp16", "skip_model_check": True}

mii.deploy(task="text-generation",
           model="/home/chenweisheng/final_models/model-selection-merged/vicuna_13b",
           deployment_name="vicuna_13b_deployment",
           mii_config=mii_configs)
```
I tried this, but it didn't work. The only output I see is: [2023-09-15 09:02:17,889] [INFO] [server.py:110:_wait_until_server_is_live] waiting for server to start...
Hi @UncleFB I just tested this locally (with a different model) and it works for me. Can you verify that the path you are providing points to a directory with a HuggingFace-style checkpoint? For example, my directory contains the following:
config.json model.safetensors special_tokens_map.json tokenizer_config.json tokenizer.json
It took seven minutes before the model even started loading. And whether I set tensor_parallel to 2 or 4, I get an OOM error. Isn't the model supposed to be split across multiple GPUs? My model is vicuna_13b and each of my GPUs has 24 GB of memory.
Is it possible to load a fine-tuned LLaMA model (HuggingFace) using this?
@UncleFB How much GPU memory do you have? You may need to enable load_with_sys_mem: https://github.com/microsoft/DeepSpeed-MII/blob/0182fa565d3fa30f186162c48ae68bac4d2866ef/mii/config.py#L45
The reason for this is that the current implementation of DeepSpeed-Inference requires loading a full copy of the model for each process before the model is split across multiple GPUs. We can avoid these OOM issues by using system memory instead. Please let me know if this solved your problem!
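For illustration, here is a minimal sketch of the deployment from the top of this thread with load_with_sys_mem enabled (model path, deployment name, and other settings are taken from the original post):

```python
import mii

# Sketch: same deployment as above, but load the checkpoint into system (CPU)
# memory first, so each process does not need a full GPU copy of the model
# before it is split across GPUs.
mii_configs = {
    "tensor_parallel": 2,
    "dtype": "fp16",
    "skip_model_check": True,
    "load_with_sys_mem": True,
}

mii.deploy(task="text-generation",
           model="/home/chenweisheng/final_models/model-selection-merged/vicuna_13b",
           deployment_name="vicuna_13b_deployment",
           mii_config=mii_configs)
```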
> Is it possible to load a fine-tuned LLaMA model (HuggingFace) using this?
@fr-ashikaumagiliya yes, this should be possible. Under the hood, we are using transformers.pipeline to load the model and tokenizer. So if you are able to load the model with transformers.pipeline(task="text-generation", model="/path/to/your/model"), then it should work!
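As a quick sanity check, something like the following sketch (with the placeholder path replaced by your own checkpoint directory) should load the fine-tuned model if MII can:

```python
from transformers import pipeline

# Sketch: if this loads your fine-tuned checkpoint, MII should be able to
# load it too, since it uses transformers.pipeline under the hood.
pipe = pipeline(task="text-generation", model="/path/to/your/model")
print(pipe("Hello, my name is", max_new_tokens=20))
```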
@mrwyattii Thank you for your help. We have eight 24 GB GPUs, but it seems that I cannot specify which GPUs to use by setting CUDA_VISIBLE_DEVICES.
Please note that the CUDA_VISIBLE_DEVICES environment variable does not work with DeepSpeed. Therefore you must provide the GPU indices via deploy_rank in your mii_config. For example, if you want to use GPUs 4, 5, 6, 7: mii_config = {"deploy_rank": [4, 5, 6, 7]} (see https://github.com/microsoft/DeepSpeed-MII/blob/0182fa565d3fa30f186162c48ae68bac4d2866ef/mii/config.py#L48).
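Putting that together with the deployment from the original post, a sketch pinning the server to GPUs 4-7 might look like this (the tensor_parallel value here is an assumption chosen to match the number of ranks):

```python
import mii

# Sketch: select GPUs 4-7 via deploy_rank instead of CUDA_VISIBLE_DEVICES.
mii_configs = {
    "tensor_parallel": 4,
    "dtype": "fp16",
    "skip_model_check": True,
    "deploy_rank": [4, 5, 6, 7],
}

mii.deploy(task="text-generation",
           model="/home/chenweisheng/final_models/model-selection-merged/vicuna_13b",
           deployment_name="vicuna_13b_deployment",
           mii_config=mii_configs)
```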
Thanks again. Another question: why does it take so long before my local model starts loading? I keep seeing the "waiting for server to start" log, and the model doesn't start loading until six or seven minutes later.
I tried to set deploy_rank, but it doesn't seem to work.
I also find that deploy_rank does not seem to work.
I can't load my local model; it always asks me to add the Hugging Face token.
My code:
```python
import mii

model_path = "/home/ubuntu/models/meta-llama_Llama-2-13b-chat-hf"

mii_configs = {
    "tensor_parallel": 4,
    "dtype": "fp16",
    "enable_restful_api": True,
    "trust_remote_code": True,
    "max_tokens": 4096,
    "hf_auth_token": None,
}

mii.deploy(
    task="text-generation",
    model="meta-llama/Llama-2-13b-chat-hf",
    deployment_name="llama-2-13b-chat-hf-deployment",
    deployment_type="local",
    model_path=model_path,
    mii_config=mii_configs,
)
```
The exception:

```
File "/home/ubuntu/deepspeed/venv/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 751, in pipeline
huggingface_hub.utils._errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-653ceea0-268cb46920c0b60a096bbefb;671dae50-f9f4-453e-8789-cacc20f44080)
Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/resolve/main/config.json.
Repo model meta-llama/Llama-2-13b-chat-hf is gated. You must be authenticated to access it.
```
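For comparison, the working snippet at the top of this thread passes the local checkpoint directory directly as model (with skip_model_check enabled) rather than the gated Hub ID, so nothing is fetched from the Hub. A hedged sketch of that approach for this setup:

```python
import mii

# Sketch (not a confirmed fix): point `model` at the local checkpoint
# directory instead of the gated meta-llama Hub ID and skip the model-name
# check, following the first post in this thread.
model_path = "/home/ubuntu/models/meta-llama_Llama-2-13b-chat-hf"

mii_configs = {
    "tensor_parallel": 4,
    "dtype": "fp16",
    "enable_restful_api": True,
    "trust_remote_code": True,
    "max_tokens": 4096,
    "skip_model_check": True,
}

mii.deploy(task="text-generation",
           model=model_path,
           deployment_name="llama-2-13b-chat-hf-deployment",
           deployment_type="local",
           mii_config=mii_configs)
```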
I have the same situation: setting mii_config = {"deploy_rank": [4, 5, 6, 7]} does not work.