Guardrail Loading Failed with Unexpected Large GPU Memory Requirement at Multi-GPU Server

Open dawenxi-007 opened this issue 1 year ago • 7 comments

System Info

Python version: 3.10.12
PyTorch version:
llama_models version: 0.0.42
llama_stack version: 0.0.42
llama_stack_client version: 0.0.41
Hardware: 4xA100 (40GB VRAM/GPU)

The content of the local-gpu-run.yaml file is as follows:

version: '2'
built_at: '2024-10-11T00:06:23.964162'
image_name: local-gpu
docker_image: local-gpu
conda_env: null
apis:
- safety
- memory
- inference
- models
- agents
- memory_banks
- shields
providers:
  inference:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama3.1-8B-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
  - provider_id: meta1
    provider_type: meta-reference
    config:
      model: Llama-Guard-3-1B
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
  safety:
  - provider_id: meta-reference
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
      enable_prompt_guard: true
  memory:
  - provider_id: meta-reference
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta-reference
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: /home/dell/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta-reference
    provider_type: meta-reference
    config: {}

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

Trying to load the models to initialize the host with the following command:

docker run --gpus=all -it -p 5000:5000 -v ~/.llama/builds/docker/local-gpu-run.yaml:/app/config.yaml -v ~/.llama:/root/.llama llamastack/distribution-meta-reference-gpu python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5000

Error logs

Log file shows the following:

Loading model `Llama3.1-8B-Instruct`
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/conda/lib/python3.10/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered
internally at /opt/conda/conda-bld/pytorch_1708025847130/work/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Loaded in 8.79 seconds
Loaded model...
Loading model `Llama-Guard-3-1B`
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/conda/lib/python3.10/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered
internally at /opt/conda/conda-bld/pytorch_1708025847130/work/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Loaded in 4.46 seconds
Loaded model...
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /root/.llama/checkpoints/Prompt-Guard-86M and are newly initialized:  ...
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 343, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 279, in main
    impls = asyncio.run(resolve_impls(config))
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 181, in resolve_impls
    impl = await instantiate_provider(
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 268, in instantiate_provider
    impl = await fn(*args)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/__init__.py", line 16, in get_provider_impl
    await impl.initialize()
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/safety.py", line 40, in initialize
    _ = PromptGuardShield.instance(model_dir)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/prompt_guard.py", line 37, in instance
    PromptGuardShield._instances[key] = PromptGuardShield(
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/prompt_guard.py", line 66, in __init__
    model = AutoModelForSequenceClassification.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4091, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/big_modeling.py", line 494, in dispatch_model
    model.to(device)
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 39.38 GiB of which 144.25 MiB is free. Process 878221 has 16.76 GiB memory in use. Process 878311 has 3.98 GiB memory in use. Process 878077 has 18.48 GiB memory in use. Of the allocated memory 18.08 GiB is allocated by PyTorch, and 1.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Expected behavior

I expect the models to load successfully into GPU VRAM with reasonable memory consumption. Note that the same configuration does not give errors on a 1xH100 machine.

dawenxi-007 avatar Oct 25 '24 20:10 dawenxi-007

Note that the same configuration does not give errors on a 1xH100 machine.

The error occurs because you ran out of memory on GPU 0; our meta-reference provider can only load models onto one device.
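The figures in the out-of-memory message from the original report can be tallied to see how full GPU 0 already was. The sketch below is only an illustration: it parses the quoted message text with regular expressions (the helper name per_process_gib is hypothetical) and compares the requested allocation with the reported free memory. It does not query the GPU itself.

```python
import re

# Text copied from the OutOfMemoryError in the original report.
OOM_MESSAGE = (
    "Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 39.38 GiB "
    "of which 144.25 MiB is free. Process 878221 has 16.76 GiB memory in use. "
    "Process 878311 has 3.98 GiB memory in use. "
    "Process 878077 has 18.48 GiB memory in use."
)


def per_process_gib(message):
    """Return {pid: GiB in use} for every process listed in the message."""
    pattern = r"Process (\d+) has ([\d.]+) GiB memory in use"
    return {int(pid): float(gib) for pid, gib in re.findall(pattern, message)}


usage = per_process_gib(OOM_MESSAGE)
total_used_gib = sum(usage.values())
capacity_gib = float(re.search(r"total capacity of ([\d.]+) GiB", OOM_MESSAGE).group(1))
free_mib = float(re.search(r"of which ([\d.]+) MiB is free", OOM_MESSAGE).group(1))
requested_mib = float(re.search(r"Tried to allocate ([\d.]+) MiB", OOM_MESSAGE).group(1))

print(f"{len(usage)} processes use {total_used_gib:.2f} GiB of {capacity_gib:.2f} GiB on GPU 0")
print(f"requested {requested_mib:.2f} MiB, free {free_mib:.2f} MiB, fits: {requested_mib <= free_mib}")
```

Three processes already hold almost all of GPU 0, so even the 172 MiB requested while loading Prompt-Guard-86M does not fit, while the other three GPUs are not mentioned in the message at all.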

  1. You can choose not to load Prompt Guard by setting enable_prompt_guard: false.
  2. Or use a remote inference provider (e.g. TGI) to run Llama3.1-8B-Instruct on another GPU. E.g. run TGI server on GPU=1 via
docker run --rm -it --network host -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus 1 ghcr.io/huggingface/text-generation-inference:latest --usage-stats on --sharded false --model-id meta-llama/Llama3.1-8B-Instruct --port 5009

Then set inference to point to the TGI server.

inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:5009
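Before pointing the stack at the TGI server, it can help to confirm that the server is reachable at the configured URL. The following is a minimal sketch using only the Python standard library; it assumes the TGI server exposes its usual /health route at the base URL from the config above, and the function name tgi_is_healthy is hypothetical.

```python
import urllib.error
import urllib.request


def tgi_is_healthy(base_url, timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or a non-2xx status.
        return False


# URL taken from the inference provider config above.
print(tgi_is_healthy("http://127.0.0.1:5009"))
```

If this prints False, the problem is on the TGI side (container not running, wrong port, model still loading) rather than in the llama-stack config.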

yanxi0830 avatar Oct 29 '24 02:10 yanxi0830

I verified suggestion 1: disabling Prompt Guard worked. However, I don't understand why a 40GB GPU is not enough to hold an additional Prompt-Guard-86M model.

For suggestion 2, if we run remote tgi for Llama3.1-8B-Instruct, where should I run Prompt-Guard-86M and Llama-Guard-3-1B? Which image is recommended (distribution-meta-reference-gpu or 'llamastack-local-gpu' or 'llamastack-local-cpu')? Using distribution-meta-reference-gpu gave me the following error:

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 181, in resolve_impls
    impl = await instantiate_provider(
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 268, in instantiate_provider
    impl = await fn(*args)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/adapters/inference/tgi/__init__.py", line 28, in get_adapter_impl
    await impl.initialize(config)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/adapters/inference/tgi/tgi.py", line 171, in initialize
    endpoint_info = await self.client.get_endpoint_info()
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/inference/_generated/_async_client.py", line 3144, in get_endpoint_info
    async with self._get_client_session() as client:
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/inference/_generated/_async_client.py", line 2998, in _get_client_session
    aiohttp = _import_aiohttp()
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/inference/_common.py", line 117, in _import_aiohttp
    raise ImportError("Please install aiohttp to use `AsyncInferenceClient` (`pip install aiohttp`).")
ImportError: Please install aiohttp to use `AsyncInferenceClient` (`pip install aiohttp`).

dawenxi-007 avatar Oct 29 '24 06:10 dawenxi-007

I did some further experiments on suggestion 2 by adding the guardrail model alongside a remote TGI server. This triggers a "fairscale module not found" error even though I installed fairscale. The following is the run.yaml file:

version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
  inference:
  - provider_id: tgi0
    provider_type: remote::tgi
    config:
      url: http://127.0.0.1:80
  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama-Guard-3-1B
      quantization: null
      torch_seed: null
      max_seq_len: 4906
      max_batch_size: 1
  safety:
  - provider_id: meta-reference
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      enable_prompt_guard: false
      prompt_guard_shield:  
        model: Prompt-Guard-86M
  memory:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}

Command to run: docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml

Error message:

  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 278, in instantiate_provider
    impl = await fn(*args)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/__init__.py", line 16, in get_provider_impl
    from .inference import MetaReferenceInferenceImpl
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/inference.py", line 18, in <module>
    from .generation import Llama
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/generation.py", line 20, in <module>
    from fairscale.nn.model_parallel.initialize import (
ModuleNotFoundError: No module named 'fairscale'

If I disable the following in the yaml file, it runs ok.

  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama-Guard-3-1B
      quantization: null
      torch_seed: null
      max_seq_len: 4906
      max_batch_size: 1

The versions of the related packages are:

fairscale                                0.4.13
llama_models                             0.0.47
llama_stack                              0.0.47
llama_stack_client                       0.0.48

dawenxi-007 avatar Nov 05 '24 19:11 dawenxi-007

You are running the tgi distribution, which does not install fairscale, a required dependency of the meta-reference provider: https://github.com/meta-llama/llama-stack/blob/dcd8cfe0f3bc951328ee0c2165ec29e6d433759f/llama_stack/providers/registry/inference.py#L12

docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml

yanxi0830 avatar Nov 05 '24 20:11 yanxi0830

I did have all the following packages installed:

(llamastk_delltgi_env) tao@r7625h100:~/demo_1104/llama-stack/distributions/dell-tgi$ pip list | grep accelerate
accelerate                               1.1.0
(llamastk_delltgi_env) tao@r7625h100:~/demo_1104/llama-stack/distributions/dell-tgi$ pip list | grep blobfile
blobfile                                 3.0.0
(llamastk_delltgi_env) tao@r7625h100:~/demo_1104/llama-stack/distributions/dell-tgi$ pip list | grep torch
torch                                    2.5.1
torchvision                              0.20.1
(llamastk_delltgi_env) tao@r7625h100:~/demo_1104/llama-stack/distributions/dell-tgi$ pip list | grep transformers
transformers                             4.46.2
(llamastk_delltgi_env) tao@r7625h100:~/demo_1104/llama-stack/distributions/dell-tgi$ pip list | grep zmq
pyzmq                                    26.2.0
zmq                                      0.0.0
(llamastk_delltgi_env) tao@r7625h100:~/demo_1104/llama-stack/distributions/dell-tgi$ pip list | grep lm-format-enforcer
lm-format-enforcer                       0.10.9
(llamastk_delltgi_env) tao@r7625h100:~/demo_1104/llama-stack/distributions/dell-tgi$ pip list | grep fairscale
fairscale                                0.4.13

I even tried different versions of the fairscale package. It always gave the same error ModuleNotFoundError: No module named 'fairscale'.

dawenxi-007 avatar Nov 05 '24 21:11 dawenxi-007

I even tried different versions of the fairscale package. It always gave the same error ModuleNotFoundError: No module named 'fairscale'.

The reason is:

  1. The fairscale package is installed outside the Docker container.
  2. However, the docker run command starts a container from the distribution-tgi image, which does not have the fairscale package installed. Hence it raises a ModuleNotFoundError.
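One way to see the difference between the host environment and the container is to run the same import check in both. The sketch below is an illustration only: missing_modules is a hypothetical helper, and the module list is taken from the two import errors reported in this thread (fairscale and aiohttp). It prints the interpreter path so it is clear which environment is being inspected.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules named in the errors reported in this thread.
required = ["fairscale", "aiohttp"]

print("interpreter:", sys.executable)
print("missing:", missing_modules(required))
```

Run on the host, this reflects the pip list output shown above; run inside the container, it shows what the image actually provides, which is what the server sees.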

To be able to run both meta-reference and TGI inference, you could build your own distribution using the following build config.

llama stack build --config ./build.yaml

where ./build.yaml contains the following config:

name: tgi-meta-reference
distribution_spec:
  description: Use code from `llama_stack` itself to serve all llama stack APIs
  docker_image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime
  providers:
    inference: 
    - meta-reference
    - remote::tgi
    memory: meta-reference
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: meta-reference
image_type: docker

yanxi0830 avatar Nov 06 '24 05:11 yanxi0830

Thanks. With this config, I was able to build an image named distribution-tgi-meta-reference, and I noticed that fairscale was installed during the image build. However, I got a model-not-found error even though I downloaded the model into the ~/.llama/checkpoints folder and ran export LLAMA_CHECKPOINT_DIR=~/.llama. Judging from the error message below, it still looks in the root user's folder.

AssertionError: Could not find checkpoints in: /root/.llama/checkpoints/Llama-Guard-3-1B. Please download model using `llama download --model-id Llama-Guard-3-1B`.

How can I change the default model location for docker run with the newly built image?
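The path in the assertion is consistent with checkpoints being looked up under the home directory of the user running the server, which inside the container is /root. That lookup rule is an inference from the error message, not something the thread confirms, and checkpoint_dir below is a hypothetical helper. A small sketch to run inside the container and see whether the expected directory is actually present:

```python
from pathlib import Path


def checkpoint_dir(model_id, home=None):
    """Build <home>/.llama/checkpoints/<model_id>; home defaults to the current user's."""
    base = Path(home) if home is not None else Path.home()
    return base / ".llama" / "checkpoints" / model_id


# Model name taken from the AssertionError above.
path = checkpoint_dir("Llama-Guard-3-1B")
print(path, "exists:", path.is_dir())
```

The docker run command in the original report mounted the host directory with -v ~/.llama:/root/.llama. If the newly built image is started without an equivalent mount, /root/.llama/checkpoints would be empty inside the container regardless of what is downloaded on the host.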

dawenxi-007 avatar Nov 06 '24 20:11 dawenxi-007

This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

github-actions[bot] avatar Mar 14 '25 00:03 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant!

github-actions[bot] avatar Apr 13 '25 00:04 github-actions[bot]