Guardrail Loading Fails with Unexpectedly Large GPU Memory Requirement on Multi-GPU Server
System Info
Python version: 3.10.12
PyTorch version:
llama_models version: 0.0.42
llama_stack version: 0.0.42
llama_stack_client version: 0.0.41
Hardware: 4x A100 (40 GB VRAM per GPU)
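For context, this is roughly how the per-GPU memory situation on this machine can be confirmed (a minimal sketch using standard PyTorch CUDA calls; the exact numbers are illustrative, not taken from the logs below):

```python
import torch

# Quick per-GPU memory overview (sketch; values are illustrative, not from the logs).
for i in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(i)  # free/total bytes for device i
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i} ({name}): {free_b / 1024**3:.2f} GiB free of {total_b / 1024**3:.2f} GiB")
```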
The local-gpu-run.yaml file content is as follows:
version: '2'
built_at: '2024-10-11T00:06:23.964162'
image_name: local-gpu
docker_image: local-gpu
conda_env: null
apis:
- safety
- memory
- inference
- models
- agents
- memory_banks
- shields
providers:
  inference:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama3.1-8B-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
  - provider_id: meta1
    provider_type: meta-reference
    config:
      model: Llama-Guard-3-1B
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
  safety:
  - provider_id: meta-reference
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
      enable_prompt_guard: true
  memory:
  - provider_id: meta-reference
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta-reference
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: /home/dell/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta-reference
    provider_type: meta-reference
    config: {}
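Back-of-the-envelope, the weights alone for the three models implied by this config should fit comfortably on a single 40 GB A100 (a rough sketch, assuming bf16 weights and ignoring KV cache, activations, and CUDA context overhead):

```python
# Rough weight-memory estimate for the models in the config above
# (assumes bf16, i.e. 2 bytes/parameter; ignores KV cache, activations, CUDA context).
BYTES_PER_PARAM = 2
models = {
    "Llama3.1-8B-Instruct": 8.0e9,
    "Llama-Guard-3-1B": 1.0e9,
    "Prompt-Guard-86M": 86e6,  # loaded because enable_prompt_guard is true
}
total_gib = 0.0
for name, params in models.items():
    gib = params * BYTES_PER_PARAM / 1024**3
    total_gib += gib
    print(f"{name}: ~{gib:.1f} GiB")
print(f"Total (weights only): ~{total_gib:.1f} GiB")  # roughly 17 GiB, well under 40 GiB
```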
Information
- [ ] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
Trying to load the models and initialize the host with the following command:
docker run --gpus=all -it -p 5000:5000 -v ~/.llama/builds/docker/local-gpu-run.yaml:/app/config.yaml -v ~/.llama:/root/.llama llamastack/distribution-meta-reference-gpu python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5000
Error logs
The log file shows the following:
Loading model `Llama3.1-8B-Instruct`
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/conda/lib/python3.10/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025847130/work/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Loaded in 8.79 seconds
Loaded model...
Loading model `Llama-Guard-3-1B`
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/conda/lib/python3.10/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025847130/work/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Loaded in 4.46 seconds
Loaded model...
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /root/.llama/checkpoints/Prompt-Guard-86M and are newly initialized: ...
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 343, in <module>
fire.Fire(main)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 279, in main
impls = asyncio.run(resolve_impls(config))
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 181, in resolve_impls
impl = await instantiate_provider(
File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 268, in instantiate_provider
impl = await fn(*args)
File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/__init__.py", line 16, in get_provider_impl
await impl.initialize()
File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/safety.py", line 40, in initialize
_ = PromptGuardShield.instance(model_dir)
File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/prompt_guard.py", line 37, in instance
PromptGuardShield._instances[key] = PromptGuardShield(
File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/prompt_guard.py", line 66, in __init__
model = AutoModelForSequenceClassification.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4091, in from_pretrained
dispatch_model(model, **device_map_kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/big_modeling.py", line 494, in dispatch_model
model.to(device)
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2958, in to
return super().to(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 39.38 GiB of which 144.25 MiB is free. Process 878221 has 16.76 GiB memory in use. Process 878311 has 3.98 GiB memory in use. Process 878077 has 18.48 GiB memory in use. Of the allocated memory 18.08 GiB is allocated by PyTorch, and 1.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
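The failure happens while dispatch_model moves Prompt-Guard-86M onto GPU 0, which is already occupied by the two inference processes. Outside of llama-stack, that step can be reproduced in isolation with a sketch like the one below (the checkpoint path is taken from the log above; the explicit device choice and dtype are my assumptions, not what prompt_guard.py actually does):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Standalone repro sketch of the load that fails in prompt_guard.py.
# Assumptions: checkpoint path from the log above; bf16 weights; GPU 1 chosen
# explicitly instead of whatever accelerate's device_map dispatch picks.
model_dir = "/root/.llama/checkpoints/Prompt-Guard-86M"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16
).to("cuda:1")
print(f"Loaded on {next(model.parameters()).device}")
```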
Expected behavior
I expect the models to be loaded into GPU VRAM with the expected memory consumption. Note that the same configuration does not produce errors on a 1x H100 machine.