CUDA out of memory, even when setting INFINITY_DEVICE to cpu.

Open ch9hn opened this issue 8 months ago • 0 comments

System Info

image: michaelf34/infinity:0.0.76

configMap:

HF_HOME=/mnt/llm-models
INFINITY_LOG_LEVEL=debug
INFINITY_ANONYMOUS_USAGE_STATS=0
INFINITY_MODEL_ID="vidore/colpali-v1.2-merged"
INFINITY_BATCH_SIZE=64
INFINITY_PORT=7997
INFINITY_DEVICE=cpu

command:

infinity_emb
v2

Information

[x] Docker + cli
[ ] pip + cli
[ ] pip + usage of Python interface

Tasks

[x] An officially supported CLI command
[ ] My own modifications

Reproduction

Maybe the error is coming from this line, which calls self.model.cuda() even though device=cpu was selected: https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/transformer/vision/torch_vision.py#L105
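A minimal sketch of the behaviour expected here, not the actual infinity_emb code: the configured device decides where the model is placed, rather than an unconditional call to .cuda(). The helper name resolve_device and the accepted values ("cpu", "cuda", "auto") are assumptions for illustration only.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Pick the device string to place a model on.

    Hypothetical helper, not part of infinity_emb. `requested` stands in
    for the value of INFINITY_DEVICE; `cuda_available` would normally be
    the result of torch.cuda.is_available().
    """
    requested = requested.strip().lower()
    if requested == "cpu":
        # An explicit CPU request must never touch the GPU.
        return "cpu"
    if requested == "cuda":
        return "cuda"
    if requested == "auto":
        # Only fall back to the GPU when the user did not choose a device.
        return "cuda" if cuda_available else "cpu"
    raise ValueError(f"unsupported device: {requested!r}")


if __name__ == "__main__":
    # With INFINITY_DEVICE=cpu the model stays on the CPU,
    # even when a GPU is present on the host.
    print(resolve_device("cpu", cuda_available=True))
```

With PyTorch, the result could then be passed to the model as model.to(resolve_device(device, torch.cuda.is_available())) in place of model.cuda(), so that a cpu setting never allocates GPU memory.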

Logs

DEBUG 2025-04-16 12:43:09,714 infinity_emb DEBUG: Creating AsyncEmbeddingEngine from EngineArgs(model_name_or_path='vidore/colpali-v1.2-merged ', batch_size=64, revision=None, trust_remote_code=True, engine=<InferenceEngine.torch: 'torch'>, model_warmup=True, vector_disk_cache_path='', device=<Device.cpu: 'cpu'>, device_id=DeviceID(), compile=False, bettertransformer=True, dtype=<Dtype.auto: 'auto'>, pooling_method=<PoolingMethod.auto: 'auto'>, lengths_via_tokenize=False, embedding_dtype=<EmbeddingDtype.float32: 'float32'>, served_model_name='vidore/colpali-v1.2-merged', _loading_strategy=LoadingStrategy(device_mapping=['cpu'], loading_dtype=torch.float32, quantization_dtype=None, device_placement='cpu')) engine.py:70
INFO 2025-04-16 12:43:09,719 infinity_emb INFO: model=vidore/colpali-v1.2-merged selected, using engine=torch and device=cpu select_model.py:64
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
ERROR: Traceback (most recent call last):
  File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 693, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/app/infinity_emb/infinity_server.py", line 88, in lifespan
    app.engine_array = AsyncEngineArray.from_args(engine_args_list) # type: ignore
  File "/app/infinity_emb/engine.py", line 306, in from_args
    return cls(engines=tuple(engines))
  File "/app/infinity_emb/engine.py", line 71, in from_args
    engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
  File "/app/infinity_emb/engine.py", line 56, in __init__
    self._model_replicas, self._min_inference_t, self._max_inference_t = select_model(
  File "/app/infinity_emb/inference/select_model.py", line 81, in select_model
    loaded_engine = unloaded_engine.value(engine_args=engine_args_copy)
  File "/app/infinity_emb/transformer/vision/torch_vision.py", line 105, in __init__
    self.model = self.model.cuda()
  File "/app/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3082, in cuda
    return super().cuda(*args, **kwargs)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1053, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 903, in _apply
    module._apply(fn)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 903, in _apply
    module._apply(fn)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 903, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 930, in _apply
    param_applied = fn(param)
  File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1053, in <lambda>
    return self._apply(lambda t: t.cuda(device))
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 44.31 GiB of which 93.00 MiB is free. Process 234511 has 37.57 GiB memory in use. Process 2796086 has 6.62 GiB memory in use. Of the allocated memory 6.01 GiB is allocated by PyTorch, and 202.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR: Application startup failed. Exiting.

Apr 16 '25 12:04 ch9hn