CUDA out of memory, even when setting INFINITY_DEVICE to cpu.
System Info
image: michaelf34/infinity:0.0.76
configMap:
- HF_HOME=/mnt/llm-models
- INFINITY_LOG_LEVEL=debug
- INFINITY_ANONYMOUS_USAGE_STATS=0
- INFINITY_MODEL_ID="vidore/colpali-v1.2-merged"
- INFINITY_BATCH_SIZE=64
- INFINITY_PORT=7997
- INFINITY_DEVICE=cpu
command:
- infinity_emb
- v2
Information
- [x] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [x] An officially supported CLI command
- [ ] My own modifications
Reproduction
The error may be coming from this line, which calls self.model.cuda() even though the device is set to cpu: https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/transformer/vision/torch_vision.py#L105
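For illustration, here is a minimal sketch of the behavior I would expect: the model is placed according to the configured device instead of being moved to CUDA unconditionally. This is not the infinity_emb implementation. `resolve_device`, `place_model`, and `FakeModel` are hypothetical names made up for this example, and `FakeModel` only stands in for a `torch.nn.Module` (which has a `.to(device)` method) so the snippet runs without torch or a GPU.

```python
from dataclasses import dataclass


def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a requested device setting to the device actually used.

    `requested` mirrors the INFINITY_DEVICE value; only "cpu", "cuda" and
    "auto" are handled in this sketch (an assumption, not the full list).
    """
    if requested == "cpu":
        # An explicit cpu request must never touch CUDA.
        return "cpu"
    if requested == "cuda":
        if not cuda_available:
            raise RuntimeError("device=cuda requested but CUDA is not available")
        return "cuda"
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    raise ValueError(f"unsupported device setting: {requested!r}")


@dataclass
class FakeModel:
    """Hypothetical stand-in for a torch.nn.Module; records its placement."""

    device: str = "cpu"

    def to(self, device: str) -> "FakeModel":
        self.device = device
        return self


def place_model(model, requested: str, cuda_available: bool):
    """Move the model to the resolved device instead of calling .cuda()."""
    return model.to(resolve_device(requested, cuda_available))


if __name__ == "__main__":
    # Even on a host with a GPU, device=cpu keeps the model on the CPU.
    print(place_model(FakeModel(), "cpu", cuda_available=True).device)
```

With real torch, the `cuda_available` argument would presumably come from `torch.cuda.is_available()`, and the model would be moved with `model.to(device)` rather than `model.cuda()`.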
Logs
DEBUG 2025-04-16 12:43:09,714 infinity_emb DEBUG: Creating engine.py:70
AsyncEmbeddingEngine from
EngineArgs(model_name_or_path='vidore/colpali-v1.2-merged ', batch_size=64, revision=None, trust_remote_code=True, engine=<InferenceEngine.torch: 'torch'>, model_warmup=True, vector_disk_cache_path='', device=<Device.cpu: '**cpu**'>, device_id=DeviceID(), compile=False, bettertransformer=True, dtype=<Dtype.auto: 'auto'>, pooling_method=<PoolingMethod.auto: 'auto'>, lengths_via_tokenize=False, embedding_dtype=<EmbeddingDtype.float32: 'float32'>, served_model_name='vidore/colpali-v1.2-merged', _loading_strategy=LoadingStrategy(device_mapping=['**cpu**'], loading_dtype=torch.float32, quantization_dtype=None, device_placement='**cpu**'))
INFO 2025-04-16 12:43:09,719 infinity_emb INFO: select_model.py:64
model=vidore/colpali-v1.2-merged selected, using
engine=torch and device=cpu
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
ERROR: Traceback (most recent call last):
File "/app/.venv/lib/python3.10/site-packages/starlette/routing.py", line 693, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/app/infinity_emb/infinity_server.py", line 88, in lifespan
app.engine_array = AsyncEngineArray.from_args(engine_args_list) # type: ignore
File "/app/infinity_emb/engine.py", line 306, in from_args
return cls(engines=tuple(engines))
File "/app/infinity_emb/engine.py", line 71, in from_args
engine = cls(**engine_args.to_dict(), _show_deprecation_warning=False)
File "/app/infinity_emb/engine.py", line 56, in __init__
self._model_replicas, self._min_inference_t, self._max_inference_t = select_model(
File "/app/infinity_emb/inference/select_model.py", line 81, in select_model
loaded_engine = unloaded_engine.value(engine_args=engine_args_copy)
File "/app/infinity_emb/transformer/vision/torch_vision.py", line 105, in __init__
self.model = self.model.cuda()
File "/app/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3082, in cuda
return super().cuda(*args, **kwargs)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1053, in cuda
return self._apply(lambda t: t.cuda(device))
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 903, in _apply
module._apply(fn)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 903, in _apply
module._apply(fn)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 903, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 930, in _apply
param_applied = fn(param)
File "/app/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1053, in