iree-run-module out-of-memory when loading args for pipeline parallel Deepseek
What happened?
It seems that iree-run-module is running out of memory when trying to load function arguments.
```
c/runtime/src/iree/hal/drivers/hip/hip_allocator.c:507: RESOURCE_EXHAUSTED; HIP driver error 'hipErrorOutOfMemory' (2): out of memory; parsing input `@/home/bpetkant/ws/sharktank/experiments/deepseek/tracy-tracing/gen/inputs/prefill/arg3+.npy`; parsing function inputs
```
This happens on MI300X, which has 192 GB of HBM per GPU. The model is split into 8 pieces across 8 GPUs. The model parameters total 1250 GB across 61 layers, so the most heavily loaded GPU holds at most 8 layers, which works out to ~164 GB of weights per GPU. The KV cache adds 12 GB per GPU. We should have enough memory to load the model and arguments.
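The per-GPU budget above can be sanity-checked with a few lines of arithmetic (a sketch assuming weights are spread uniformly over the 61 layers; figures are the ones reported in this issue):

```python
import math

hbm_per_gpu_gb = 192     # MI300X HBM capacity
total_weights_gb = 1250  # reported total parameter size
num_layers = 61
num_gpus = 8
kv_cache_per_gpu_gb = 12

# With 61 layers split across 8 pipeline stages, the most heavily
# loaded GPU holds ceil(61 / 8) = 8 layers.
max_layers_per_gpu = math.ceil(num_layers / num_gpus)
weights_per_gpu_gb = total_weights_gb * max_layers_per_gpu / num_layers

budget_gb = weights_per_gpu_gb + kv_cache_per_gpu_gb
print(f"{weights_per_gpu_gb:.1f} GB weights + {kv_cache_per_gpu_gb} GB KV "
      f"= {budget_gb:.1f} GB of {hbm_per_gpu_gb} GB HBM")
```

So each GPU should need roughly 176 GB, comfortably under the 192 GB capacity, which is why the OOM is surprising.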
Steps to reproduce your issue
- Download and extract deepseek-pp-oom.zip.
- Run
  ```shell
  ./compile.sh
  python gen_prefill_inputs.py
  ./run-prefill.sh
  ```
To run the model you will need the model weights, which can be found on the internal AMD machine sharkmi300x-3 and are expected to be at /shark-dev/weights/deepseek_v3/fp16/deepseek_v3_f16.irpa.
What component(s) does this issue relate to?
No response
Version information
3.5.0rc20250529
Additional context
No response
I don't think anyone is going to try to reproduce this given the steps provided. In cases like these you will need to capture a Tracy trace and post it. It is not iree-run-module itself that is running out of memory; it is your program, and the fix will likely lie there. We can provide assistance, but we can't fix your program ourselves.
I suspect that all arguments get loaded onto a single device. Does the module encode the device placement of its arguments in the function signature, so that iree-run-module can read that information and send each argument to its intended destination?
No, all arguments are loaded on the first device. Generally arguments should be small. Are you passing giant parameters as arguments?
Yes, the KV cache is an argument that is split across all devices and is pretty large: 12 GB per device.
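Given the answer above, the out-of-memory error is consistent with every KV-cache shard landing on device 0. A rough tally (a sketch using the figures reported in this issue, not measured allocations):

```python
hbm_per_gpu_gb = 192       # MI300X HBM capacity
weights_on_dev0_gb = 164   # reported weights resident per GPU
kv_shard_gb = 12           # reported KV-cache shard size
num_shards = 8             # one shard per GPU

# If iree-run-module parses all command-line inputs onto the first
# device, device 0 receives every KV-cache shard instead of only its own:
kv_on_dev0_gb = kv_shard_gb * num_shards
total_on_dev0_gb = weights_on_dev0_gb + kv_on_dev0_gb

print(f"device 0: {total_on_dev0_gb} GB requested vs {hbm_per_gpu_gb} GB HBM")
assert total_on_dev0_gb > hbm_per_gpu_gb  # consistent with hipErrorOutOfMemory
```

That puts roughly 260 GB of demand on a 192 GB device, which would explain the RESOURCE_EXHAUSTED error while parsing inputs.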
The way to do that today would be a harness module that performs your setup the way you want, instead of relying on command-line arguments. See https://github.com/google/iree/blob/16f937893cd4a8edf92bdbc4227a5ab0049373c3/samples/multiple_modules/README.md for an example.