
iree-run-module out-of-memory when loading args for pipeline parallel Deepseek

Open sogartar opened this issue 7 months ago • 5 comments

What happened?

It seems that iree-run-module is running out of memory when trying to load function arguments.

c/runtime/src/iree/hal/drivers/hip/hip_allocator.c:507: RESOURCE_EXHAUSTED; HIP driver error 'hipErrorOutOfMemory' (2): out of memory; parsing input `@/home/bpetkant/ws/sharktank/experiments/deepseek/tracy-tracing/gen/inputs/prefill/arg3+.npy`; parsing function inputs

This happens on an MI300X, which has 192 GB of memory per GPU. The model is split into 8 pieces across 8 GPUs. The model parameters total 1250 GB. At most we are loading 8 of the 61 layers on a single GPU, which gives us ~164 GB of weights per GPU. The KV cache is 12 GB per GPU. We should have enough memory to load the model and its arguments.
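The per-GPU budget described above can be sanity-checked with quick arithmetic (a sketch assuming layers are distributed as ceil(61 / 8) = 8 per GPU, which matches the quoted ~164 GB; all figures are taken from the paragraph above):

```python
import math

total_weights_gb = 1250   # total model parameters, from the report above
num_layers = 61
num_gpus = 8
hbm_per_gpu_gb = 192      # MI300X HBM capacity
kv_cache_per_gpu_gb = 12

# Worst-case layers on one GPU under an even pipeline split.
layers_per_gpu = math.ceil(num_layers / num_gpus)                     # 8
weights_per_gpu_gb = layers_per_gpu * total_weights_gb / num_layers   # ~164 GB
budget_gb = weights_per_gpu_gb + kv_cache_per_gpu_gb                  # ~176 GB

print(f"{weights_per_gpu_gb:.0f} GB weights + {kv_cache_per_gpu_gb} GB KV cache "
      f"= {budget_gb:.0f} GB of {hbm_per_gpu_gb} GB HBM")
```

So the expected footprint is ~176 GB of the 192 GB available, leaving ~16 GB of headroom per GPU, which is why the OOM is surprising.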

Steps to reproduce your issue

  1. Download and extract deepseek-pp-oom.zip.
  2. Run:

         ./compile.sh
         python gen_prefill_inputs.py
         ./run-prefill.sh
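For context, `iree-run-module` consumes `.npy` files passed as `--input=@<path>.npy`. A minimal, hypothetical stand-in for `gen_prefill_inputs.py` might look like the following; the shapes, dtypes, and file names here are illustrative only, and the real ones come from the model's prefill signature in the zip above:

```python
# Hypothetical input generator: writes .npy files that iree-run-module
# can consume via --input=@<path>.npy. Shapes and dtypes are placeholders,
# NOT the real Deepseek prefill signature.
import numpy as np

batch, seq_len = 1, 128
tokens = np.zeros((batch, seq_len), dtype=np.int64)   # token ids
seq_lens = np.array([seq_len], dtype=np.int64)        # per-sequence lengths

np.save("arg0.npy", tokens)
np.save("arg1.npy", seq_lens)
```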

To run the model you will need the model weights, which can be found on the internal AMD machine sharkmi300x-3 at /shark-dev/weights/deepseek_v3/fp16/deepseek_v3_f16.irpa.

What component(s) does this issue relate to?

No response

Version information

3.5.0rc20250529

Additional context

No response

sogartar avatar May 30 '25 09:05 sogartar

I don't think anyone is going to try to reproduce this given the steps provided. In cases like these you will need to capture a Tracy trace and post it. Also, it is not iree-run-module that is running out of memory; it is your program, and the fix will likely lie there. We can provide assistance, but we can't fix your program ourselves.

benvanik avatar May 30 '25 14:05 benvanik

I suspect that all arguments get loaded onto a single device. Does the module encode the device placement of its arguments in its function signature, so that iree-run-module can read that information and send each argument to its intended device?

sogartar avatar May 30 '25 15:05 sogartar

No, all arguments are loaded on the first device. Generally arguments should be small. Are you passing giant parameters as arguments?

benvanik avatar May 30 '25 15:05 benvanik

Yes, the KV cache is an argument that is split across all devices and is pretty large: 12 GB per device.

sogartar avatar May 30 '25 15:05 sogartar

The way to do that today is to have a harness module that does your setup the way you want, instead of relying on command-line arguments. See https://github.com/google/iree/blob/16f937893cd4a8edf92bdbc4227a5ab0049373c3/samples/multiple_modules/README.md for an example.
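A host-side harness could also be written with the IREE Python bindings. The sketch below only illustrates the idea of placing each KV-cache shard on its own device before invoking prefill; the device URIs, module name, function name, and exact binding API are assumptions and may differ across IREE releases:

```python
# Hypothetical harness: copy each per-GPU argument to the device that will
# consume it, instead of letting iree-run-module load everything on device 0.
# All URIs, paths, and function names are placeholders.
try:
    import iree.runtime as ireert
except ImportError:
    ireert = None  # bindings not installed; this remains a sketch


def run_sharded_prefill(vmfb_path, kv_cache_shards):
    """kv_cache_shards: one numpy array per GPU."""
    if ireert is None:
        raise RuntimeError("iree.runtime is not available")
    # One logical device per GPU; "hip://N" URIs are an assumption.
    devices = [ireert.get_device(f"hip://{i}")
               for i in range(len(kv_cache_shards))]
    config = ireert.Config(device=devices[0])  # kwarg may vary by release
    ctx = ireert.SystemContext(config=config)
    ctx.add_vm_module(ireert.VmModule.mmap(config.vm_instance, vmfb_path))
    # Place each KV-cache shard on its own device before the call.
    args = [ireert.asdevicearray(dev, shard)
            for dev, shard in zip(devices, kv_cache_shards)]
    return ctx.modules.module.prefill(*args)  # "module"/"prefill" assumed
```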

benvanik avatar May 30 '25 15:05 benvanik