[Runtime] Bug in tracing execution of module for allocations larger than 4GiB.
What happened?
Hi. I'm trying to compile and run a Phi model to analyze memory allocation (constants, transients, etc.).
While using iree-run-module, I encountered what seems to be a bug in how it traces some commands with the flag --trace_execution=true.
The final IR of the compilation can be found here. Line 2399 of the IR contains the call that allocates device memory for the constants (5259886592 B = ~4.9 GiB). The module tries two ways to allocate that memory inside a conditional, and this is the allocation that is actually executed (as the execution trace shows later).
In line 69 of the execution output, we can see the corresponding call for the device allocation mentioned before. However, the allocation size has changed from 5259886592 to 964919296.
I confirmed in two ways that 5259886592 is the correct number: the composite attribute at the top of the IR gives this size, and a CUDA Nsight report shows that it is the size actually allocated.
I believe the tracer incorrectly assumes the argument (allocation size) is stored in a single 32-bit register.
On line 69 of the execution output, where the allocation request appears, register %i4 holds the allocation size. However, on line 7 this register is shown as part of a pair of registers, %i4 and %i5, for a total of 64 bits. Combined, these registers contain the correct value. 5259886592 needs 33 bits; truncated to the 32 bits of a single register, it becomes 964919296, which is why I think the tracer parses the argument incorrectly.
5259886592 (0x139838000) → lower 32 bits = 0x39838000 = 964919296
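The arithmetic above can be checked with a short sketch. It assumes the 64-bit size is split across the pair with the low word in %i4 and the high word in %i5, which matches the truncated value seen in the trace. The helper names `split_i64` and `join_i32_pair` are hypothetical and are not part of IREE.

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits of an integer


def split_i64(value):
    """Split a 64-bit value into (low, high) 32-bit words."""
    return value & MASK32, (value >> 32) & MASK32


def join_i32_pair(low, high):
    """Rebuild the 64-bit value from its low and high 32-bit words."""
    return (high << 32) | low


size = 5259886592  # allocation size from the IR
low, high = split_i64(size)  # assumed layout: low -> %i4, high -> %i5

print(hex(size), size.bit_length())      # width of the full value
print(low, high)                         # what each register would hold
print(join_i32_pair(low, high) == size)  # the pair round-trips to the size
```

Under that assumption, the low word alone is the 964919296 printed by the tracer, and joining it with the high word recovers the size from the IR.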
Steps to reproduce your issue
I downloaded the model and converted it to ONNX via this script.
Then I imported it to MLIR via iree-import-onnx.
Then I compiled it using the following command:
iree-compile --iree-pretty-debuginfo \
--iree-disable-threading \
--iree-print-ir-after=iree-vm-ordinal-allocation \
--iree-print-ir-module-scope \
--iree-hal-target-device=cuda \
--iree-cuda-target=cuda[0] \
--iree-cuda-target=sm_70 \
model.mlir -o model.vmfb > final-ir.mlir 2>&1
And ran it using the following command:
iree-run-module --trace_execution=true \
--print_statistics=true \
--device=cuda \
--module=model.vmfb \
--function=main_graph \
--input="2x512xi64=3" \
--input="2x512xi64=1" > execution-output.txt 2>&1
What component(s) does this issue relate to?
Runtime
Version information
I am on commit 33f5b3178dda4322464e0d1e3ff1ac12833c9d7a.
Additional context
No response
This is a quirk of the disassembler: vm.call only prints the base register of a 2xi32 pair, so the trace shows just the low 32 bits. If you set a breakpoint in iree_hal_module_allocator_allocate the value should be correct.
Thanks, this workaround works for me.