
[GPU] Avoid memory allocation for any node that can reuse previous memory

Kotomi-Du opened this issue 6 months ago • 5 comments

Details:

There is a strategy to reuse memory for some intermediate outputs. However, the reuse opportunity is only recognized after memory has already been allocated for all intermediate outputs, so the peak memory is not actually reduced. This PR addresses that problem and reduces the memory footprint for some models, for example a 35% memory reduction for the SD1.5 vae_decoder model when generating 512*512 images.
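To make the problem concrete, here is a minimal, self-contained C++ model of the two allocation orders. The node sizes and the can_reuse_input flag are invented for illustration and do not correspond to the plugin's real data structures:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Node {
    std::size_t output_bytes; // size of this node's intermediate output
    bool can_reuse_input;     // e.g. a fused sum post-op that can write into its input
};

int main() {
    // Invented sizes, just to make the two peaks differ.
    std::vector<Node> graph = {{4096, false}, {4096, true}, {2048, false}};

    // Current behavior: every output is allocated first; the reuse pass only
    // releases the redundant buffer afterwards, so the peak already includes it.
    std::size_t live = 0, peak_before = 0;
    for (const auto& n : graph) {
        live += n.output_bytes;
        peak_before = std::max(peak_before, live);
        if (n.can_reuse_input)
            live -= n.output_bytes; // released, but too late to lower the peak
    }

    // Proposed behavior: skip the allocation entirely when reuse is possible.
    live = 0;
    std::size_t peak_after = 0;
    for (const auto& n : graph) {
        if (!n.can_reuse_input)
            live += n.output_bytes;
        peak_after = std::max(peak_after, live);
    }

    std::cout << "peak before: " << peak_before   // 8192
              << ", peak after: " << peak_after   // 6144
              << '\n';
}
```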

Tickets:

Kotomi-Du avatar Jun 13 '25 01:06 Kotomi-Du

build_jenkins

p-durandin avatar Jun 16 '25 04:06 p-durandin

What peak memory are you aiming to reduce? I.e., if an eltwise is fused to a conv, currently:

  1. Allocate the output of the conv.
  2. Set the output of the conv as the eltwise memory and release the output memory allocated at step 1.

Is your problem about the temporary allocation at step 1?
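As a rough, self-contained illustration of this two-step sequence (Buffer is a hypothetical stand-in for the plugin's memory objects, not the real cldnn API):

```cpp
#include <memory>
#include <vector>

using Buffer = std::vector<char>; // hypothetical stand-in for a device buffer

int main() {
    // The eltwise input buffer already exists before the conv is allocated.
    auto eltwise_mem = std::make_shared<Buffer>(1 << 20);

    // Step 1: the conv output is allocated unconditionally.
    auto conv_out = std::make_shared<Buffer>(1 << 20);

    // Step 2: the conv output is redirected to the eltwise buffer and the
    // temporary from step 1 is released -- but both were live at once, so the
    // peak still counted the step-1 allocation.
    conv_out = eltwise_mem;
}
```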

yeonbok avatar Jun 16 '25 05:06 yeonbok

@Kotomi-Du, how about the following implementation?

  1. Keep the existing order of allocations and memory reuse for the sum post-op
  2. Move the logic for the onednn impls' node memory reuse check to a common place (e.g. src/plugins/intel_gpu/src/graph/include/program_helpers.h) with a name like can_node_reuse_fused_eltwise_memory(const program_node& node)
  3. Update the typed_primitive_inst_base::do_allocate_memory() function to check whether the node can reuse memory, and return a corresponding value so the actual allocation can be skipped
  4. After the allocate_primitive_instance() call, reuse the eltwise's buffer as the new output memory (see the sketch below)
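A rough sketch of how steps 2-4 could fit together. The types and signatures below are simplified stand-ins and do not match the real primitive_inst / program_node interfaces:

```cpp
#include <memory>

// Hypothetical stand-ins; the real types live in the intel_gpu plugin.
struct program_node { bool fused_sum_post_op = false; };
struct memory {};

// Step 2: shared helper (the proposal suggests program_helpers.h); the check
// here is a placeholder for the existing onednn-impl reuse condition.
bool can_node_reuse_fused_eltwise_memory(const program_node& node) {
    return node.fused_sum_post_op;
}

// Step 3: do_allocate_memory() returns no buffer when the node can reuse the
// fused eltwise's memory; step 4 then binds that buffer after
// allocate_primitive_instance().
std::shared_ptr<memory> do_allocate_memory(const program_node& node) {
    if (can_node_reuse_fused_eltwise_memory(node))
        return nullptr; // skip allocation; output is assigned later (step 4)
    return std::make_shared<memory>(); // placeholder for the real allocation
}

int main() {
    program_node n{true};
    auto mem = do_allocate_memory(n); // nullptr: allocation was skipped
    return mem == nullptr ? 0 : 1;
}
```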

sshlyapn avatar Jun 16 '25 08:06 sshlyapn

> @Kotomi-Du, how about the following implementation?
>
>   1. Keep the existing order of allocations and memory reuse for the sum post-op
>   2. Move the logic for the onednn impls' node memory reuse check to a common place (e.g. src/plugins/intel_gpu/src/graph/include/program_helpers.h) with a name like can_node_reuse_fused_eltwise_memory(const program_node& node)
>   3. Update the typed_primitive_inst_base::do_allocate_memory() function to check whether the node can reuse memory, and return a corresponding value so the actual allocation can be skipped
>   4. After the allocate_primitive_instance() call, reuse the eltwise's buffer as the new output memory

This makes sense to me: it not only avoids allocating unnecessary memory but also keeps the current logic in OpenVINO. However, I am not sure whether it helps with the corner cases @isanghao mentioned?

BTW, @sshlyapn, can_node_reuse_fused_eltwise_memory() will be called inside do_allocate_memory(), correct?

Kotomi-Du avatar Jun 16 '25 18:06 Kotomi-Du

> @Kotomi-Du, how about the following implementation?
>
>   1. Keep the existing order of allocations and memory reuse for the sum post-op
>   2. Move the logic for the onednn impls' node memory reuse check to a common place (e.g. src/plugins/intel_gpu/src/graph/include/program_helpers.h) with a name like can_node_reuse_fused_eltwise_memory(const program_node& node)
>   3. Update the typed_primitive_inst_base::do_allocate_memory() function to check whether the node can reuse memory, and return a corresponding value so the actual allocation can be skipped
>   4. After the allocate_primitive_instance() call, reuse the eltwise's buffer as the new output memory

I think the problem with this allocate-the-output-first, then assign-the-output-memory-to-its-input approach is a case like crop->reshape->conv, where we need to handle the optimized-out chain, which is not very straightforward.
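A minimal model of why the optimized-out chain matters; node and real_buffer_owner below are hypothetical stand-ins, not the plugin's actual types:

```cpp
#include <string>

// Crop and reshape are zero-copy views of their input, so the buffer the
// conv's fused eltwise would alias is really the crop's input; finding it
// requires walking back through the chain.
struct node {
    std::string name;
    node* input = nullptr;
    bool optimized_out = false; // shares its input's buffer (crop, reshape, ...)
};

// Walk past optimized-out nodes to find which node actually owns the buffer.
const node* real_buffer_owner(const node* n) {
    while (n->optimized_out && n->input)
        n = n->input;
    return n;
}

int main() {
    node src{"src"};
    node crop{"crop", &src, true};
    node reshape{"reshape", &crop, true};
    node conv{"conv", &reshape, false};
    // The conv's reusable buffer belongs to src, not to reshape.
    return real_buffer_owner(conv.input)->name == "src" ? 0 : 1;
}
```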

yeonbok avatar Jun 16 '25 22:06 yeonbok

build_jenkins

p-durandin avatar Jun 20 '25 06:06 p-durandin

build_jenkins

p-durandin avatar Jun 24 '25 05:06 p-durandin

build_jenkins

Kotomi-Du avatar Jun 24 '25 18:06 Kotomi-Du

build_jenkins

p-durandin avatar Jun 25 '25 05:06 p-durandin

build_jenkins

yeonbok avatar Jul 02 '25 18:07 yeonbok

build_jenkins

yeonbok avatar Jul 02 '25 21:07 yeonbok

LGTM. @isanghao, should we run perf tests before merging?

@p-durandin We ran our smoke benchmark set for the important models. Actually, this PR also requires running benchdnn for the static case, but we don't have a systolic instance (onepunch doesn't have one either).

yeonbok avatar Jul 03 '25 17:07 yeonbok