
[GPU] Avoid memory allocation for any node that can reuse previous memory

Kotomi-Du opened this issue 6 months ago • 5 comments

Details:

There is a strategy to reuse memory for some intermediate outputs. However, the reuse opportunity is only recognized after memory has already been allocated for all intermediate outputs, so the peak memory is not actually reduced. This PR addresses that problem and reduces the memory footprint for some models, for example a 35% memory reduction for the SD1.5 vae_decoder model when generating 512*512 images.
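To make the problem concrete, here is a minimal, self-contained C++ model of the two allocation orders. The node sizes and the can_reuse_input flag are invented for illustration and do not correspond to the plugin's real data structures:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Node {
    std::size_t output_bytes; // size of this node's intermediate output
    bool can_reuse_input;     // e.g. a fused sum post-op that can write into its input
};

int main() {
    // Invented sizes, just to make the two peaks differ.
    std::vector<Node> graph = {{4096, false}, {4096, true}, {2048, false}};

    // Current behavior: every output is allocated first; the reuse pass only
    // releases the redundant buffer afterwards, so the peak already includes it.
    std::size_t live = 0, peak_before = 0;
    for (const auto& n : graph) {
        live += n.output_bytes;
        peak_before = std::max(peak_before, live);
        if (n.can_reuse_input)
            live -= n.output_bytes; // released, but too late to lower the peak
    }

    // Proposed behavior: skip the allocation entirely when reuse is possible.
    live = 0;
    std::size_t peak_after = 0;
    for (const auto& n : graph) {
        if (!n.can_reuse_input)
            live += n.output_bytes;
        peak_after = std::max(peak_after, live);
    }

    std::cout << "peak before: " << peak_before   // 8192
              << ", peak after: " << peak_after   // 6144
              << '\n';
}
```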

Tickets:

Kotomi-Du avatar Jun 13 '25 01:06 Kotomi-Du

build_jenkins

p-durandin avatar Jun 16 '25 04:06 p-durandin

What peak memory are you aiming to reduce? I.e., if an eltwise is fused to a conv, currently:

  1. Allocate the output of the conv.
  2. Set the output of the conv as the eltwise memory and release the output memory allocated at step 1.

Is your problem about the temporary allocation at step 1?
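As a rough, self-contained illustration of this two-step sequence (Buffer is a hypothetical stand-in for the plugin's memory objects, not the real cldnn API):

```cpp
#include <memory>
#include <vector>

using Buffer = std::vector<char>; // hypothetical stand-in for a device buffer

int main() {
    // The eltwise input buffer already exists before the conv is allocated.
    auto eltwise_mem = std::make_shared<Buffer>(1 << 20);

    // Step 1: the conv output is allocated unconditionally.
    auto conv_out = std::make_shared<Buffer>(1 << 20);

    // Step 2: the conv output is redirected to the eltwise buffer and the
    // temporary from step 1 is released -- but both were live at once, so the
    // peak still counted the step-1 allocation.
    conv_out = eltwise_mem;
}
```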

yeonbok avatar Jun 16 '25 05:06 yeonbok

@Kotomi-Du, how about the following implementation?

  1. Keep the existing order of allocations and memory reuse for the sum post-op
  2. Move the logic for the onednn impls' node memory reuse check to a common place (e.g. src/plugins/intel_gpu/src/graph/include/program_helpers.h) with a name like can_node_reuse_fused_eltwise_memory(const program_node& node)
  3. Update the typed_primitive_inst_base::do_allocate_memory() function to check whether the node can reuse memory, and return a corresponding value so the actual allocation can be skipped
  4. After the allocate_primitive_instance() call, reuse the eltwise's buffer as the new output memory (see the sketch below)
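A rough sketch of how steps 2-4 could fit together. The types and signatures below are simplified stand-ins and do not match the real primitive_inst / program_node interfaces:

```cpp
#include <memory>

// Hypothetical stand-ins; the real types live in the intel_gpu plugin.
struct program_node { bool fused_sum_post_op = false; };
struct memory {};

// Step 2: shared helper (the proposal suggests program_helpers.h); the check
// here is a placeholder for the existing onednn-impl reuse condition.
bool can_node_reuse_fused_eltwise_memory(const program_node& node) {
    return node.fused_sum_post_op;
}

// Step 3: do_allocate_memory() returns no buffer when the node can reuse the
// fused eltwise's memory; step 4 then binds that buffer after
// allocate_primitive_instance().
std::shared_ptr<memory> do_allocate_memory(const program_node& node) {
    if (can_node_reuse_fused_eltwise_memory(node))
        return nullptr; // skip allocation; output is assigned later (step 4)
    return std::make_shared<memory>(); // placeholder for the real allocation
}

int main() {
    program_node n{true};
    auto mem = do_allocate_memory(n); // nullptr: allocation was skipped
    return mem == nullptr ? 0 : 1;
}
```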

sshlyapn avatar Jun 16 '25 08:06 sshlyapn

> @Kotomi-Du, how about the following implementation?
>
>   1. Keep the existing order of allocations and memory reuse for the sum post-op
>   2. Move the logic for the onednn impls' node memory reuse check to a common place (e.g. src/plugins/intel_gpu/src/graph/include/program_helpers.h) with a name like can_node_reuse_fused_eltwise_memory(const program_node& node)
>   3. Update the typed_primitive_inst_base::do_allocate_memory() function to check whether the node can reuse memory, and return a corresponding value so the actual allocation can be skipped
>   4. After the allocate_primitive_instance() call, reuse the eltwise's buffer as the new output memory

This makes sense to me: it not only avoids allocating unnecessary memory but also keeps the current logic in OpenVINO. However, I am not sure whether it helps with the corner cases @isanghao mentioned?

BTW, @sshlyapn, can_node_reuse_fused_eltwise_memory() will be called inside do_allocate_memory(), correct?

Kotomi-Du avatar Jun 16 '25 18:06 Kotomi-Du

> @Kotomi-Du, how about the following implementation?
>
>   1. Keep the existing order of allocations and memory reuse for the sum post-op
>   2. Move the logic for the onednn impls' node memory reuse check to a common place (e.g. src/plugins/intel_gpu/src/graph/include/program_helpers.h) with a name like can_node_reuse_fused_eltwise_memory(const program_node& node)
>   3. Update the typed_primitive_inst_base::do_allocate_memory() function to check whether the node can reuse memory, and return a corresponding value so the actual allocation can be skipped
>   4. After the allocate_primitive_instance() call, reuse the eltwise's buffer as the new output memory

I think the problem with this allocate-the-output-first, then assign-the-output-memory-to-its-input approach is a case like crop->reshape->conv, where we need to handle the optimized-out chain, which is not very straightforward.
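A minimal model of why the optimized-out chain matters; node and real_buffer_owner below are hypothetical stand-ins, not the plugin's actual types:

```cpp
#include <string>

// Crop and reshape are zero-copy views of their input, so the buffer the
// conv's fused eltwise would alias is really the crop's input; finding it
// requires walking back through the chain.
struct node {
    std::string name;
    node* input = nullptr;
    bool optimized_out = false; // shares its input's buffer (crop, reshape, ...)
};

// Walk past optimized-out nodes to find which node actually owns the buffer.
const node* real_buffer_owner(const node* n) {
    while (n->optimized_out && n->input)
        n = n->input;
    return n;
}

int main() {
    node src{"src"};
    node crop{"crop", &src, true};
    node reshape{"reshape", &crop, true};
    node conv{"conv", &reshape, false};
    // The conv's reusable buffer belongs to src, not to reshape.
    return real_buffer_owner(conv.input)->name == "src" ? 0 : 1;
}
```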

yeonbok avatar Jun 16 '25 22:06 yeonbok

build_jenkins

p-durandin avatar Jun 20 '25 06:06 p-durandin

build_jenkins

p-durandin avatar Jun 24 '25 05:06 p-durandin

build_jenkins

Kotomi-Du avatar Jun 24 '25 18:06 Kotomi-Du

build_jenkins

p-durandin avatar Jun 25 '25 05:06 p-durandin

build_jenkins

yeonbok avatar Jul 02 '25 18:07 yeonbok

build_jenkins

yeonbok avatar Jul 02 '25 21:07 yeonbok

LGTM. @isanghao, should we run perf tests before merging?

@p-durandin We ran our smoke benchmark set for the important models. Actually, this PR also requires running benchdnn for the static case, but we don't have a systolic instance (onepunch doesn't have one either).

yeonbok avatar Jul 03 '25 17:07 yeonbok