[GPU] Avoid memory allocation for any node that can reuse previous memory
Details:
There is an existing strategy to reuse memory for some intermediate outputs. However, the reuse is only applied after memory has already been allocated for all intermediate outputs, so the peak memory is not actually reduced. This PR addresses that problem and reduces the memory footprint for some models, for example a 35% memory reduction for the SD1.5 vae_decoder model when generating 512*512 images.
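For illustration only (a toy tracker, not the plugin's actual allocator): when reuse is applied only after all intermediate outputs have been allocated, the temporary buffer still counts toward the peak; skipping the allocation up front is what actually lowers it.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>

// Toy memory tracker used only to illustrate the peak-memory argument above.
struct mem_tracker {
    std::size_t current = 0, peak = 0;
    void alloc(std::size_t bytes)   { current += bytes; peak = std::max(peak, current); }
    void release(std::size_t bytes) { current -= bytes; }
};

int main() {
    constexpr std::size_t buf = 512 * 512 * 4;  // hypothetical intermediate buffer size in bytes

    mem_tracker late_reuse;   // current behavior: reuse decided after the allocation
    late_reuse.alloc(buf);    // output that will later be replaced by the reused buffer
    late_reuse.alloc(buf);    // the buffer that gets reused
    late_reuse.release(buf);  // temporary released once reuse kicks in -> too late for the peak
    std::cout << "peak with late reuse:  " << late_reuse.peak << " bytes\n";   // 2 * buf

    mem_tracker early_reuse;  // this PR: the reusable node skips its allocation entirely
    early_reuse.alloc(buf);   // only the shared buffer is ever allocated
    std::cout << "peak with early reuse: " << early_reuse.peak << " bytes\n";  // 1 * buf
    return 0;
}
```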
Tickets:
build_jenkins
What peak memory are you aiming to reduce? I.e., if eltwise is fused to conv, currently we:
1) allocate the output of conv
2) then assign the eltwise buffer as conv's output memory and release the output memory allocated at step 1)
Is your problem about the temporary allocation at step 1)?
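To make the two steps concrete, a minimal sketch with placeholder types (not the actual cldnn/onednn interfaces):

```cpp
#include <memory>
#include <vector>

// Placeholder for a device buffer; only for illustrating the allocation order.
using buffer_ptr = std::shared_ptr<std::vector<float>>;

int main() {
    // Assume the eltwise (sum) operand's buffer already exists from an earlier node.
    buffer_ptr eltwise_buf = std::make_shared<std::vector<float>>(512 * 512);

    // step 1) allocate the output of conv as usual
    buffer_ptr conv_output = std::make_shared<std::vector<float>>(512 * 512);

    // step 2) the eltwise buffer becomes conv's output and the step-1 buffer is released;
    // by this point both buffers have already coexisted, which is the temporary
    // allocation in question.
    conv_output = eltwise_buf;  // the old conv_output is freed here
    return 0;
}
```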
@Kotomi-Du, how about the following implementation?
- Keep the existing order of allocations and memory reuse for the sum post-op
- Move the logic related to the onednn impls' node memory reuse check to a common place (e.g. to src/plugins/intel_gpu/src/graph/include/program_helpers.h) with a name like can_node_reuse_fused_eltwise_memory(const program_node& node)
- Update the typed_primitive_inst_base::do_allocate_memory() function to check whether the node can reuse memory, and return the corresponding value to skip the actual allocation if possible
- After the allocate_primitive_instance() call, reuse the eltwise's buffer as the new output memory (a rough sketch follows below)
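A rough sketch of how these pieces could fit together; the types, members, and the fused_eltwise_mem field below are simplified assumptions for illustration, not the actual intel_gpu code:

```cpp
#include <memory>

// Simplified stand-ins for the real intel_gpu types; illustration only.
struct memory {};
struct program_node {
    bool has_fused_sum_post_op = false;  // assumed to be known after the fusion passes
};

// Proposed common helper (e.g. in program_helpers.h): true when the node's output
// can alias the fused eltwise (sum) buffer instead of getting its own allocation.
inline bool can_node_reuse_fused_eltwise_memory(const program_node& node) {
    return node.has_fused_sum_post_op;
}

struct primitive_inst {
    const program_node& node;
    std::shared_ptr<memory> fused_eltwise_mem;  // buffer of the fused sum operand (assumed available)
    std::shared_ptr<memory> output_mem;

    // Analogue of typed_primitive_inst_base::do_allocate_memory(): skip the real
    // allocation when the buffer would be replaced anyway, so the temporary never
    // contributes to peak memory.
    std::shared_ptr<memory> do_allocate_memory() {
        if (can_node_reuse_fused_eltwise_memory(node))
            return nullptr;
        return std::make_shared<memory>();
    }

    // Analogue of the step after allocate_primitive_instance(): reuse the eltwise
    // buffer as the output when the allocation was skipped.
    void allocate() {
        output_mem = do_allocate_memory();
        if (!output_mem)
            output_mem = fused_eltwise_mem;
    }
};

int main() {
    program_node fused_node{true};
    primitive_inst inst{fused_node, std::make_shared<memory>(), nullptr};
    inst.allocate();  // no extra buffer is allocated; the eltwise buffer is reused
    return 0;
}
```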
It makes sense to me because it not only avoids allocating unnecessary memory, but also keeps the current logic in OpenVINO. But I am not sure whether it helps with the corner cases @isanghao mentioned?
BTW, @sshlyapn, can_node_reuse_fused_eltwise_memory() will be called in do_allocate_memory(), correct?
I think the problem with this approach (allocate the output first, then assign that output memory to its input) is a case like crop->reshape->conv, where we need to handle the opt-out chain, which is not very straightforward.
LGTM. @isanghao, should we run perf tests before merge?
@p-durandin We ran our smoke benchmark set for the important models. Actually this PR also requires running benchdnn for the static case, but we don't have a systolic instance (onepunch doesn't have one either).