
Heap storage for big temporaries

Open wence- opened this issue 4 years ago • 6 comments

Sometimes, when doing very high order things, we need big (multi-megabyte) temporaries. They are quite long-lived so I'd be happy if there were a way to control allocation of the temporaries on the heap rather than the stack. I guess this might well be a target-specific thing, so for now CTarget only would be fine.

Where's a good place to look, or am I doing it all wrong?

wence- avatar Aug 19 '21 18:08 wence-

Some possibilities:

  1. Read-only temporaries in lp.AddressSpace.GLOBAL address space are emitted as static arrays in CTarget.
  2. In the PyOpenCLTarget, a non-const temporary is allocated this way by setting its address space to lp.AddressSpace.GLOBAL. The host code manages the allocations for such temporaries, and the final generated device code receives them as kernel arguments. We could extend CTarget to emit the correct host code to manage these allocations.

IMO extending the loopy IR to include malloc/free semantics is a slightly longer route.

kaushikcfd avatar Aug 19 '21 18:08 kaushikcfd

Agree with @kaushikcfd here: Hacking the CTarget to do the same thing as the CL one would be a way to achieve this. CL also gives control of the allocator, to permit use of a memory pool if/when allocation becomes a bottleneck.

inducer avatar Aug 19 '21 19:08 inducer

If I take the first option, the codegen complains because the device program's AST is no longer a cgen.FunctionBody. I applied this disgusting patch because I don't know how to map over cgen to collect all the right things:

diff --git a/loopy/codegen/__init__.py b/loopy/codegen/__init__.py
index bb292269..429d3132 100644
--- a/loopy/codegen/__init__.py
+++ b/loopy/codegen/__init__.py
@@ -25,6 +25,8 @@ logger = logging.getLogger(__name__)
 
 import islpy as isl
 
+import cgen
+
 from loopy.diagnostic import LoopyError, warn
 from pytools import ImmutableRecord
 
@@ -782,7 +784,12 @@ def generate_code_v2(program):
             implemented_data_infos[func_id] = cgr.implemented_data_info
         else:
             assert len(cgr.device_programs) == 1
-            callee_fdecls.append(cgr.device_programs[0].ast.fdecl)
+            dp, = cgr.device_programs
+            if isinstance(dp.ast, cgen.Collection):
+                fbody = dp.ast.contents[-1]
+            else:
+                fbody = dp.ast
+            callee_fdecls.append(fbody.fdecl)
 
         device_programs.extend(cgr.device_programs)
         device_preambles.extend(cgr.device_preambles)
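
One possible way to tidy the AST handling in the patch above is a small recursive walk rather than taking the last element of the collection. The sketch below is only an illustration: `Collection` and `FunctionBody` are hypothetical stand-ins that mimic the `contents` and `fdecl` attributes used in the patch, so that it runs without cgen installed. `find_function_bodies` is likewise an invented helper name, not part of cgen or loopy.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for cgen.Collection and cgen.FunctionBody.
@dataclass
class Collection:
    contents: list = field(default_factory=list)


@dataclass
class FunctionBody:
    fdecl: str
    body: str = ""


def find_function_bodies(ast):
    """Yield every FunctionBody in ast, descending into nested Collections."""
    if isinstance(ast, FunctionBody):
        yield ast
    elif isinstance(ast, Collection):
        for child in ast.contents:
            yield from find_function_bodies(child)
    # Any other node (e.g. a static array declaration) is skipped.


# A device program whose AST is a static array followed by the kernel.
ast = Collection(["static double tmp[1024];",
                  FunctionBody("void knl(double *out)")])

# Tuple unpacking fails unless exactly one function body is found.
fbody, = find_function_bodies(ast)
```

Unlike `contents[-1]`, the walk does not rely on the function body being the last item, and the single-element unpacking fails loudly if the assumption of exactly one function per device program does not hold.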

wence- avatar Sep 09 '21 15:09 wence-

My main objection to the static arrays would be that they fall apart if their size is parameter-dependent, which IMO is a fairly common use case. (The AST handling from @wence-'s patch could be cleaned up, but...) If repeated mallocs are a cost concern, I think memory pools convincingly get around that. Thoughts?

inducer avatar Sep 23 '21 00:09 inducer

I don't really like static arrays either, but unfortunately I don't understand the backend targets sufficiently to know how to start implementing the alternative approach.

wence- avatar Sep 23 '21 09:09 wence-

The main place to look is this bit of code in the PyOpenCL Python Target:

https://github.com/inducer/loopy/blob/e3431a17bcbcdf9a37b4812bf997729536b232ee/loopy/target/pyopencl.py#L658-L828

(That generates the host code that actually allocates and passes the global temporaries.) An important thing to realize is that pyopencl execution actually uses two targets, a Python one for the host code, and a C-family one for the device code. On the "plain C" side, both tend to be the same, but they don't have to be.
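
As a rough picture of that host/device split, the sketch below uses plain Python with no loopy or PyOpenCL involved. `host_wrapper` and `device_kernel` are hypothetical names, and the arithmetic is arbitrary. The point is that the host side sizes the temporary from a run-time parameter, allocates it, and passes it to the device side as an ordinary argument, which is what a static array cannot do when the size depends on `n`.

```python
import array


def device_kernel(n, out, tmp):
    """Stand-in for generated device code; tmp arrives as an argument."""
    for i in range(n):
        tmp[i] = 2.0 * i
    for i in range(n):
        out[i] = tmp[i] + 1.0


def host_wrapper(n):
    """Stand-in for generated host code: allocate, invoke, return."""
    out = array.array("d", [0.0]) * n
    tmp = array.array("d", [0.0]) * n  # size depends on the parameter n
    device_kernel(n, out, tmp)
    return out


result = host_wrapper(4)
```

An allocator argument could be threaded through `host_wrapper` in the same way if allocation cost ever mattered.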

inducer avatar Sep 24 '21 00:09 inducer