iree Convert stack allocas to memories.

Jun 17 '24 17:06 lialan

So far I am hitting an issue during VM execution. Specifically, compare the dump of this minimal test before and after this PR:

Before

%7 = llvm.alloca %6 x f32 {alignment = 64 : i64} : (i64) -> !llvm.ptr
...
llvm.store %24, %7 : f32, !llvm.ptr
...
%33 = llvm.load %7 : !llvm.ptr -> f32
...
llvm.store %39, %7 : f32, !llvm.ptr

After

%7 = llvm.load %arg2 : !llvm.ptr -> !llvm.struct<"iree_hal_executable_workgroup_state_v0_t", (i32, i32, i16, i16, i32, ptr, i32)>
%8 = llvm.extractvalue %7[5] : !llvm.struct<"iree_hal_executable_workgroup_state_v0_t", (i32, i32, i16, i16, i32, ptr, i32)>
...
llvm.store %25, %8 : f32, !llvm.ptr
...
%34 = llvm.load %8 : !llvm.ptr -> f32
...
llvm.store %40, %8 : f32, !llvm.ptr

Notice that the alloca is replaced by the pointer loaded from the 6th field (index 5) of the workgroup state, so every load and store now targets the start of local memory: https://github.com/iree-org/iree/blob/9da0309b0491df57629a2177ab1dbec4aa73ae6e/runtime/src/iree/hal/local/executable_library.h#L346

According to the comments there, the local memory allocation may not exist (in which case the pointer is nullptr), or it may be smaller than we expect. That information needs to be queried at runtime.
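To make the concern concrete, here is a minimal C sketch of the kind of guard the generated code would need before using that pointer. The struct only mirrors the (i32, i32, i16, i16, i32, ptr, i32) layout from the dump above; the type name, the field names, and the reading of the last i32 as the local memory size in bytes are all assumptions for illustration, not the actual IREE definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the (i32, i32, i16, i16, i32, ptr, i32) layout.
 * Field names are placeholders, not the real IREE names. */
typedef struct {
  uint32_t field0;
  uint32_t field1;
  uint16_t field2;
  uint16_t field3;
  uint32_t field4;
  void *local_memory;         /* index 5: the pointer the stores now use */
  uint32_t local_memory_size; /* index 6: ASSUMED to be the size in bytes */
} example_workgroup_state_t;

/* Returns the local memory pointer only if it exists and holds at least
 * required_bytes; otherwise returns NULL so the caller can bail out. */
static void *example_get_scratch(const example_workgroup_state_t *state,
                                 uint32_t required_bytes) {
  assert(state != NULL);
  if (state->local_memory == NULL) return NULL;                /* not allocated */
  if (state->local_memory_size < required_bytes) return NULL;  /* too small */
  return state->local_memory;
}
```

Without a check like this, the first store in the "After" dump dereferences whatever is in field 5, including a null pointer.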

@benvanik question: is there a way to determine at compile time whether local memory will be allocated, and how large it will be? Specifically, this happens inside the ConvertToLLVM pass.

Jun 26 '24 16:06 lialan

You can use HALDispatchABI::loadWorkgroupLocalMemorySize to get the value at runtime within the dispatch function. The returned value is guaranteed to be at least the declared workgroup local memory requirement and is rounded up to 4096 pages. If an export properly declares the minimum required then it will be present at runtime (and otherwise we error out before executing the dispatch).
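As a rough illustration of the rounding mentioned above, the sketch below assumes "rounded up to 4096 pages" means the size is rounded up to a multiple of 4096 bytes. The macro and function names are hypothetical and this is not the runtime's actual code.

```c
#include <stdint.h>

/* ASSUMPTION: page granularity is 4096 bytes. */
#define EXAMPLE_PAGE_BYTES 4096u

/* Rounds bytes up to the next multiple of EXAMPLE_PAGE_BYTES.
 * Relies on the page size being a power of two; values within one page
 * of UINT32_MAX would wrap around and are not handled here. */
static uint32_t example_round_up_to_page(uint32_t bytes) {
  return (bytes + (EXAMPLE_PAGE_BYTES - 1u)) & ~(EXAMPLE_PAGE_BYTES - 1u);
}
```

Under that assumption, a dispatch declaring a 100-byte requirement would observe a size of 4096 at runtime, so the value read in the dispatch is never smaller than what was declared.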

--task_worker_local_memory= is the flag at runtime (in the tools) that sets the available memory for dispatches to use. By default it is the L2 (or fallback L1) data cache size and otherwise must be explicitly specified. For platforms where we can't query the cache sizes we should be guessing 128KB (which may be too large for some and we'll likely want to allow users to override it - some platforms can't afford to waste that much). Since it's never been used there are some big performance caveats around TODO work - the local-sync execution puts a malloc/free around every dispatch for the local memory, for example.
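The selection order described above can be sketched as follows. This is only an illustration of the stated policy (explicit flag value, then L2, then L1, then a 128KB guess); the function name, parameters, and the convention that 0 means "unknown" are assumptions, not the runtime's API.

```c
#include <stddef.h>

/* Fallback guess used when no cache size can be queried. */
#define EXAMPLE_FALLBACK_BYTES ((size_t)128 * 1024)

/* Picks the per-worker local memory size. A value of 0 means "not
 * specified" for user_override and "could not be queried" for the
 * cache sizes (a convention invented for this sketch). */
static size_t example_default_local_memory(size_t user_override,
                                           size_t l2_bytes,
                                           size_t l1_bytes) {
  if (user_override != 0) return user_override; /* explicit flag wins */
  if (l2_bytes != 0) return l2_bytes;           /* default: L2 data cache */
  if (l1_bytes != 0) return l1_bytes;           /* fallback: L1 data cache */
  return EXAMPLE_FALLBACK_BYTES;                /* last resort: 128KB */
}
```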

Jul 08 '24 23:07 benvanik

@benvanik Getting back to this PR:

I have done some investigation and I think the transformation in my comment above is correct on the LLVM side. But it hits a nullptr on the very first store, so I suspect the workgroup local memory is not allocated.

Specifying --task_worker_local_memory did not help. Did I miss anything?

Aug 06 '24 14:08 lialan

(closing as stale)

Apr 30 '25 00:04 benvanik