[Metal] deferred codegen does not lower thread id intrinsics
We currently lower our fake intrinsics for thread id etc. (add_input_arguments!) in finish_module, and only when the job is a kernel (job.config.kernel = true). This all happens before deferred codegen jobs are compiled and their modules are linked in.
Now, if our kernel contains deferred codegen jobs that are not kernels (job.config.kernel = false), for example when using Enzyme, we end up with julia.air.thread_position_in_grid.i32 in our main kernel after linking. That call is never removed: the kernel's finish_module has already run by then, and the deferred code is not a kernel, so it never ran add_input_arguments! during its own finish_module.
This PR solves the issue by calling add_input_arguments! in finish_ir, which is called after linking.
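To make the ordering problem concrete, here is a toy model in Python. It is not GPUCompiler code: modules are plain dicts of instruction strings, and the function names only mirror the ones used in this description. The `gradient` function and the `lower_in_finish_ir` switch are hypothetical, standing in for an Enzyme-style deferred job and for the change in this PR.

```python
INTRINSIC = "julia.air.thread_position_in_grid.i32"


def add_input_arguments(module):
    """Toy lowering: turn every call to the placeholder intrinsic into an argument read."""
    return {
        name: [ins.replace(f"call {INTRINSIC}", "arg thread_position_in_grid") for ins in body]
        for name, body in module.items()
    }


def finish_module(module, is_kernel):
    # Only kernel jobs lower the placeholder intrinsics at this stage.
    return add_input_arguments(module) if is_kernel else module


def link(parent, deferred):
    # In this model, linking just merges the two function tables.
    return {**parent, **deferred}


def finish_ir(module, lower_here):
    # Runs after linking; lower_here=True models the change proposed in this PR.
    return add_input_arguments(module) if lower_here else module


def compile_kernel(lower_in_finish_ir):
    # Hypothetical kernel that calls into a deferred, non-kernel job.
    kernel = {"kernel": [f"call {INTRINSIC}", "call gradient"]}
    deferred = {"gradient": [f"call {INTRINSIC}", "ret"]}
    kernel = finish_module(kernel, is_kernel=True)
    deferred = finish_module(deferred, is_kernel=False)
    linked = link(kernel, deferred)
    return finish_ir(linked, lower_in_finish_ir)


def leftover_intrinsics(module):
    """Names of functions that still contain the placeholder intrinsic."""
    return sorted(
        name for name, body in module.items() if any(INTRINSIC in ins for ins in body)
    )


print(leftover_intrinsics(compile_kernel(lower_in_finish_ir=False)))
print(leftover_intrinsics(compile_kernel(lower_in_finish_ir=True)))
```

With lowering only in `finish_module`, the deferred function still carries the intrinsic after linking. Lowering again in `finish_ir` clears it, at the cost of running the pass twice for the kernel.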
cc @wsmoses
Other Solutions / Discussion
- Have Enzyme create deferred codegen jobs for kernels with the kernel flag set to true
- Run add_input_arguments during IR post-processing
I don't like the pass being invoked twice. Only calling it in finish_ir! may regress code quality though, as a position index coming from an argument is quite different from one coming from a function call (i.e., in a way that would affect optimization). Apparently that is already the case for Enzyme.jl-generated code.
As an in-between solution, we could call add_input_arguments for non-kernels in finish_module.
Alternatively, we could conditionally perform add_input_arguments after linking if the parent is a kernel.
Hopefully the PR series #582, #633, and #634 will also fix this.