GPUCompiler.jl [Metal] deferred codegen does not lower thread id intrinsics

We currently handle our fake intrinsics for thread id etc. (add_input_arguments!) in finish_module, and only if we are dealing with a kernel (job.config.kernel = true). This all happens before the deferred codegen jobs are compiled and their modules are linked in.

Now, if our kernel contains deferred codegen jobs that are not kernels (job.config.kernel = false), as happens when using Enzyme for example, we end up with julia.air.thread_position_in_grid.i32 in our main kernel after linking. That call will not be removed, since finish_module has already run on the kernel by then, and the deferred code, not being a kernel, never ran add_input_arguments! during its own finish_module.

This PR solves this issue by calling add_input_arguments! in finish_ir!, which is called after linking.
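To make the ordering problem concrete, here is a minimal toy sketch in Julia. It does not use GPUCompiler's real API: `ToyJob`, `lower_intrinsics`, `toy_finish_module`, and `toy_link` are hypothetical stand-ins, and a "module" is modeled as a plain list of instruction names. It only illustrates why a lowering step that runs per job before linking misses intrinsics contributed by non-kernel deferred jobs, and why running it again after linking catches them.

```julia
# Toy model only: none of these names exist in GPUCompiler.jl.
const INTRINSIC = "julia.air.thread_position_in_grid.i32"

struct ToyJob
    kernel::Bool            # mirrors the role of job.config.kernel
    body::Vector{String}    # a "module" is just a list of instruction names
end

# Stand-in for add_input_arguments!: rewrite the placeholder intrinsic
# into a read of a kernel argument.
lower_intrinsics(body) = replace(body, INTRINSIC => "arg.thread_position_in_grid")

# Stand-in for finish_module: only kernels get their intrinsics lowered.
toy_finish_module(job::ToyJob) = job.kernel ? lower_intrinsics(job.body) : copy(job.body)

# Stand-in for linking a deferred module into its parent.
toy_link(parent, deferred) = vcat(parent, deferred)

kernel   = ToyJob(true,  [INTRINSIC, "call deferred"])
deferred = ToyJob(false, [INTRINSIC, "ret"])

# Current order: finish each module on its own, then link.
before = toy_link(toy_finish_module(kernel), toy_finish_module(deferred))

# Order proposed here: lower once more after linking.
after = lower_intrinsics(before)

println(INTRINSIC in before)  # true: the deferred job's intrinsic survived
println(INTRINSIC in after)   # false: lowering after linking removes it
```

In the toy model the second lowering is a no-op for the parts that were already lowered, which is why running the pass in both places is safe here; whether that also holds for the real pass is an assumption this sketch does not verify.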

cc @wsmoses

Other Solutions / Discussion

Have Enzyme create deferred codegen jobs for kernels with the kernel flag set to true
Run add_input_arguments! during IR post-processing

Sep 03 '24 11:09 tgymnich

I don't like the pass being invoked twice. Only calling it in finish_ir! may regress code quality though, as a position index coming from an argument is quite different from one coming from a function call (i.e., in a way that would affect optimization). Apparently that is already the case for Enzyme.jl-generated code.

Sep 03 '24 14:09 maleadt

As an in-between solution, we could call add_input_arguments! for non-kernels in finish_module.

Alternatively, we could conditionally perform add_input_arguments! after linking, only if the parent is a kernel.

Sep 03 '24 14:09 tgymnich

Hopefully the PR series #582, #633, and #634 will also fix this.

Sep 26 '24 14:09 vchuravy