Local names linking
Overview
This PR overhauls the way linking works in Julia, both in the JIT and AOT. The point is to enable us to generate LLVM IR that depends only on the source IR, eliminating both nondeterminism and the effect of redefining methods in the same session. This serves two purposes. First, if the IR is predictable, we can cache the compilation by using the bitcode hash as a key, like how the ThinLTO cache works. #58592 was an early experiment along these lines. Second, we can reuse work that was done in a previous session, like pkgimages, but for the JIT.
We accomplish this by generating names that are unique only within the current
LLVM module, removing most uses of the globalUniqueGeneratedNames counter.
The replacement for jl_codegen_params_t, jl_codegen_output_t, represents a
Julia "translation unit", and tracks the information we'll need to link the
compiled module into the running session. When linking, we manipulate the
JITLink LinkGraph (after compilation) instead of renaming
functions in the LLVM IR (before).
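The module-local naming scheme can be illustrated with a small sketch. This is not the code from this PR: `LocalNamer` and `fresh` are hypothetical names, and a set of strings stands in for an LLVM module's symbol table. It only shows how choosing the lowest unused integer suffix within one module yields names that depend on what was emitted into that module and on nothing else in the session.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the symbol names of a single LLVM module.
class LocalNamer {
    std::unordered_set<std::string> used;                 // names taken in this module
    std::unordered_map<std::string, unsigned> next_hint;  // next suffix to try, per prefix
public:
    // Return prefix + the lowest integer suffix not yet used in this module.
    std::string fresh(const std::string &prefix) {
        unsigned &n = next_hint[prefix];  // starts at 0 for a new prefix
        std::string name;
        do {
            name = prefix + std::to_string(n++);
        } while (!used.insert(name).second);  // skip names that are already taken
        return name;
    }
};
```

Because every module starts from an empty `LocalNamer`, two modules emitted from the same source IR get the same names regardless of how many methods were compiled earlier in the session.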
Example
julia> @noinline foo(x) = x + 2.0
baz(x) = foo(foo(x))
code_llvm(baz, (Int64,); dump_module=true, optimize=false)
Nightly:
[...]
@"+Core.Float64#774" = private unnamed_addr constant ptr @"+Core.Float64#774.jit"
@"+Core.Float64#774.jit" = private alias ptr, inttoptr (i64 4797624416 to ptr)
; Function Signature: baz(Int64)
; @ REPL[1]:2 within `baz`
define double @julia_baz_772(i64 signext %"x::Int64") #0 {
top:
%pgcstack = call ptr @julia.get_pgcstack()
%0 = call double @j_foo_775(i64 signext %"x::Int64")
%1 = call double @j_foo_776(double %0)
ret double %1
}
; Function Attrs: noinline optnone
define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
top:
%pgcstack = call ptr @julia.get_pgcstack()
%0 = getelementptr inbounds i8, ptr %"args::Any[]", i32 0
%1 = load ptr, ptr %0, align 8
%.unbox = load i64, ptr %1, align 8
%2 = call double @julia_baz_772(i64 signext %.unbox)
%"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8
%Float64 = ptrtoint ptr %"+Core.Float64#774" to i64
%3 = inttoptr i64 %Float64 to ptr
%current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152
%"box::Float64" = call noalias nonnull align 8 dereferenceable(8) ptr @julia.gc_alloc_obj(ptr %current_task, i64 8, ptr %3) #5
store double %2, ptr %"box::Float64", align 8
ret ptr %"box::Float64"
}
[...]
Diff after this PR. Notice how each symbol gets the lowest possible integer
suffix that will make it unique to the module, and how the two specializations
for foo get different names:
@@ -4,18 +4,18 @@
target triple = "arm64-apple-darwin24.6.0"
-@"+Core.Float64#774" = external global ptr
+@"+Core.Float64#_0" = external global ptr
; Function Signature: baz(Int64)
; @ REPL[1]:2 within `baz`
-define double @julia_baz_772(i64 signext %"x::Int64") #0 {
+define double @julia_baz_0(i64 signext %"x::Int64") #0 {
top:
%pgcstack = call ptr @julia.get_pgcstack()
- %0 = call double @j_foo_775(i64 signext %"x::Int64")
- %1 = call double @j_foo_776(double %0)
+ %0 = call double @j_foo_0(i64 signext %"x::Int64")
+ %1 = call double @j_foo_1(double %0)
ret double %1
}
; Function Attrs: noinline optnone
-define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
+define nonnull ptr @jfptr_baz_0(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
top:
%pgcstack = call ptr @julia.get_pgcstack()
@@ -23,7 +23,7 @@
%1 = load ptr, ptr %0, align 8
%.unbox = load i64, ptr %1, align 8
- %2 = call double @julia_baz_772(i64 signext %.unbox)
- %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8
- %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64
+ %2 = call double @julia_baz_0(i64 signext %.unbox)
+ %"+Core.Float64#_0" = load ptr, ptr @"+Core.Float64#_0", align 8
+ %Float64 = ptrtoint ptr %"+Core.Float64#_0" to i64
%3 = inttoptr i64 %Float64 to ptr
%current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152
@@ -39,8 +39,8 @@
; Function Signature: foo(Int64)
-declare double @j_foo_775(i64 signext) #3
+declare double @j_foo_0(i64 signext) #3
; Function Signature: foo(Float64)
-declare double @j_foo_776(double) #4
+declare double @j_foo_1(double) #4
attributes #0 = { "frame-pointer"="all" "julia.fsig"="baz(Int64)" "probe-stack"="inline-asm" }
List of changes
- Many sources of statefulness and nondeterminism in the emitted LLVM IR have been eliminated, namely:
  - Function symbols defined for CodeInstances
  - Global symbols referring to data on the Julia heap
  - Undefined function symbols referring to invoked external CodeInstances
- jl_codegen_params_t has become jl_codegen_output_t. It now represents one Julia "translation unit". More than one CodeInstance can be emitted to the same jl_codegen_output_t, if desired, though in the JIT every CI gets its own right now. One motivation behind this is to allow us to emit code on multiple threads and avoid the bitcode serialize/deserialize step we currently do, if that proves worthwhile.
  When we are done emitting to a jl_codegen_output_t, we call .finish(), which discards the intermediate state and returns only the LLVM module and the info needed for linking (jl_linker_info_t).
- The new JLMaterializationUnit wraps compiled Julia object files and the associated jl_linker_info_t. It informs ORC that we can materialize symbols for the CIs defined by that output, and picks globally unique names for them. When it is materialized, it resolves all the call targets and generates trampolines for CodeInstances that are invoked but have the wrong calling convention, or are not yet compiled.
- We now postpone linking decisions to after codegen whenever possible. For example, emit_invoke no longer tries to find a compiled version of the CodeInstance, and it no longer generates trampolines to adapt calling conventions. jl_analyze_workqueue's job has been absorbed into JuliaOJIT::linkOutput.
- Some image_codegen differences have been removed:
  - Globals for Julia heap addresses no longer get initialized, so the resulting IR won't have the addresses embedded. I expect the impact of this to be small on RISC-y platforms, where it is typical to load address-sized values out of a constant pool.
  - Codegen no longer cares if a compiled CodeInstance came from an image. During ahead-of-time linking, we generate thunk functions that load the address from the fvars table.
- In jl_emit_native_impl, we now emit every CodeInstance into one jl_codegen_output_t. We also defer the creation of the llvm::Linker for llvmcalls, whose construction cost grows with the size of the destination module, until the very end.
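As a rough illustration of the split between codegen and linking described in the list above, here is a minimal sketch. Every type and function in it (`CodegenOutput`, `LinkerInfo`, `FinishedOutput`, `resolve`, the integer `CodeInstanceId`, and the trampoline naming) is a hypothetical simplification and not the PR's actual API. It assumes a finished unit only needs to remember which local symbols define or reference which CodeInstances, and it models only the "not yet compiled" trampoline case, not calling-convention mismatches.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using CodeInstanceId = int;  // hypothetical stand-in for a CodeInstance pointer

struct LinkerInfo {
    std::map<std::string, CodeInstanceId> defined;   // local symbol -> CI it defines
    std::map<std::string, CodeInstanceId> external;  // undefined local symbol -> invoked CI
};

struct FinishedOutput {
    std::string module_ir;  // stand-in for the LLVM module
    LinkerInfo info;
};

class CodegenOutput {
    std::string ir;
    LinkerInfo info;
    std::vector<std::string> scratch;  // intermediate state, dropped by finish()
public:
    void define(const std::string &sym, CodeInstanceId ci) {
        ir += "define @" + sym + "\n";
        info.defined[sym] = ci;
        scratch.push_back(sym);
    }
    void call_external(const std::string &sym, CodeInstanceId ci) {
        ir += "declare @" + sym + "\n";
        info.external[sym] = ci;
    }
    // Consume the unit, keeping only the module and the linking info.
    FinishedOutput finish() && {
        return FinishedOutput{std::move(ir), std::move(info)};
    }
};

// Link step: bind each undefined local symbol to the global name of an
// already-compiled callee, or to a trampoline name if there is none yet.
std::map<std::string, std::string>
resolve(const LinkerInfo &info, const std::map<CodeInstanceId, std::string> &compiled) {
    std::map<std::string, std::string> bindings;
    for (const auto &kv : info.external) {
        auto it = compiled.find(kv.second);
        bindings[kv.first] = it != compiled.end()
                                 ? it->second
                                 : "tojlinvoke_" + std::to_string(kv.second);
    }
    return bindings;
}
```

The point of the design is that `define` and `call_external` never consult the running session, so the emitted IR stays the same whether or not the callee happens to be compiled already; only `resolve` looks at session state.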
General refactoring
- Adapt the jl_callingconv_t enum from staticdata.c into jl_invoke_api_t and use it in more places. There is one enumerator for each special jl_callptr_t function that can go in a CodeInstance's invoke field, as well as one that indicates an invoke wrapper should be there. There is a convenience function for reading an invoke pointer and getting the API type, and vice versa.
- Avoid using magic string values, and try to directly pass pointers to LLVM Function * or ORC string pool entries when possible.
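The round trip between an invoke pointer and its API kind might look roughly like the sketch below. The enumerator names, the stand-in entry points, and the helpers `api_of` and `pointer_for` are all assumptions made for illustration; the real jl_invoke_api_t and its helpers may be named and shaped differently.

```cpp
#include <cassert>

using callptr_t = int (*)(int);  // hypothetical stand-in for jl_callptr_t

int special_args(int x) { return x; }          // stand-in for one special entry point
int special_const(int) { return 42; }          // stand-in for another special entry point
int compiled_wrapper(int x) { return x + 1; }  // stand-in for a generated invoke wrapper

// One enumerator per special entry point, plus one meaning "an invoke wrapper".
enum class InvokeApi { None, Args, Const, Wrapper };

// Read an invoke pointer and report which API it uses.
InvokeApi api_of(callptr_t p) {
    if (p == nullptr) return InvokeApi::None;
    if (p == special_args) return InvokeApi::Args;
    if (p == special_const) return InvokeApi::Const;
    return InvokeApi::Wrapper;  // any other pointer is assumed to be a wrapper
}

// Inverse direction: the pointer to store for a given API. A wrapper has no
// fixed address, so the caller has to supply it.
callptr_t pointer_for(InvokeApi api, callptr_t wrapper = nullptr) {
    switch (api) {
    case InvokeApi::Args: return special_args;
    case InvokeApi::Const: return special_const;
    case InvokeApi::Wrapper: return wrapper;
    case InvokeApi::None: break;
    }
    return nullptr;
}
```

Switching on an enum rather than comparing pointers or strings at each use site is what lets the compiler flag an unhandled case when a new entry point is added.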
Remaining TODO items
- [X] RTDyld: on this branch, it is removed completely. I will pursue one of these two options:
  - ~~Use the ahead-of-time linking to get it working again.~~
  - [X] Port over the memory management to JITLink and use that on all platforms.
- [ ] DLSymOptimizer is unused. It will be replaced with an ORC MaterializationUnit that, when materialized, defines the symbols as absolute addresses (with a fallback that generates a jlplt function).
- [ ] Since tojlinvoke and other trampolines don't take long to compile, we just compile them while holding the JuliaOJIT::LinkerMutex. Since we most often generate tojlinvoke wrappers when an invoked CodeInstance is not yet compiled, it is my intention to eventually replace this with a GOT/PLT mechanism that will also allow us to start running code before all of the edges are compiled.
- [ ] I have yet to measure the impact of global addresses not being visible to the LLVM optimizer or code generation. If it turns out to be important to have immediate addresses, I'd like to try using the addresses of external LLVM globals directly, since that can generate code with immediate relocations, and LLVM can assume the address won't alias.
- [ ] We should support ahead-of-time linking multiple jl_codegen_output_ts together.
- [ ] We still pass strings to emit_call_specfun_other, even though the prototype for the function is now created by jl_codegen_output_t::get_call_target. We should hold on to the calling convention info so it doesn't have to be recomputed.
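For the TODO item about replacing tojlinvoke wrappers with a GOT/PLT mechanism, the general idea can be sketched as follows. This is a generic illustration of an indirection slot, not a design taken from this PR; `GotSlot`, `slow_fallback`, and `fast_compiled` are hypothetical names. Callers always go through the slot, which first points at a fallback and is patched once the real callee has been compiled.

```cpp
#include <atomic>
#include <cassert>

using fn_t = int (*)(int);

int fallback_calls = 0;  // counts how often the slow path ran

// Stand-in for a trampoline used while the callee is not yet compiled.
int slow_fallback(int x) { ++fallback_calls; return x + 2; }
// Stand-in for the compiled callee that replaces it later.
int fast_compiled(int x) { return x + 2; }

// One GOT-like entry: a patchable pointer that callers load on every call.
struct GotSlot {
    std::atomic<fn_t> target;
    explicit GotSlot(fn_t initial) : target(initial) {}
    int call(int x) const { return target.load(std::memory_order_acquire)(x); }
    void patch(fn_t compiled) { target.store(compiled, std::memory_order_release); }
};
```

With a slot like this the caller's machine code never changes, so it can start running as soon as it is linked, and the edge is upgraded later by a single pointer store.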
> eliminating both nondeterminism and the effect of redefining methods in the same session
there are several open issues observing inference changes when methods are redefined; does this PR affect those?
No, this PR only changes code generation.
This new commit fixes some horrible code generation in emit_pkg_plt_thunk by just emitting inline assembly, using PLT thunks stolen from LLD. This will be less hacky when it happens after linking. Since that requires the renaming of symbols post-compilation, it is out of scope for this PR.
> it is my intention to eventually replace this with a GOT/PLT mechanism that will also allow us to start running code before all of the edges are compiled.
@pchintalapudi experimented with that (and there is some data in his thesis, and likely some old PR floating around)
IIRC there is a CompileOnDemandLayer
I'm guessing the PR in question is #44575? Some of the concerns have been addressed by #55106 and #56179, at least, so it would be easier to try this again now. JITLink also seems far more complete than it was then.