
Local names linking

Open xal-0 opened this issue 1 month ago • 5 comments

Overview

This PR overhauls the way linking works in Julia, both in the JIT and AOT. The point is to enable us to generate LLVM IR that depends only on the source IR, eliminating both nondeterminism and the effect of redefining methods in the same session. This serves two purposes. First, if the IR is predictable, we can cache the compilation by using the bitcode hash as a key, like how the ThinLTO cache works. #58592 was an early experiment along these lines. Second, we can reuse work that was done in a previous session, like pkgimages, but for the JIT.
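To make the caching idea concrete, here is a minimal, hypothetical sketch of a content-addressed compile cache keyed by a hash of the bitcode. FNV-1a stands in for whatever hash a real cache would use, and none of these names come from the Julia sources:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// FNV-1a over the serialized bitcode. The principle: deterministic IR in,
// deterministic cache key out.
uint64_t fnv1a(const std::string &bitcode) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : bitcode) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hypothetical cache: compiled objects keyed by bitcode hash, in the same
// spirit as ThinLTO's incremental build cache.
struct CompileCache {
    std::unordered_map<uint64_t, std::string> objects;

    const std::string *lookup(const std::string &bitcode) const {
        auto it = objects.find(fnv1a(bitcode));
        return it == objects.end() ? nullptr : &it->second;
    }
    void insert(const std::string &bitcode, std::string object) {
        objects[fnv1a(bitcode)] = std::move(object);
    }
};
```

This only pays off if codegen is deterministic: any session-dependent symbol name or embedded heap address in the IR would perturb the hash and defeat the cache, which is exactly what this PR eliminates.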

We accomplish this by generating names that are unique only within the current LLVM module, removing most uses of the globalUniqueGeneratedNames counter. The replacement for jl_codegen_params_t, jl_codegen_output_t, represents a Julia "translation unit", and tracks the information we'll need to link the compiled module into the running session. When linking, we manipulate the JITLink LinkGraph (after compilation) instead of renaming functions in the LLVM IR (before).
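The per-module naming scheme can be illustrated with a small hypothetical helper (not the actual implementation): each base name gets the lowest unused integer suffix within one module, so a fresh module always produces the same names regardless of session history.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Sketch: assign each symbol the lowest integer suffix that makes it unique
// within the current module, instead of bumping a process-global counter
// like globalUniqueGeneratedNames.
class LocalNamer {
    std::unordered_map<std::string, unsigned> next_suffix;
public:
    std::string unique(const std::string &base) {
        unsigned n = next_suffix[base]++;  // starts at 0 for a fresh base
        return base + "_" + std::to_string(n);
    }
};
```

With this scheme, the two specializations of `foo` in the example below naturally come out as `j_foo_0` and `j_foo_1`, so the emitted IR depends only on emission order within the module.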

Example

julia> @noinline foo(x) = x + 2.0
       baz(x) = foo(foo(x))

       code_llvm(baz, (Int64,); dump_module=true, optimize=false)

Nightly:

[...]
@"+Core.Float64#774" = private unnamed_addr constant ptr @"+Core.Float64#774.jit"
@"+Core.Float64#774.jit" = private alias ptr, inttoptr (i64 4797624416 to ptr)

; Function Signature: baz(Int64)
;  @ REPL[1]:2 within `baz`
define double @julia_baz_772(i64 signext %"x::Int64") #0 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %0 = call double @j_foo_775(i64 signext %"x::Int64")
  %1 = call double @j_foo_776(double %0)
  ret double %1
}

; Function Attrs: noinline optnone
define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %0 = getelementptr inbounds i8, ptr %"args::Any[]", i32 0
  %1 = load ptr, ptr %0, align 8
  %.unbox = load i64, ptr %1, align 8
  %2 = call double @julia_baz_772(i64 signext %.unbox)
  %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8
  %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64
  %3 = inttoptr i64 %Float64 to ptr
  %current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152
  %"box::Float64" = call noalias nonnull align 8 dereferenceable(8) ptr @julia.gc_alloc_obj(ptr %current_task, i64 8, ptr %3) #5
  store double %2, ptr %"box::Float64", align 8
  ret ptr %"box::Float64"
}
[...]

Diff after this PR. Notice how each symbol gets the lowest possible integer suffix that will make it unique to the module, and how the two specializations for foo get different names:

@@ -4,18 +4,18 @@
 target triple = "arm64-apple-darwin24.6.0"
 
-@"+Core.Float64#774" = external global ptr
+@"+Core.Float64#_0" = external global ptr
 
 ; Function Signature: baz(Int64)
 ;  @ REPL[1]:2 within `baz`
-define double @julia_baz_772(i64 signext %"x::Int64") #0 {
+define double @julia_baz_0(i64 signext %"x::Int64") #0 {
 top:
   %pgcstack = call ptr @julia.get_pgcstack()
-  %0 = call double @j_foo_775(i64 signext %"x::Int64")
-  %1 = call double @j_foo_776(double %0)
+  %0 = call double @j_foo_0(i64 signext %"x::Int64")
+  %1 = call double @j_foo_1(double %0)
   ret double %1
 }
 
 ; Function Attrs: noinline optnone
-define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
+define nonnull ptr @jfptr_baz_0(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 {
 top:
   %pgcstack = call ptr @julia.get_pgcstack()
@@ -23,7 +23,7 @@
   %1 = load ptr, ptr %0, align 8
   %.unbox = load i64, ptr %1, align 8
-  %2 = call double @julia_baz_772(i64 signext %.unbox)
-  %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8
-  %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64
+  %2 = call double @julia_baz_0(i64 signext %.unbox)
+  %"+Core.Float64#_0" = load ptr, ptr @"+Core.Float64#_0", align 8
+  %Float64 = ptrtoint ptr %"+Core.Float64#_0" to i64
   %3 = inttoptr i64 %Float64 to ptr
   %current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152
@@ -39,8 +39,8 @@
 
 ; Function Signature: foo(Int64)
-declare double @j_foo_775(i64 signext) #3
+declare double @j_foo_0(i64 signext) #3
 
 ; Function Signature: foo(Float64)
-declare double @j_foo_776(double) #4
+declare double @j_foo_1(double) #4
 
 attributes #0 = { "frame-pointer"="all" "julia.fsig"="baz(Int64)" "probe-stack"="inline-asm" }

List of changes

  • Many sources of statefulness and nondeterminism in the emitted LLVM IR have been eliminated, namely:

    • Function symbols defined for CodeInstances
    • Global symbols referring to data on the Julia heap
    • Undefined function symbols referring to invoked external CodeInstances
  • jl_codegen_params_t has become jl_codegen_output_t. It now represents one Julia "translation unit". More than one CodeInstance can be emitted to the same jl_codegen_output_t, if desired, though in the JIT every CI currently gets its own. One motivation is to let us emit code on multiple threads and avoid the bitcode serialize/deserialize step we currently do, if that proves worthwhile.

    When we are done emitting to a jl_codegen_output_t, we call .finish(), which discards the intermediate state and returns only the LLVM module and the info needed for linking (jl_linker_info_t).

  • The new JLMaterializationUnit wraps compiled Julia object files and the associated jl_linker_info_t. It informs ORC that we can materialize symbols for the CIs defined by that output, and picks globally unique names for them. When it is materialized, it resolves all the call targets and generates trampolines for CodeInstances that are invoked but have the wrong calling convention, or are not yet compiled.

  • We now postpone linking decisions to after codegen whenever possible. For example, emit_invoke no longer tries to find a compiled version of the CodeInstance, and it no longer generates trampolines to adapt calling conventions. jl_analyze_workqueue's job has been absorbed into JuliaOJIT::linkOutput.

  • Some image_codegen differences have been removed:

    • Globals for Julia heap addresses no longer get initialized, so the resulting IR won't have the addresses embedded. I expect the impact of this to be small on RISC-y platforms, where it is typical to load address-sized values out of a constant pool.
    • Codegen no longer cares if a compiled CodeInstance came from an image. During ahead-of-time linking, we generate thunk functions that load the address from the fvars table.
  • jl_emit_native_impl now emits every CodeInstance into one jl_codegen_output_t. We also defer creating the llvm::Linker for llvmcalls, whose construction cost grows with the size of the destination module, until the very end.
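The jl_codegen_output_t lifecycle described above (emit one or more CodeInstances, then call .finish() to discard intermediate state) can be sketched conceptually. All types here are illustrative stand-ins, not the real C++ definitions; strings stand in for the LLVM module:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct LinkerInfo {          // stands in for jl_linker_info_t
    std::vector<std::string> defined_syms;   // CIs defined by this output
    std::vector<std::string> call_targets;   // unresolved invoke edges
};

struct FinishedOutput {      // what .finish() hands to the linker
    std::string module_ir;   // the LLVM module (bitcode in reality)
    LinkerInfo link_info;
};

class CodegenOutput {        // stands in for jl_codegen_output_t
    std::string ir;
    LinkerInfo info;
public:
    // More than one CodeInstance may be emitted into the same output.
    void emit(const std::string &sym, const std::string &body) {
        info.defined_syms.push_back(sym);
        ir += body;
    }
    void add_call_target(const std::string &sym) {
        info.call_targets.push_back(sym);
    }
    // Discards intermediate state; only the module and the info needed
    // for linking survive.
    FinishedOutput finish() && {
        return {std::move(ir), std::move(info)};
    }
};
```

The rvalue-qualified `finish()` mirrors the one-way nature of the real API: once finished, the output can no longer be emitted into.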

General refactoring

  • Adapt the jl_callingconv_t enum from staticdata.c into jl_invoke_api_t and use it in more places. There is one enumerator for each special jl_callptr_t function that can go in a CodeInstance's invoke field, as well as one that indicates an invoke wrapper should be there. There is a convenience function for reading an invoke pointer and getting the API type, and vice versa.
  • Avoid using magic string values, and try to directly pass pointers to LLVM Function * or ORC string pool entries when possible.
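To illustrate the jl_invoke_api_t idea, here is a hedged sketch. The enumerator names and entry-point functions below are invented stand-ins, not the real jl_callptr_t functions:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: one enumerator per special entry point that can sit
// in a CodeInstance's invoke field, plus one meaning "an invoke wrapper
// belongs here". Names are hypothetical.
enum class InvokeApi : uint8_t {
    Null,          // not yet compiled
    Interpret,     // interpreter entry point
    Args,          // boxed-arguments calling convention
    Sparam,        // boxed arguments + static parameters
    Wrapper,       // a generated invoke wrapper (jfptr_*)
};

using InvokePtr = void (*)();

// Hypothetical stand-ins for the special entry points.
void interpret_call() {}
void args_call() {}
void sparam_call() {}

// Convenience mapping in one direction: invoke pointer -> API kind. The
// reverse mapping (API kind -> well-known pointer) would be a switch.
InvokeApi classify(InvokePtr p) {
    if (p == nullptr)         return InvokeApi::Null;
    if (p == &interpret_call) return InvokeApi::Interpret;
    if (p == &args_call)      return InvokeApi::Args;
    if (p == &sparam_call)    return InvokeApi::Sparam;
    return InvokeApi::Wrapper; // anything else is a compiled wrapper
}
```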

Remaining TODO items

  • [X] RTDyld: on this branch, it is removed completely. I will pursue one of these two options:

    • ~~Use the ahead-of-time linking to get it working again.~~
    • [X] Port over the memory management to JITLink and use that on all platforms.

  • [ ] DLSymOptimizer is unused. It will be replaced with an ORC MaterializationUnit that, when materialized, defines the symbols as absolute addresses (with a fallback that generates a jlplt function).

  • [ ] Since tojlinvoke and other trampolines don't take long to compile, we just compile them while holding the JuliaOJIT::LinkerMutex. Because we most often generate tojlinvoke wrappers when an invoked CodeInstance is not yet compiled, I intend to eventually replace this with a GOT/PLT mechanism that will also allow us to start running code before all of the edges are compiled.

  • [ ] I have yet to measure the impact of global addresses not being visible to the LLVM optimizer or code generation. If it turns out to be important to have immediate addresses, I'd like to try using the addresses of external LLVM globals directly, since that can generate code with immediate relocations, and LLVM can assume the address won't alias.

  • [ ] We should support ahead-of-time linking multiple jl_codegen_output_ts together.

  • [ ] We still pass strings to emit_call_specfun_other, even though the prototype for the function is now created by jl_codegen_output_t::get_call_target. We should hold on to the calling convention info so it doesn't have to be recomputed.
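The GOT/PLT mechanism mentioned in the TODOs can be sketched as an indirection slot: call sites always jump through a pointer that initially targets a slow-path resolver and is patched once the real code exists. This is a generic illustration of the technique, not Julia's implementation:

```cpp
#include <atomic>
#include <cassert>

// Sketch of one GOT slot. Callers always call through the slot, so code
// can start running before the callee is compiled; the resolver compiles
// (or here, merely installs) the target and patches the slot.
using Fn = long (*)(long);

static long slow_resolver(long x);            // forward declaration

static std::atomic<Fn> got_slot{&slow_resolver};

static long compiled_target(long x) {         // the "real" compiled code
    return x + 2;
}

static long slow_resolver(long x) {
    // In a real JIT this would trigger compilation of the target; here we
    // just patch the slot and tail through to the compiled code.
    got_slot.store(&compiled_target, std::memory_order_release);
    return compiled_target(x);
}

// What emitted call sites do: an indirect call through the slot.
long call_through_plt(long x) {
    return got_slot.load(std::memory_order_acquire)(x);
}
```

The first call goes through the resolver; every later call lands directly on the compiled code, with no need for the caller's edges to be compiled up front.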

xal-0 avatar Nov 04 '25 00:11 xal-0

eliminating both nondeterminism and the effect of redefining methods in the same session

there are several open issues observing inference changes when methods are redefined; does this PR affect those?

adienes avatar Nov 14 '25 01:11 adienes

No, this PR only changes code generation.

xal-0 avatar Nov 14 '25 18:11 xal-0

This new commit fixes some horrible code generation in emit_pkg_plt_thunk by just emitting inline assembly, using PLT thunks stolen from LLD. This will be less hacky when it happens after linking. Since that requires the renaming of symbols post-compilation, it is out of scope for this PR.

xal-0 avatar Nov 18 '25 23:11 xal-0

it is my intention to eventually replace this with a GOT/PLT mechanism that will also allow us to start running code before all of the edges are compiled.

@pchintalapudi experimented with that (and there is some data in his thesis, and likely some old PR floating around)

IIRC there is a CompileOnDemandLayer

vchuravy avatar Dec 01 '25 20:12 vchuravy

I'm guessing the PR in question is #44575? Some of the concerns have been addressed by #55106 and #56179, at least, so it would be easier to try this again now. JITLink also seems far more complete than it was then.

xal-0 avatar Dec 01 '25 20:12 xal-0