[RFC] thinLTO for SYCL
Hi all,
I'd like to get some architecture and code-organization level comments on my SYCL thinLTO prototype.
The feature isn't ready for code-review, so I'd appreciate keeping the discussion somewhat high level.
Firstly, if someone wants to try this, you can just clone my branch, build the compiler with --cmake-opt="-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=SPIRV", and then compile SYCL programs with --offload-new-driver -foffload-lto=thin.
The main idea of the change is as follows:
We hook into the existing LTO framework in clang-linker-wrapper by adding a callback that runs right before the backend runs (so right before the SPIR-V backend would be called).
Before the callback is run, function importing happens automatically inside the thinLTO framework where we link in the definitions for any functions referenced in this module but defined in another. When the callback is run, the module passed in should have everything linked in and be ready for processing.
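As a sketch of how that hook could be wired up, assuming we use the PreCodeGenModuleHook field of the existing lto::Config (the exact field and signature used in the prototype may differ):

```cpp
#include "llvm/IR/Module.h"
#include "llvm/LTO/Config.h"

// Sketch: register a per-module callback that fires right before codegen,
// i.e. after thinLTO function importing has populated the module.
void addSYCLPreCodeGenHook(llvm::lto::Config &Conf) {
  Conf.PreCodeGenModuleHook = [](unsigned Task, const llvm::Module &M) {
    // Hypothetical SYCL-specific processing would go here: compute module
    // properties, build the symbol table, hand the module to the SPIR-V step.
    return true; // let the normal backend pipeline continue
  };
}
```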
In that callback, we do all the required processing of the SYCL module. In the current prototype, that includes:
- Generating the module properties
- Generating the module symbol table
- Calling the SPIR-V translator
In the future, we will also need to handle at least spec constants and internalization of non-entry points.
The final step of the callback, calling the SPIR-V translator, should be removed when the SPIR-V backend is ready.
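To illustrate the symbol-table step, here is a minimal sketch that collects kernel entry-point names from the module; identifying entry points via the spir_kernel calling convention is an assumption made for the example, not necessarily what the prototype does:

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include <string>
#include <vector>

// Collect the names of kernel entry points as a simple symbol table.
std::vector<std::string> buildSymbolTable(const llvm::Module &M) {
  std::vector<std::string> Symbols;
  for (const llvm::Function &F : M)
    if (!F.isDeclaration() &&
        F.getCallingConv() == llvm::CallingConv::SPIR_KERNEL)
      Symbols.push_back(F.getName().str());
  return Symbols;
}
```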
One problem is that we need to store the module properties, symbol table, and SPIR-V translator output file path until a later time where we are ready to create the final images, which requires all of this information.
We are also somewhat limited in how we can store the data, because the callbacks run on threads inside the LTO framework, and some data structures (like StringSaver) are not thread-safe. We need to collect this information inside the LTO framework thread because that is the only point where we have the IR with functions imported.
As a low-effort hack to get the prototype working, I'm just storing that information as std::string, which is pretty inefficient and gross.
I decided to store the information inside the OffloadBinary. Otherwise, we'd need to create and pass a SYCL-specific data structure all over the place, which seems bad. The current change to OffloadBinary to add a temporary string vector is definitely disgusting, but I'm proposing we make some change to the class to allow us to store the information we need; I just don't have a good idea for a clean and thread-safe data structure to use, or a better place to store all this information.
In terms of test results:
Here are the results when running E2E tests on PVC with --offload-new-driver only enabled:
Total Discovered Tests: 2220
Unsupported : 543 (24.46%)
Passed : 1553 (69.95%)
Expectedly Failed: 52 (2.34%)
Failed : 72 (3.24%)
And here are the results when running E2E tests on PVC with --offload-new-driver -foffload-lto=thin, so enabling thinLTO:
Total Discovered Tests: 2220
Unsupported : 543 (24.46%)
Passed : 1466 (66.04%)
Expectedly Failed: 52 (2.34%)
Timed Out : 1 (0.05%)
Failed : 158 (7.12%)
It seems we have a decent pass rate, with only 87 thinLTO-specific E2E failures. I'm holding off on investigating these until we finalize the architecture. Once it's finalized, I'll make a design document to be submitted to this repository.
I'd like to get some architecture and code-organization level comments on my SYCL thinLTO prototype.
The feature isn't ready for code-review, so I'd appreciate keeping the discussion somewhat high level.
Let's create a design doc for the thinLTO feature. It will be easier to have this conversation when the design and open questions are kept in a separate document versioned by Git.
The main idea of the change is as follows:
We hook into the existing LTO framework in clang-linker-wrapper by adding a callback that runs right before the backend runs (so right before the SPIR-V backend would be called). Before the callback is run, function importing happens automatically inside the thinLTO framework where we link in the definitions for any functions referenced in this module but defined in another. When the callback is run, the module passed in should have everything linked in and be ready for processing.
In that callback, we do all the required processing of the SYCL module. In the current prototype, that includes:
- Generating the module properties
- Generating the module symbol table
- Calling the SPIR-V translator
In the future, we will also need to handle at least spec constants and internalization of non-entry points.
The final step of the callback, calling the SPIR-V translator, should be removed when the SPIR-V backend is ready.
One problem is that we need to store the module properties, symbol table, and SPIR-V translator output file path until a later time where we are ready to create the final images, which requires all of this information.
I would prefer if we do all "the required processing" in the standard LLVM pipeline.
- Generating the module properties.
- Generating the module symbol table.
- Calling the SPIR-V translator. The SPIR-V translator will be replaced by the SPIR-V backend.
- Handle at least spec constants. I'm not 100% sure I fully understand the scope of work we need for "handling", but assuming that it just requires transforming LLVM IR, this should be done by injecting an additional pass into the standard pipeline. Hopefully this doesn't require storing additional information outside of the LLVM IR module.
- Internalization of non-entry points. AFAIK, "internalization" is a very common task. Some sorts of internalization are done by the LTO framework itself and by the NVPTX and AMDGPU targets (see the sketch right after this list). A couple of links to the source code: https://github.com/intel/llvm/blob/ee975312fd443e19edc5e0674fe693cb660255ae/llvm/lib/Transforms/IPO/Internalize.cpp#L43-L52, https://github.com/intel/llvm/blob/ee975312fd443e19edc5e0674fe693cb660255ae/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp#L213-L218.
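For illustration, a sketch of how the existing internalizeModule utility (from the first link above) could be driven with a SYCL-flavored filter; using the spir_kernel calling convention as the "must preserve" predicate is only an assumption for this example:

```cpp
#include "llvm/IR/Module.h"
#include "llvm/Transforms/IPO/Internalize.h"

using namespace llvm;

// Internalize everything except (assumed) SYCL entry points so later
// optimizations are free to fold or drop non-entry-point definitions.
void internalizeNonEntryPoints(Module &M) {
  internalizeModule(M, [](const GlobalValue &GV) {
    // Preserve kernel entry points; everything else may be internalized.
    if (const auto *F = dyn_cast<Function>(&GV))
      return F->getCallingConv() == CallingConv::SPIR_KERNEL;
    return false;
  });
}
```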
We should consider doing steps 1 and 2 outside of the LTO framework, i.e. either before LTO or after. The SPIR-V module has information about the properties and exported symbols. My understanding is that the first two steps extract this information from the LLVM IR module into a data structure, which the execution environment (in our case the SYCL runtime) uses to "handle" the optimized code at runtime (e.g. JIT compile and/or execute).
@bader, can you please clarify 'standard LLVM pipeline'? Sorry if the question is too naive. Wanted to check before adding my comments here. Thanks
I mean the pipeline built here: https://github.com/intel/llvm/blob/283073a4f21aa44203fd49e25b4ff0379c91b29a/llvm/include/llvm/Passes/PassBuilder.h#L264-L282
Thanks all for the feedback, I'll start replying/addressing it shortly.
@bader
Thanks for the detailed feedback.
Let's create a design doc for the thinLTO feature. It will be easier to have this conversation when the design and open questions are kept in a separate document versioned by Git.
Okay, will work on it.
We should consider doing steps 1 and 2 outside of the LTO framework, i.e. either before LTO or after. The SPIR-V module has information about the properties and exported symbols. My understanding is that the first two steps extract this information from the LLVM IR module into a data structure, which the execution environment (in our case the SYCL runtime) uses to "handle" the optimized code at runtime (e.g. JIT compile and/or execute).
Correct, we need to extract this information into a data structure which will finally be used by the runtime for many things; optimized args is one case. I originally wanted to do it inside the LTO framework so that we wouldn't have to look at the IR at all outside of the framework for performance reasons (thinLTO uses BitcodeModule, which is lazily loaded), but since we don't have one single module anymore and we shouldn't need functions linked in to compute this information, I think it's reasonable to do it outside the framework. Let me try this out.
Handle at least spec constants. I'm not 100% sure I fully understand the scope of work we need for "handling", but assuming that it just requires transforming LLVM IR, this should be done by injecting an additional pass into the standard pipeline. Hopefully this doesn't require storing additional information outside of the LLVM IR module.
The problem with spec constants is that based on comments from @AlexeySachkov, we need the entire IR for the module to accurately do the processing, stubs for functions in other TUs won't cut it. The only time that is true with thinLTO is inside the thinLTO framework after function importing. @AlexeySachkov Let me know if I'm wrong about this requirement.
Internalization of non-entry points. AFAIK, "internalization" is a very common task. Some sorts of internalization are done by the LTO framework itself and by the NVPTX and AMDGPU targets. A couple of links to the source code:
Yeah really good point here, we should just be able to make/modify a pass that is called by LTO. Let me work on this.
If these ideas work I think we can clean this up a lot. Will report back when I have more info. Thanks again.
One problem is that we need to store the module properties, symbol table, and SPIR-V translator output file path until a later time where we are ready to create the final images, which requires all of this information.
Why do we need the SPIR-V translator output file path to be generated at this stage? Why can't it be generated directly when we call the translator? Why do we even need a file here? The translator has a library interface, and if it lacks an API that works on memory buffers, then let's just add such an API there so we don't have to deal with files at all.
We are also somewhat limited in how we can store the data, because the callbacks run on threads inside the LTO framework, and some data structures (like StringSaver) are not thread-safe. We need to collect this information inside the LTO framework thread because that is the only point where we have the IR with functions imported.
Could you please clarify why we have a problem here at all? My assumption is that every thinLTO thread works on a separate translation unit and therefore produces a separate module and OffloadBinary object, i.e. it doesn't touch properties of other modules.
In any case, one more thing that we can do is to slightly refactor how we deal with those properties to delay recording them into OffloadBinary. What we can do is to record those properties into module named metadata at first and then extract them when we need to write them into the final device image data structure. This is essentially how sycl-post-link operates, and my guess is that there should be a point in time where we perform a final wrapping of a single module - at that point we can read the metadata and copy/move it to OffloadBinary.
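A sketch of that idea, with a hypothetical metadata name: a pass records the properties as module named metadata, and they are read back at final wrapping time.

```cpp
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// In the pass: record one property as module named metadata
// ("sycl.module.properties" is a made-up name for this sketch).
void recordProperty(Module &M, StringRef Key, StringRef Value) {
  LLVMContext &Ctx = M.getContext();
  NamedMDNode *MD = M.getOrInsertNamedMetadata("sycl.module.properties");
  MD->addOperand(
      MDNode::get(Ctx, {MDString::get(Ctx, Key), MDString::get(Ctx, Value)}));
}

// Later, at wrapping time: read the properties back out of the module.
void readProperties(const Module &M) {
  const NamedMDNode *MD = M.getNamedMetadata("sycl.module.properties");
  if (!MD)
    return;
  for (unsigned I = 0, E = MD->getNumOperands(); I != E; ++I) {
    const MDNode *Entry = MD->getOperand(I);
    StringRef Key = cast<MDString>(Entry->getOperand(0))->getString();
    StringRef Value = cast<MDString>(Entry->getOperand(1))->getString();
    (void)Key; (void)Value; // copy/move into the device image data structure
  }
}
```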
The problem with spec constants is that based on comments from @AlexeySachkov, we need the entire IR for the module to accurately do the processing, stubs for functions in other TUs won't cut it. The only time that is true with thinLTO is inside the thinLTO framework after function importing. @AlexeySachkov Let me know if I'm wrong about this requirement.
That is correct, spec constants won't work if transformation is done before link phase. Reason for that is the feature design: in SYCL spec constants are identified by non-type template parameters (somewhat similar to kernels) and under the hood we use strings emitted for us by a special built-in similar to unique stable name. The pass then converts those strings into integers and records a mapping string -> integer. That is needed because at SPIR-V level spec constants are identified by their IDs. If we have two translation units which both use spec constants and they are linked together, then we can't perform string to ID replacement before link stage. Simply because it can happen that we will record two different spec constants to have the same integer ID. Theoretically, we could do a renumeration at this point, but there is also a case with default values where global (module-wide) order of spec constants handling matters and we won't be able to easily (or at all) re-enumerate them in that case, it is different from SPIR-V (native) path.
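To make the clash concrete with a toy example (all names made up): if each module's spec constants are numbered independently starting from zero, two modules processed before linking can claim the same SPIR-V ID.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model of the per-module "symbolic string -> integer ID" assignment.
std::map<std::string, unsigned>
assignSpecConstantIDs(const std::vector<std::string> &SymbolicIDs) {
  std::map<std::string, unsigned> IDs;
  unsigned Next = 0;
  for (const std::string &S : SymbolicIDs)
    IDs.emplace(S, Next++);
  return IDs;
}

// TU1: assignSpecConstantIDs({"spec_const_foo"}) -> {foo -> 0}
// TU2: assignSpecConstantIDs({"spec_const_bar"}) -> {bar -> 0}
// Linking TU1 and TU2 after this step yields two distinct spec constants
// that both claim ID 0, which is exactly the clash described above.
```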
Note that spec constants is only one example, but there are (and will be) likely more passes that require fully linked module to work correctly.
Generating the module properties.
@bader, while we can prepare properties in the standard LLVM pipeline, we can't generate their final representation within the pipeline, simply because they have to be stored within a wrapped object that is generated by clang-linker-wrapper and not by the optimization pipeline.
Internalization of non-entry points. AFAIK, "internalization" is a very common task. Some sorts of internalization are done by the LTO framework itself and by the NVPTX and AMDGPU targets. A couple of links to the source code:
Yeah really good point here, we should just be able to make/modify a pass that is called by LTO. Let me work on this.
Note that we already use a standard LLVM pass for internalization and we always launch it, just with different "filters", see #14976
@AlexeySachkov Thanks for the feedback.
Why do we need the SPIR-V translator output file path to be generated at this stage? Why can't it be generated directly when we call the translator? Why do we even need a file here? The translator has a library interface, and if it lacks an API that works on memory buffers, then let's just add such an API there so we don't have to deal with files at all.
We don't need a file, but we need to somehow get the output of the SPIR-V translation process outside of the thinLTO thread, and the mapping from the input file vector to the output file vector is not 1-to-1. I have an idea of how we can make that somewhat cleaner as well as address this feedback to use the library and not use files. Will work on this and propose it here.
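As an illustration of the "library, not files" direction, a sketch that translates into an in-memory buffer, assuming the writeSpirv entry point from the translator's LLVMSPIRVLib interface:

```cpp
#include "LLVMSPIRVLib.h" // SPIRV-LLVM-Translator library interface
#include "llvm/IR/Module.h"
#include <sstream>
#include <string>

// Translate a module to SPIR-V into an in-memory buffer so the result can be
// handed to the wrapping step without going through the filesystem.
bool translateToSPIRV(llvm::Module &M, std::string &SPIRVOut,
                      std::string &ErrMsg) {
  std::ostringstream OS;
  if (!llvm::writeSpirv(&M, OS, ErrMsg))
    return false;
  SPIRVOut = OS.str();
  return true;
}
```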
Could you please clarify why we have a problem here at all? My assumption is that every thinLTO thread works on a separate translation unit and therefore produces a separate module and OffloadBinary object, i.e. it doesn't touch properties of other modules.
In any case, one more thing that we can do is to slightly refactor how we deal with those properties to delay recording them into OffloadBinary. What we can do is to record those properties into module named metadata at first and then extract them when we need to write them into the final device image data structure. This is essentially how sycl-post-link operates, and my guess is that there should be a point in time where we perform a final wrapping of a single module - at that point we can read the metadata and copy/move it to OffloadBinary.
The main problem I noticed was trying to save StringRefs created on the thinLTO thread. To save StringRefs in general, we need StringSaver, but StringSaver is not thread-safe. I'm somewhat embarrassed to say I didn't investigate the issue to the root cause but just worked around it with the std::string hack. I think I can remove the hack while addressing the feedback given here.
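For completeness, one alternative to the std::string copies would be to serialize access to a shared StringSaver; this is just a sketch of the idea, not what the prototype does:

```cpp
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Allocator.h"
#include "llvm/Support/StringSaver.h"
#include <mutex>

// Guard a shared StringSaver with a mutex so strings created on thinLTO
// worker threads remain valid (and safe to create) after the callback returns.
class LockedStringSaver {
  llvm::BumpPtrAllocator Alloc;
  llvm::StringSaver Saver{Alloc};
  std::mutex Lock;

public:
  llvm::StringRef save(llvm::StringRef S) {
    std::lock_guard<std::mutex> Guard(Lock);
    return Saver.save(S);
  }
};
```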
Saving it as metadata is a good idea, will try that. Not ideal, but should be a lot cleaner on the clang-linker-wrapper code.
That is correct, spec constants won't work if transformation is done before link phase. Reason for that is the feature design: in SYCL spec constants are identified by non-type template parameters (somewhat similar to kernels) and under the hood we use strings emitted for us by a special built-in similar to unique stable name. The pass then converts those strings into integers and records a mapping string -> integer. That is needed because at SPIR-V level spec constants are identified by their IDs. If we have two translation units which both use spec constants and they are linked together, then we can't perform string to ID replacement before link stage. Simply because it can happen that we will record two different spec constants to have the same integer ID. Theoretically, we could do a renumeration at this point, but there is also a case with default values where global (module-wide) order of spec constants handling matters and we won't be able to easily (or at all) re-enumerate them in that case, it is different from SPIR-V (native) path.
Note that spec constants is only one example, but there are (and will be) likely more passes that require fully linked module to work correctly.
My plan is to add passes to the thinLTO pipeline of passes to run using the PreCodeGenPassesHook field of LTOConfig, but unfortunately we need to use the old pass manager because that's all it supports. It seems it uses the old pass manager because a lot of the codegen passes still do. We should be able to add passes as needed here, and if the legacy pass manager stuff becomes too annoying we can make a single legacy pass for us to call that calls everything we need.
Overall, thanks a lot for the feedback, you've given me some ideas I should be able to use to clean up the implementation a lot.
My plan is to add passes to the thinLTO pipeline of passes to run using the PreCodeGenPassesHook field of LTOConfig, but unfortunately we need to use the old pass manager because that's all it supports. It seems it uses the old pass manager because a lot of the codegen passes still do. We should be able to add passes as needed here, and if the legacy pass manager stuff becomes too annoying we can make a single legacy pass for us to call that calls everything we need.
I suppose that it should be possible to wrap "new" passes into the "legacy" interface so we can run them in both pipelines without having to duplicate them completely. At least the vice versa wrapping was possible and that's what we used during the transition to the new pass manager.
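A sketch of that wrapping for the pre-codegen hook; SYCLModuleProcessingPass is a hypothetical stand-in for whatever new-PM pass we end up with, and the usage in the trailing comment assumes the PreCodeGenPassesHook signature:

```cpp
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Pass.h"

using namespace llvm;

// Hypothetical new-PM pass doing the SYCL-specific module processing.
struct SYCLModuleProcessingPass : PassInfoMixin<SYCLModuleProcessingPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    // ... actual processing ...
    return PreservedAnalyses::all();
  }
};

// Legacy wrapper so the same logic can be added through a legacy::PassManager.
struct SYCLModuleProcessingLegacy : ModulePass {
  static char ID;
  SYCLModuleProcessingLegacy() : ModulePass(ID) {}
  bool runOnModule(Module &M) override {
    // An empty analysis manager is only valid if the pass requests no analyses.
    ModuleAnalysisManager MAM;
    return !SYCLModuleProcessingPass().run(M, MAM).areAllPreserved();
  }
};
char SYCLModuleProcessingLegacy::ID = 0;

// Usage inside clang-linker-wrapper's LTO setup (sketch):
//   Conf.PreCodeGenPassesHook = [](legacy::PassManager &PM) {
//     PM.add(new SYCLModuleProcessingLegacy());
//   };
```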
Note that spec constants is only one example, but there are (and will be) likely more passes that require fully linked module to work correctly.
In my opinion, this is a significant disadvantage of the design and should be avoided. There is an extension proposal with an alternative API providing similar functionality without the design limitations of the spec constants API.
To clarify: by fully linked I only mean that further linking of two modules that have undergone spec constants processing can't be performed, because spec constants will be broken. It doesn't really require to link all kernels together. We just can't link them together afterwards.
I agree that this is a problem and it effectively prevents use of spec constants defined within a shared library by an application that's linked to that library.
To clarify: by fully linked I only mean that further linking of two modules that have undergone spec constants processing can't be performed, because spec constants will be broken. It doesn't really require to link all kernels together. We just can't link them together afterwards.
Do you mean it would be semantically correct to run spec constants processing individually on each module, as long as we only do it once per output file? So if we had 2 cpp files input to clang, each with 2 splits, assuming arbitrary cross-TU dependencies and the cross-TU dependencies not linked in, we'd have 4 output modules; could we run the pass at any point in time on each module, as long as we only do it once? We wouldn't even need to run it inside thinLTO then, we could do it before, although it might be cleaner to do it there anyway.
Do you mean it would be semantically correct to run spec constants processing individually on each module, as long as we only do it once per output file? So if we had 2 cpp files input to clang each with 2 splits assuming arbitrary cross-TU dependencies, we'll have 4 output modules, so could we run the pass at any point in time on each module, as long as we only do it once?
Right.
We wouldn't even need to run it inside thinLTO then, we could do it before or after.
I'm not sure about before, because thinLTO may bring a function that uses spec constants into a kernel that also uses spec constants, which will lead to an integer ID clash. After is perfectly fine, though. That's how it is currently implemented except that we have full instead of thin and no optimizations :)
UPD: if we assume that thinLTO doesn't change a module, i.e. it won't bring anything new to it, then yes, we can run a pass even before thinLTO
I'm not sure about before, because thinLTO may bring a function that uses spec constants into a kernel that also uses spec constants, which will lead to an integer ID clash.
UPD: if we assume that thinLTO doesn't change a module, i.e. it won't bring anything new to it, then yes, we can run a pass even before thinLTO
Sorry, I edited my question a ton. It depends on the definition of change: thinLTO will definitely change the module by importing function definitions and running optimizations. I think just using the PreCodeGenPassesHook lambda to call it from thinLTO, where everything is linked in, is the cleanest anyway.
Sorry, I edited my question a ton. It depends on the definition of change: thinLTO will definitely change the module by importing function definitions and running optimizations.
Optimizations are not a concern, importing of new functions is. We can't link two modules which were processed by the spec constants pass, so it has to be run after importing, regardless of optimizations.
@bader @asudarsa @AlexeySachkov @RaviNarayanaswamy @mdtoguchi @maksimsab
Hey all, I just pushed a new commit with a design doc and updated prototype implementation based on the feedback. Please take a look when you have a chance. Thanks!
@sarnex, I've made minor editorial changes, but unfortunately, I can't push them to your branch. Your repository doesn't allow me to create a pull request to your private branch. If you are okay with the changes from these two commits, please, cherry-pick them to your private branch.
For some reason I can't reply inline
You can reply inline here: https://github.com/intel/llvm/pull/15083#discussion_r1769334803. When you publish a review with new discussions (1) and replies to already existing discussions (2), the replies (2) are posted in the original discussion, where you can answer inline, and duplicated in the review without the inline comment option.
Fixed, thx. Github UI confused me.
I'm planning to wait until we implement the new SYCL offloading design before picking this back up, as none of the code I added here will be run at all in that flow.
This pull request is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.
This pull request was closed because it has been stalled for 30 days with no activity.