LTO prevents SimCore Factory from creating objects
Describe the bug The SimCore plugin factory is found unaware of declared plugins when LTO is enabled.
To Reproduce Steps to reproduce the behavior:
just configure-clang-lto
just build
just fire SimCore/test/basic.py
and then you see
[ fire ] 0 fatal: [SimFactory] : An object named simcore::TrackerSD has not been declared.
at /home/tom/code/ldmx/ldmx-sw/SimCore/include/SimCore/Factory.h:274 in make
Desired behavior It'd be cool to be able to enable LTO.
Additional context So, there are actually two different plugin factories in ldmx-sw.
- Framework/PluginFactory is a C-style plugin manager that forces registration to happen at library-load-time using compiler attributes (like
__attribute(init_priority(500))and__attribute__(constructor(1000))) - SimCore/Factory is a templated C++17 style factory that uses cleverly-placed
staticvariables to force certain behaviors at library-load-time.
The Framework/PluginFactory appears to be working when LTO is enabled since we get to attempting to create the TrackerSD. This implies that the Framework/PluginFactory was already used to declare and create the simulation processor. This causes me to focus attention on SimCore/Factory which works out since I originally designed it. I placed some printouts including the address of the Factory so we can observe the separate Factories being created for the different types. This is what I see.
denv fire SimCore/test/basic.py
---- LDMXSW: Loading configuration --------
Factory(0x7ff2238562b8)
Factory(0x7ff2238562b8): declare simcore::BertiniAtLeastNProductsModel
Factory(0x7ff2238562b8): declare simcore::BertiniModel
Factory(0x7ff2238562b8): declare simcore::BertiniNothingHardModel
Factory(0x7ff2238562b8): declare simcore::BertiniSingleNeutronModel
Factory(0x7ff2238562b8): declare simcore::NoPhotoNuclearModel
Factory(0x7ff22383d4b0)
Factory(0x7ff22383d4b0): declare simcore::EcalSD
Factory(0x7ff22383d4b0): declare simcore::HcalSD
Factory(0x7ff22383d4b0): declare simcore::ScoringPlaneSD
Factory(0x7ff22383d4b0): declare simcore::TrackerSD
Factory(0x7ff22383d4b0): declare simcore::TrigScintSD
Factory(0x7ff223816538)
Factory(0x7ff223816538): declare simcore::generators::GeneralParticleSource
Factory(0x7ff223816538): declare simcore::generators::LHEPrimaryGenerator
Factory(0x7ff223816538): declare simcore::generators::MultiParticleGunPrimaryGenerator
Factory(0x7ff223816538): declare simcore::generators::ParticleGun
---- LDMXSW: Configuration load complete --------
---- LDMXSW: Starting event processing --------
[ RunManager ]: Parallel worlds physics list has been registered.
[ RunManager ]: Parallel worlds have been enabled.
Factory(0x7ff2270bcfe0)
Factory(0x7ff2270bcfe0): make simcore::TrackerSD
[ fire ] 0 fatal: [SimFactory] : An object named simcore::TrackerSD has not been declared.
at /home/tom/code/ldmx/ldmx-sw/SimCore/include/SimCore/Factory.h:274 in make
~Factory(0x7ff2270bcfe0)
~Factory(0x7ff223816538)
~Factory(0x7ff22383d4b0)
~Factory(0x7ff2238562b8)
So LTO does not prevent the declaration process from happening (which is what I assumed was going wrong), but - for some reason - a new Factory is created when attempting to create the TrackerSD instead of using the one that was already created during library-load for declaration. My guess is that I'll need to read-up on the actual meaning of static variable in the context of a static class member function. Oofdah.
I'll probably try to create a separate testing area since it takes so long to recompile after changing the templated Factory.h header. See my gist about this Factory for more links and probably where I'll put results.
I think you will see a similar effect if you build with clang instead of GCC
I think you will see a similar effect if you build with clang instead of GCC
Yes, I can confirm that. But takes similarly long time.
I was able to create a very small repo reproducing the error: https://github.com/tomeichlersmith/ldmx-factory-test-bench much faster to compiler and play around with :smile:
I thought I was able to figure out what was happening. Look at the repo I created above for the details, but the TLDR is that static variables are library-local unless we explicitly tell the linker to allow for dynamic symbols (i.e. sharing/exporting symbols) such that the library we load can share the same singleton as the executable. I don't know why this was only showing itself as an error when trying to enable LTO since I also can observe the error without LTO.
However, while this appears to replicate the issue observed here and is resolved by adding the ENABLE_EXPORTS property to the executable, applying this strategy here does not work. I made the following edits and then tried rebuilding with clang and LTO and the same error occurrs.
diff --git a/Framework/CMakeLists.txt b/Framework/CMakeLists.txt
index e6e30670..10e1abb8 100644
--- a/Framework/CMakeLists.txt
+++ b/Framework/CMakeLists.txt
@@ -81,7 +81,8 @@ set_target_properties(
Framework
PROPERTIES CXX_STANDARD 17
CXX_STANDARD_REQUIRED YES
- CXX_EXTENSIONS NO)
+ CXX_EXTENSIONS NO
+ ENABLE_EXPORTS TRUE)
# the following writes the LinkDef file using the CMake global variables
# namespaces and dict that are lists appended to by the register_event_object function
@@ -128,6 +129,7 @@ install(FILES ${CMAKE_CURRENT_BINARY_DIR}/libFramework_rdict.pcm DESTINATION lib
add_executable(fire ${PROJECT_SOURCE_DIR}/app/fire.cxx)
target_link_libraries(fire PRIVATE Framework::Framework)
install(TARGETS fire DESTINATION ${CMAKE_INSTALL_PREFIX}/bin)
+set_property(TARGET fire PROPERTY ENABLE_EXPORTS TRUE)
# Setup the test
setup_test(dependencies Framework::Framework)
diff --git a/SimCore/CMakeLists.txt b/SimCore/CMakeLists.txt
index c80435e4..b66eed99 100644
--- a/SimCore/CMakeLists.txt
+++ b/SimCore/CMakeLists.txt
@@ -92,7 +92,8 @@ setup_library(module SimCore
set_target_properties(SimCore
PROPERTIES CXX_STANDARD 17
CXX_STANDARD_REQUIRED YES
- CXX_EXTENSIONS NO)
+ CXX_EXTENSIONS NO
+ ENABLE_EXPORTS TRUE)
setup_python(package_name LDMX/SimCore)
diff --git a/cmake/BuildMacros.cmake b/cmake/BuildMacros.cmake
index 3716712e..c2bd1f55 100644
--- a/cmake/BuildMacros.cmake
+++ b/cmake/BuildMacros.cmake
@@ -143,6 +143,7 @@ macro(setup_library)
enable_compiler_warnings(${library_name})
enable_ipo(${library_name})
enable_clang_tidy(${library_name} ${WARNINGS_AS_ERRORS})
+ set_property(TARGET ${library_name} PROPERTY ENABLE_EXPORTS TRUE)
# Define an alias. This is used to create the imported target.
set(alias "${setup_library_module}::${setup_library_module}")
https://medium.com/@yakupcengiz/how-static-variables-behave-across-multiple-libraries-14b2a94259a8
I am back at the stage of stumbling around trying dumb things and I have stumbled into something that appears to work. Compiling the SDs into SimCore allows the factory to function.
Currently, we compile SDs into a separate library libSimCore_SDs.so that is then loaded. If we abandon this isolation and instead compile them into libSimCore.so with the tools where the Factory::make call is made, then we successfully register and create the SDs (and then fail at the next dynamically registered object that is compiled into a separate area).
I managed to find a way that only works with clang and not with GCC which is fun and is enough for today.
So, I believe the core of the problem is the very intricate dependency hierarchy within SimCore. In the current state, GCC is able to resolve the various factories into their singletons when linking so that it functions when loading the dynamic libraries and Clang is not able to do this. I am unable to replicate such an intricate network of includes in my simpler repo and so I am reduced to just trying different options here.
- one mega library
- better organize the hierarchy of includes
I would prefer option (2) but I don't know if its possible so I'm going to try it first.
Alright, being more consistent with the dependency flow appears to be working :tada: but I haven't checked that the Biasing module still works given that the base UserAction class is necessarily compiled into SimCore.
Update 😢
More consistently defining the dependency tree allowed for the factory-objects from within SimCore to be functional for both GCC and Clang, but this does not resolve the Biasing UserActions which still only work with GCC.
Update 2 :tada:
Defining the static Factory in the Prototype class's TU allows for there to be a single definition that is shared across all the libraries. I will keep testing as I clean up the code, but I am cautiously optimistic.
To Do
- [x] make sure Biasing UserActions still function with both compilers
- [x] since base classes have been moved into their libraries, put them in the correct namespace (e.g.
XsecBiasingOperatorshould be insimcore::biasoperators::) - [x] can we remove the
ENABLE_EXPORTS TRUE? I just want to simplify the CMake if its not needed and I believe its not needed because of my understanding of how libraries link (they already export symbols). - [x] remove trace debugging and static counter from the Factory class
- [x] undo justfile edits that were just for me wanting to test two different compilers at the same time
- [x] replace all instances of a factory with Factory? (mainly would be moving SimCore/Factory into Framework and replacing the PluginFactory with it).
Now that this is resolved, I think we should consider the LTO build to be part of the default compilation, what do you think @tomeichlersmith ?
@EinarElen ? I think the idea is that it would provide run-time speed improvements? But I'm unsure on how much change we would see given most of our CPU time is taken by Geant4...
There are two things that LTO would do for us:
One is that some bugs that are really nasty can really only be diagnosed with it (ODE violations)
The other is the perf improvement opportunities. And a big part of geant4s execution is the user stepping actions, which are on our side of things.
The price you pay is that linking becomes much slower which can be painful when developing
We should definitely consider LTO for production. I would even suggest building production images with Geant4 itself built with LTO
OK then maybe the status quo is fine, where we have a dedicated LTO CI but not make it the default!