Implement Schema Evolution for Event Objects
The clang-tidy renaming of event object members means that ldmx-sw >= 4.4.8 is unable to read files that were generated with ldmx-sw <= 4.4.7. This is annoying and a surprise to many of us, since in the past our schema evolution has been handled auto-magically by ROOT's internal schema-evolution detection algorithm.
## Error Message
If you run into this, you'll see many errors from `TBranchElement::InitializeOffsets` complaining about specific member variables "missing" when constructing the branch. For example:

```
Info in <TBranchElement::InitializeOffsets>: TTree created with an older schema, some data might not be copied in 'slow-cloning' mode; fast-cloning should have the correct result. 'trackID_' is missing when constructing the branch 'TrigScintScoringPlaneHits_eat'.
```
In the short term, you should drop back to using v4.4.7.
## Solution
The solution is to actually implement schema evolution; the documentation has improved since the last time I looked at it, which gives me more hope.
Main Reference: https://root.cern/manual/io/#dealing-with-changes-in-class-layouts-schema-evolution
This reference gives us many details that are helpful, but in short, we need to update the LinkDef file that specifies which classes are to be treated as event objects with additional "reading rules" that customize how the object currently in memory is to be updated with the information from the object residing on disk. There are other options, but the documentation discourages using the C++ API for Schema Rules, custom TClass, or custom TStreamer because they are more complicated for most use cases (I assume).
## Plan Outline
Currently, we have CMake generate a single LinkDef file after "registering" classes in various CMakeLists.txt. I don't want to update our CMake "registration" with this custom rule stuff and so I'm instead going to abandon CMake generating the LinkDef and have the LinkDefs committed into the repository. I actually have done this before (when I was experimenting with HDF5 writing fire), so this is not that big of a hurdle. Additionally, it will remove the very confusing double-pass nature of our current CMake infrastructure which few (if any) other developers are aware of. Completed by #1857
Once the LinkDef files are in their Event folders and things are working in that mode, I can then go through the git history to find the source evolution of the event objects. This will inform my decision on minimum-supported event object versions (mapped to minimum-readable ldmx-sw version used to generate) and then I can start writing these reading rules. The good news is that re-naming variables is probably the simplest case of schema evolution that is not able to be handled automatically.
## Alternative Solution
We could never do schema evolution and instead have v4.4.8 be a backwards-compatibility-breaking release. I don't like this since I think we will want to continue to evolve the schema and - upon further reflection and talking with other developers - I think maintaining readability of past samples is very important for validation and comparison to newer samples. I also think we will have other schema evolutions that are more complicated and thus more worthy of being backwards-breaking.
- [x] compile with refactored LinkDef
- [x] `run_test`
- [x] run validation samples with dev build
- [x] run with build directory moved (make sure installation is functional like in production image)
- [x] remove `link off` lines that are ignored in ROOT 6+
## Other notes
- go back to v3.3.3? or v3.0.0? not certain on a good point
- the only thing that matters is "class layout" (i.e. member variables), we've increased the class version on any change, so I think many of the class versions can be treated with the same "rule"
- need to make sure that the auto-evolution happens in addition to any rules I add
I've started working on actually evolving the schema following ROOT's instructions using #pragma read statements in the LinkDef files based off the re-factored branch in #1857 and I have some updates.
## Set Up
I generated a small test file that only has the RunHeader and EventHeaders in it with:

```shell
# outside ldmx-sw
denv init ldmx/pro:v4.4.0
denv fire config.py
```

where config.py is

```python
from LDMX.Framework import ldmxcfg

p = ldmxcfg.Process('gen')
p.maxEvents = 10
p.outputFiles = [ 'output.root' ]
```
I copy this output.root file to ldmx-sw for easy access.
I test reading this file with the updated version of ldmx-sw with the following configuration.

```python
import sys
from LDMX.Framework import ldmxcfg

p = ldmxcfg.Process('ana')
p.inputFiles = sys.argv[1:]
p.sequence = [ ]
p.outputFiles = ['new-copy.root']
p.termLevel = 0
```

Run from within ldmx-sw with

```shell
just fire read.py output.root
```
Checking what the RunHeader looks like without ldmx-sw can be done with a Python script.

```python
import sys
import json

import uproot
import numpy as np


def get_first_run_header(f):
    with uproot.open({f: 'LDMX_Run'}) as run_tree:
        runs = run_tree.arrays(library='np', how=dict)
        return {
            key: val[0]
            for key, val in runs.items()
        }


class NDArrayJSONEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, np.ndarray):
            return list(o)
        elif isinstance(o, np.int32):
            return int(o)
        return super().default(o)


print(json.dumps(get_first_run_header(sys.argv[1]), cls=NDArrayJSONEncoder, indent=2))
```

Run like

```shell
denv python3 print-run-header.py <file.root>
```
For reference, the output.root RunHeader looks like

```
tom@appa:~/code/ldmx/ldmx-sw$ denv python3 print-run-header.py output.root
{
  "runNumber_": 1,
  "detectorName_": "",
  "description_": "",
  "runStart_": 1759519280,
  "runEnd_": 1759519280,
  "numTries_": 10,
  "softwareTag_": "0456db93e0a5faf1ef3a786b8cacd11bc2292951",
  "ldmxsw_version_": "v4.3.6",
  "intParameters_.first": [
    "RandomNumberMasterSeed[gen]"
  ],
  "intParameters_.second": [
    1
  ],
  "floatParameters_.first": [],
  "floatParameters_.second": [],
  "stringParameters_.first": [
    "Pass = gen, version"
  ],
  "stringParameters_.second": [
    "v4.3.6"
  ]
}
```
(Notice that the version number is wrong because we hadn't patched it yet at the time.)
## The Good
Re-naming of "simple" types[^1] works easily "out of the box". For example, I can add the following read statement to `Framework/LinkDef.h` and the referenced `RunHeader` member variables are appropriately copied to their new names from a file generated with the old names.

```cpp
#pragma read \
    sourceClass="ldmx::RunHeader" \
    source="int runNumber_; std::string detectorName_; int runStart_; int runEnd_; int numTries_; std::string softwareTag_;" \
    version="[-5]" \
    targetClass="ldmx::RunHeader" \
    target="run_number_, detector_name_, run_start_, run_end_, num_tries_, software_tag_, int_parameters_" \
    code="{ run_number_ = onfile.runNumber_; detector_name_ = onfile.detectorName_; run_start_ = onfile.runStart_; run_end_ = onfile.runEnd_; num_tries_ = onfile.numTries_; software_tag_ = onfile.softwareTag_; }"
```
and then the new-copy.root file generated with this version of ldmx-sw has those members with the correct values.
```
tom@appa:~/code/ldmx/ldmx-sw$ denv python3 print-run-header.py new-copy.root
{
  "run_number_": 1,
  "detector_name_": "",
  "description_": "",
  "run_start_": 1759519280,
  "run_end_": 1759519280,
  "num_tries_": 10,
  "software_tag_": "0456db93e0a5faf1ef3a786b8cacd11bc2292951",
  "ldmxsw_version_": "v4.3.6",
  "int_parameters_.first": [
    "RandomNumberMasterSeed[ana]"
  ],
  "int_parameters_.second": [
    1
  ],
  "float_parameters_.first": [],
  "float_parameters_.second": [],
  "string_parameters_.first": [
    "Pass = ana, version"
  ],
  "string_parameters_.second": [
    "v4.5.2"
  ]
}
```
Notice that the members that are not referenced (the `std::map` `*Parameters_` members) only have the information from the second `ana` pass and drop the information from the first `gen` pass. This is in line with the documentation: those members in memory are treated as "new" and are default-constructed during the read, while the corresponding members on disk are treated as "deleted" and are ignored during the read.
[^1]: Notice that "simple" includes std::string. This is encouraging.
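To make the copy-vs-default behavior concrete, here is a small pure-Python sketch (no ROOT involved; the dicts and the `apply_read_rule` helper are stand-ins I made up, not real Framework or ROOT API) of how a read rule treats referenced versus unreferenced members:

```python
def apply_read_rule(on_disk: dict, defaults: dict, rename: dict) -> dict:
    """Mimic a ROOT read rule: the in-memory object starts out
    default-constructed, and only the members named in the rule are
    copied over from their on-disk (old) names."""
    in_memory = dict(defaults)  # unreferenced members keep their defaults
    for old_name, new_name in rename.items():
        in_memory[new_name] = on_disk[old_name]
    return in_memory


# on-disk RunHeader written by old ldmx-sw (old member names)
on_disk = {
    'runNumber_': 1,
    'intParameters_': {'RandomNumberMasterSeed[gen]': 1},
}
# default-constructed in-memory layout (new member names)
defaults = {'run_number_': -1, 'int_parameters_': {}}
# the rule only mentions the "simple" member
rename = {'runNumber_': 'run_number_'}

header = apply_read_rule(on_disk, defaults, rename)
print(header)  # run_number_ copied; int_parameters_ stays empty (gen info dropped)
```

This is exactly the pattern in the two JSON dumps above: the renamed simple members survive the copy, while the unreferenced map members only carry what the `ana` pass itself added.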
## The Bad
In my internet sleuthing, I found https://root-forum.cern.ch/t/manual-schema-evolution-with-i-o-rules-and-branches-containing-vector-t/64026 which makes me concerned that many of our trees will require more care, because we have a lot of branches that are `std::vector`s of custom objects. I haven't tested this yet; just noting it.
The referenced issue is not a problem for RNTuple, so this might push us into transitioning to RNTuple and implementing a separate read+copy mechanism to transition old data stored as TTree from the old classes into new data stored as RNTuple with the new classes.
## The Ugly
Attempting the simple re-name solution that works for simple types on the `std::map` members creates a seg-fault in code that ROOT auto-generates during dictionary creation. This leaves me with a few options:

- Avoid referencing the `std::map` on disk as a `std::map` and try to parse its members manually. (Currently investigating, but the address arithmetic is ugly.)
- Write a custom `TStreamer`. This is more address arithmetic and thus would probably be difficult for me to do properly.
- Have an intermediate class in memory that is a copy of the old Run/Event Header class, read into that class, and then copy the data into the new class after it has been brought into memory. This potentially has performance issues, but would probably be easier for me to implement than the custom `TStreamer`.

The good news is that the only other event class that has a `std::map` in it is SimCore/EventWeights.h, which is only present in GENIE runs and therefore is less prevalent.
Alright, I've got a hacked-together solution for the RunHeader: it keeps a copy of the parameters objects under the old member variable names so they can be read from the file, and then the schema evolution rule copies those objects to the new names. This is annoying because we are carrying (and then writing) two copies of the parameters members (e.g. both `int_parameters_` and `intParameters_` are written to the output file), but at least this does not lose data, and we can abandon the old format at a later date by removing `intParameters_`.
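A pure-Python caricature of that work-around, to show why both copies end up in the output file (the real change lives in the C++ RunHeader and its LinkDef rule; the class and method names here are illustrative, not real Framework API):

```python
class RunHeader:
    """Sketch of the hack: keep the member under its old name so the file
    can be streamed into it, then copy into the new name after reading."""

    def __init__(self):
        self.intParameters_ = {}   # old name: what ROOT reads from old files
        self.int_parameters_ = {}  # new name: what current code uses

    def migrate(self):
        # stands in for the schema-evolution rule's code= block
        self.int_parameters_ = dict(self.intParameters_)

    def members_written(self):
        # both copies are serialized until intParameters_ is removed
        return sorted(vars(self).keys())


rh = RunHeader()
rh.intParameters_ = {'RandomNumberMasterSeed[gen]': 1}  # "read from file"
rh.migrate()
print(rh.members_written())  # ['intParameters_', 'int_parameters_']
```

The double storage is the price of not losing data: a later release can delete the old-name member once no supported files still need it.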
However, this same strategy does not work with the EventHeader. I have a suspicion that it is due to sharing addresses between the input and output trees. Because the input tree has the old schema, only the old schema is read and written in this "slow cloning" mode. More investigation to come...
https://github.com/LDMX-Software/ldmx-sw/blob/5385124f91abc6a5186ba5cad76d352df08f2566/Framework/src/Framework/EventFile.cxx#L202
https://github.com/LDMX-Software/ldmx-sw/blob/5385124f91abc6a5186ba5cad76d352df08f2566/Framework/src/Framework/EventFile.cxx#L312
I started a ROOT Forum question about the std::map stuff. https://root-forum.cern.ch/t/schema-evolution-renaming-a-std-map-member-variable/64310
I'm not sure if the EventHeader stuff is related or not, but we will see. If it's not, I'll probably make another ROOT forum post about this "slow cloning mode + schema evolution" stuff, but I'll want to construct a reproducible example somehow.
Repository for small-scale replication of schema evolution issues I'm observing to try to motivate other Framework changes: https://github.com/tomeichlersmith/ldmx-root-schema-evolution-testbench
Unfortunate news: there are two issues that are blocking my progress on implementing schema evolution and forcing me to seriously consider just dropping the event objects back to their old names and using `// NOLINT(readability-*)` to avoid their names being changed in the future.

- The renaming of the `std::map` member variables appears to be an issue within ROOT itself. https://its.cern.ch/jira/browse/ROOT-9862 This means that even if it gets patched in a timely manner, we'd need to move to a new version of the dev image with the updated (patched) version of ROOT. This issue alone would not have stopped me; we've pushed users to new images before and it works fine. One of the reasons using images is helpful.
- Applying schema evolution while slow-cloning is quietly failing. https://root-forum.cern.ch/t/quiet-writing-failure-when-applying-schema-evolution-while-slow-cloning/64350 This is the bigger issue in my opinion. I can apply schema evolution and write out updated versions only if I don't use `CloneTree`. Without `CloneTree`, the branches that are not "observed" (i.e. interacted with directly via `event.get*`) would not be copied. This would be a major change in the behavior of the ldmx-sw Framework. In the past, we have always copied all branches by default in order to be conservative and avoid "losing" data by keeping many copies of it around.
I'll keep watch on my ROOT forum posts, but I'm not optimistic. The second (bigger) one has gone quiet after a few comments from a ROOT developer which did not lead to a solution.
> seriously consider just dropping the event objects back to their old names and using `// NOLINT(readability-*)` to avoid their names being changed in the future.

how about the other way around: keeping everything as is and not supporting schema evolution, and enforcing that from now on the member variables follow the rules, so we never get into this issue again?
I don't want to do that because then folks are unable to utilize files generated by past ldmx-sw. Sometimes these files took a long time to generate and validate (e.g. the leftovers from a 4e14 Ecal PN sample) and so dropping support for them would force a lot of development to occur based off an old version of ldmx-sw.
i.e. I'd only want to do that if we can faithfully abandon support for all files generated with ldmx-sw <= v4.4.7. This abandonment would require more buy-in from other developers which we haven't gotten.
That's fair, but I really think it's time to generate a new EcalPN anyway (with v15)
I also think we should generate a new Ecal PN, but I think it's also vital to maintain readability of old files. If only for the short-term goal of validating and comparing the v15 Ecal PN. Running the same DQM code on both the old and new samples is the best way (in my opinion) to validate and compare and this can only happen if both samples are readable.
> If only for the short-term goal of validating and comparing the v15 Ecal PN.

I assume the validations don't require E14 events; if reasonable stats suffice, then one can just produce a sample with v14 geom vs v15 geom, and in that case we will have the same ldmx-sw version too, so it will really be an apples-to-apples comparison.
## 🎃 good halloween news 🎃
I think I am able to do the copy (including schema evolution) as long as I load the ROOT dictionary for all of the event objects in the input `TTree`[^1]. Avoiding `CloneTree` and `CopyAddresses` and instead implementing my own (manual) form of address syncing appears to be working in my small test cases. https://root-forum.cern.ch/t/quiet-writing-failure-when-applying-schema-evolution-while-slow-cloning/64350/11
I will look into mapping this solution into ldmx-sw and testing more complicated schema evolutions next week.
[^1]: This is not a high hurdle to clear. In fact, before #1857 we always loaded all of the event object dictionary entries because they were all compiled into the same dictionary that was loaded with Framework. I think I can just update the ldmxcfg.py to include all of the *_Event libraries.
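The shape of that manual copy, sketched in pure Python (lists of dicts stand in for TTrees; `manual_copy` and the member names are illustrative stand-ins, not real ROOT or Framework API): instead of CloneTree's byte-for-byte clone, every entry is read into memory, where the evolution rules fire, and then written out, so every branch is still copied.

```python
def manual_copy(input_tree, evolve):
    """Entry-by-entry copy: read each entry into memory (this is where
    schema-evolution rules get applied), then write it to the output.
    Every branch is copied, unlike the drop-unobserved-branches fallback."""
    output_tree = []
    for entry in input_tree:
        in_memory = evolve(entry)      # evolution happens at read time
        output_tree.append(in_memory)  # manual "address syncing" + fill
    return output_tree


# a two-entry "tree" using an old-style member name as an example
old_tree = [{'eventNumber_': 1}, {'eventNumber_': 2}]
rename = lambda e: {'event_number_': e['eventNumber_']}
new_tree = manual_copy(old_tree, rename)
print(new_tree)  # [{'event_number_': 1}, {'event_number_': 2}]
```

The cost relative to fast cloning is that every entry is deserialized and reserialized, but that is exactly what makes the read rules run.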
that's great news, @tomeichlersmith ! i strongly prefer a solution where we fix what was broken, in this case, over enforcing migration to a new version across the board.
some text driving home points you (@tomeichlersmith @tvami @vdutta) will already know/agree with to varying degree:
as you are all aware, i worry about the maintainability, stability and longevity of our code base, and the associated data, a lot. i do this out of respect for people's time. if someone has worked hard -- even if slowly by someone else's standards -- on implementing an analysis strategy, a selection algorithm, or similar, it is really frustrating to not be able to use it or share it to our code base because the ground is constantly shifting under one's feet. similarly, the last samples to be produced in a production cycle still take a considerable amount of time to analyze, and shouldn't age too fast.
breaking backwards compatibility is unavoidable in a long-term project. it is also highly undesirable in a long-term project, for the above reasons, but also because it leads to reprocessing of an ever-increasing amount of data. so: we need to pick our few, breaking changes with care. however neat, enforcing a naming convention is in my mind in itself not a worthy cause. the situation we are in now has to be resolved, and i'm sure it will be, without resorting to either scrapping all old data or scrapping the variable renaming. i'm really thankful for the great work you are putting into finding a solution, Tom.
so in the bigger picture, how do we choose these few occasions where it is, in fact, worth it? this is not one person's call; any breaking change needs to be discussed in a larger group. a sw-dev meeting is a good forum for this. and, ideally we notice beforehand when a breaking change is introduced.
## some things that might help spot breaks
i think there are lessons to be learned from this surprising break of backwards compatibility.
- review of major code overhauls needs to involve more than the 2 people who develop on the most frequent basis. for now i suggest that a sw/comp coordinator is pulled in as well. then we can't complain after the fact ;)
- we are using our validation suite which uses analyzers, and they don't care about the name of a member variable returned by a getter. this points to larger stability on the analysis side from staying within the framework for as long as possible. indeed this is the part we are and can be held responsible for, and thus is the use case we can be expected to always support. i do think we need to consider more seriously what would need to be different for most people to want to use the framework for making analysis histograms.
- can we include opening a file produced with an old version of ldmx-sw (from the latest breaking change version or so) as a test in our CI? in contrast to gold histograms, it doesn't need to contain more than 2 events but should include as many collections as possible, and catch when the code in a PR makes old data unreadable
- actually, bonus points for opening that same file with uproot (or whatever ships with the container) and seeing if updates on that end need to be migrated into the CI validation script --> this could become a recipe file for people wanting to use uproot (much like I use the CI configs to show students how to run simulation etc)
Thank you for these words @bryngemark - you are correct, I agree with them. A few follow-up comments:
> major code overhauls
I'm not opposed to bringing in more reviewers. What is confusing to me is the word "major". In some sense, we got into this mess because neither @tvami nor I considered applying our variable naming rules to be "major". It seemed like a minor change, one that I did not recognize as breaking until I tried to use my own old samples. You can actually view my notes from when I discovered this breaking change, which I find slightly entertaining if naive: "Tried this because maybe the clang-tidy run..." (emphasis added)
> can we include opening a file produced with an old version of ldmx-sw
I am in full support of this. In fact, I am hoping to develop such a producer within https://github.com/tomeichlersmith/ldmx-root-schema-evolution-testbench for the time being so it can reside within a different (old ldmx-sw version) denv. I can commit the old file into ci-data and then include it as part of our tests.
> bonus points for opening the same file with uproot
Currently, there are a few members of a few event objects that are unable to be interpreted by uproot. The complete list (to my knowledge) is

- `ldmx::Track::trackStates_` (now `track_states_`), observed by me in the ldmx-eat-tracking repo
- `ldmx::HgcrocPulseTruth::compositePulse_.hits_` (now `composite_pulse_.hits_`), observed in https://github.com/LDMX-Software/ldmx-sw/pull/1740
Both of these situations have more nesting than is common in our event objects. Specifically, they both have structure stored within a `std::vector` that is itself a member of a structure that is stored in a `std::vector`. When serializing this doubly-nested branch, ROOT appears to fall back to member-wise splitting, which is not supported by uproot https://github.com/scikit-hep/uproot5/issues/38 . I am unsure if this member-wise splitting could be avoided by changing parameters to `TTree` when we create the branches or if it's inherent in serializing such a complicated structure. I am also unsure if this problem applies to RNTuple as well.
In any case, we can certainly check all of the other branches, again leaving some pseudo-documentation of how to access LDMX event files from uproot.
Observed a ROOT bug when attempting to evolve a `std::map` (like how our SimParticles are stored). https://github.com/root-project/root/issues/20421
Just wanted to put in this update as I step away to focus on slice test work. I will return to this in December. ROOT is planning to release 6.38.00 by the end of November so keep your fingers crossed that this bug is patched in that release. If it isn't, I will look at dropping the SimParticle to its old member variable names for the time being.
Probably should be its own issue, but I think there would be value in not having the SimParticle as a map in the first place :D
(but that's breaking even more compatibility I guess, as LK mentioned about not sure when it's a good time to do that [but maybe now is since we already broke a lot of stuff? maybe never...?])