FCCAnalyses [WIP] EDM4hep DataSource

DataSource providing EDM4hep collections. This allows writing analyzers using EDM4hep collections directly. Example:

    struct selPDG {
      selPDG(const int pdgID, const bool chargeConjugateAllowed = false);
      const int m_pdg;
      const bool m_chargeConjugateAllowed;
      edm4hep::TrackCollection operator() (
          const edm4hep::MCRecoTrackParticleAssociationCollection& inAssocColl);
    };
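A minimal sketch of how such a functor could be implemented. To keep it compilable without EDM4hep, the collection and object types below are hypothetical stand-ins (plain structs and std::vector), and the getter names getRec, getSim and getPDG are assumptions about the interface rather than a statement of the EDM4hep API. The constructor check on the PDG ID is also an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical stand-ins: they mimic only the small part of the interface
// this sketch needs, so that the example compiles without EDM4hep.
struct MCParticle {
  int pdg;
  int getPDG() const { return pdg; }
};
struct Track {
  int id;
};
struct Association {
  Track rec;
  MCParticle sim;
  const Track& getRec() const { return rec; }
  const MCParticle& getSim() const { return sim; }
};
using AssociationCollection = std::vector<Association>;
using TrackCollection = std::vector<Track>;

struct selPDG {
  selPDG(const int pdgID, const bool chargeConjugateAllowed = false)
      : m_pdg(pdgID), m_chargeConjugateAllowed(chargeConjugateAllowed) {
    // Assumption of this sketch: 0 is not a meaningful PDG ID to select on.
    assert(pdgID != 0);
  }
  const int m_pdg;
  const bool m_chargeConjugateAllowed;

  // Return the tracks whose associated MC particle has the requested PDG ID
  // (or its charge conjugate, if allowed).
  TrackCollection operator()(const AssociationCollection& inAssocColl) {
    TrackCollection result;
    for (const auto& assoc : inAssocColl) {
      const int pdg = assoc.getSim().getPDG();
      const bool match = m_chargeConjugateAllowed
                             ? std::abs(pdg) == std::abs(m_pdg)
                             : pdg == m_pdg;
      if (match) {
        result.push_back(assoc.getRec());
      }
    }
    return result;
  }
};
```

With real EDM4hep types the loop body would look similar, but how the output collection is filled depends on the podio collection semantics, which this sketch does not model.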

Two implementations are provided: frame and legacy. To run the analysis using these analyzers, pass --use-source or --use-legacy-source.

Implementation: The class keeps three vectors: m_Collections, m_podioReaders and m_frames. Thread safety: questionable.
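One common way to make a data source of this shape safe for concurrent use is to give every processing slot its own reader and frame, so that no two threads ever touch the same element. The sketch below illustrates only that pattern. Whether the vectors in this PR are indexed per slot is an assumption, and FakeReader, FakeFrame and PerSlotSource are hypothetical names, not podio or FCCAnalyses types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for a podio reader and a podio frame.
struct FakeReader {
  std::size_t lastEntry = 0;
};
struct FakeFrame {
  std::size_t entry = 0;
};

// Keeps one reader and one frame per slot. A thread that only ever uses
// its own slot index never shares mutable state with another thread.
class PerSlotSource {
public:
  explicit PerSlotSource(std::size_t nSlots)
      : m_readers(nSlots), m_frames(nSlots) {}

  // Load `entry` into the state owned by `slot`; no other slot is touched.
  void setEntry(std::size_t slot, std::size_t entry) {
    assert(slot < m_frames.size());
    m_readers[slot].lastEntry = entry;
    m_frames[slot].entry = entry;
  }

  // Read access to the frame owned by `slot`.
  const FakeFrame& frame(std::size_t slot) const {
    assert(slot < m_frames.size());
    return m_frames[slot];
  }

private:
  std::vector<FakeReader> m_readers;
  std::vector<FakeFrame> m_frames;
};
```

The vectors are sized once in the constructor and never resized afterwards, which is what keeps references into them valid while several slots are being processed.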

Aug 07 '23 13:08 kjvbrt

This will be very nice to have, thanks! I'll take a closer look this week. For the naming, I'd suggest "DataSource" instead of "Source". This should also eventually move to the edm4hep repository, right?

Aug 14 '23 07:08 vvolkl

Thanks for the suggestion, I renamed the Source to DataSource.

Aug 15 '23 09:08 kjvbrt

Can we split the DataSource itself off of this and get it into edm4hep directly? I tested it a bit and it looked ready to use?

Apr 11 '24 07:04 Zehvogel

If putting the datasource into EDM4hep would be (more or less) easily possible, I would also be in favor of having it there. That should make maintenance a bit easier and we would see earlier if we break it when we make changes in EDM4hep.

Apr 11 '24 07:04 tmadlener

I only had a quick look at some of the files and made a few smaller comments.

Thanks, this was just a work-in-progress commit...

Overall, what is the (longer term) plan for FCCAnalysis for switching to the podio datasource? This PR suggests that the "old way" and the "new way" will coexist, at least for some time. Since EDM4hep has changed enough that newer files cannot be easily read with older versions of FCCAnalysis, I am wondering whether it makes sense to have the two things in parallel, or whether it would be easier to simply make a clean cut and move everything to the data source way.

I would like to ditch the "old way" at some point, but I'm not sure we can make a clean cut quickly. There is a lot of analysis code which uses the "old way" and also the performance of the datasource way needs to be better understood.

Some of the functionality might also be general enough for it to be moved to EDM4hep? E.g. getting the pt (or four momentum) for any datatype that has a momentum (and at least a mass or energy).

Upstreaming some of the functionality to EDM4hep is a good point. How would you handle generating those functions? With C++ templating or something like podio (Jinja2)? Long term, I think it would be good to separate the C++ libraries from the Python orchestration and source the appropriate stack only once we know which files the analysis will operate on.
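To make the templating option concrete, here is a minimal sketch of generic helpers that work for any type exposing a momentum getter. The types FakeMCParticle and FakeRecoParticle, the Vector3f layout, and the getter names getMomentum and getMass are hypothetical stand-ins chosen for illustration, so the example compiles without EDM4hep.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for two datatypes that both carry a momentum.
struct Vector3f {
  float x, y, z;
};
struct FakeMCParticle {
  Vector3f momentum;
  double mass;
  const Vector3f& getMomentum() const { return momentum; }
  double getMass() const { return mass; }
};
struct FakeRecoParticle {
  Vector3f momentum;
  float energy;
  const Vector3f& getMomentum() const { return momentum; }
  float getEnergy() const { return energy; }
};

// Transverse momentum for any type with a getMomentum() that returns
// something with x and y members.
template <typename T>
double pt(const T& particle) {
  const auto& mom = particle.getMomentum();
  return std::hypot(mom.x, mom.y);
}

// Energy from momentum and mass, for any type that also has getMass().
template <typename T>
double energyFromMass(const T& particle) {
  const auto& mom = particle.getMomentum();
  const double mass = particle.getMass();
  assert(mass >= 0.0);
  return std::sqrt(mom.x * mom.x + mom.y * mom.y + mom.z * mom.z + mass * mass);
}
```

The same pt template serves both stand-in types, while energyFromMass only compiles for types that provide getMass(). A code-generation approach would instead emit one concrete function per datatype.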

Sep 20 '24 13:09 kjvbrt

also the performance of the datasource way needs to be better understood.

I think unfortunately the performance will always be worse. Podio gives us an AoS style access in memory while when we access the ROOT file directly we have SoA. I.e. using the podio datasource will potentially always result in also reading unneeded data. Unless I am confusing something... Maybe @m-fila wants to comment? :)

Sep 20 '24 14:09 Zehvogel

There is a lot of analysis code which uses the "old way" and also the performance of the datasource way needs to be better understood.

Good point. Didn't think of that.

I think unfortunately the performance will always be worse. Podio gives us an AoS style access in memory while when we access the ROOT file directly we have SoA. I.e. using the podio datasource will potentially always result in also reading unneeded data. Unless I am confusing something... Maybe @m-fila wants to comment? :)

There are a few things to consider. The major difference is that the podio data source currently has no way to read only selected branches / collections. It will always read the full event. On top of that, reconstructing the relations takes a bit of time. At some point that might be an acceptable overhead, but how large the analysis has to be for that is unclear to me. There are also almost certainly some optimizations that can be (or are already) done if the podio middleman is cut out of the whole thing. The memory layout is another thing, but that might even be something we can change in podio, because the interface should give us enough abstraction to switch the implementation below from AoS to SoA, but this has to be investigated further.
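For readers less familiar with the AoS / SoA distinction, a minimal sketch of the two layouts follows. The types and function names are hypothetical and do not represent podio's or ROOT's actual memory layout. The point is only that summing one field walks over every full object in the AoS case, while in the SoA case it touches a single contiguous column.

```cpp
#include <cassert>
#include <vector>

// Array of structures (AoS): all fields of one particle sit next to each other.
struct ParticleAoS {
  float px, py, pz, energy;
};

// Structure of arrays (SoA): one contiguous column per field.
struct ParticlesSoA {
  std::vector<float> px, py, pz, energy;
};

// Convert an AoS container into the column-wise SoA layout.
ParticlesSoA toSoA(const std::vector<ParticleAoS>& particles) {
  ParticlesSoA out;
  for (const auto& p : particles) {
    out.px.push_back(p.px);
    out.py.push_back(p.py);
    out.pz.push_back(p.pz);
    out.energy.push_back(p.energy);
  }
  return out;
}

// AoS: the loop strides over whole particles even though only energy is used.
float sumEnergyAoS(const std::vector<ParticleAoS>& particles) {
  float sum = 0.f;
  for (const auto& p : particles) {
    sum += p.energy;
  }
  return sum;
}

// SoA: only the energy column is read; px, py and pz are never touched.
float sumEnergySoA(const ParticlesSoA& particles) {
  assert(particles.energy.size() == particles.px.size());
  float sum = 0.f;
  for (const float e : particles.energy) {
    sum += e;
  }
  return sum;
}
```

Both functions return the same value. The difference lies in how much memory has to be read to compute it, which is the overhead being discussed here.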

Sep 20 '24 14:09 tmadlener