cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Crash from "DaqProvenanceHelper" when file created with "ProducerSourceFromFiles"

Open pfs opened this issue 1 year ago • 20 comments

For HGCAL test beam activities we are using a class derived from FWCore/Sources/interface/ProducerSourceFromFiles.h

We have followed the guidelines from https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideInputSources and are able to create a standard FEDRawDataCollection.

However when the output EDM file is re-used as input for another process the process crashes before processing any event with the following error

cmsRun: src/FWCore/Sources/src/DaqProvenanceHelper.cc:127: void edm::DaqProvenanceHelper::fixMetaData(std::vector<edm::ProcessConfiguration>&, std::vector<edm::ProcessHistory>&): Assertion `!newPCs.empty()' failed.

A minimal working example to reproduce the crash is here

cmsrel CMSSW_14_1_0_pre4
cd CMSSW_14_1_0_pre4/src
cmsenv
git cms-merge-topic -u pfs:test/producer_crash
scram b -j 8
cd UserCode/ProducerFromFileTest
sh test/run_test.sh

Unsure if this should be marked as an issue or sent as a question on cmsTalk, so apologies if I should use the former.

Tagging also @hqucms @IzaakWN @yulunmiao @ywkao

pfs avatar Jun 04 '24 21:06 pfs

cms-bot internal usage

cmsbuild avatar Jun 04 '24 21:06 cmsbuild

A new Issue was created by @pfs.

@sextonkennedy, @smuzaffar, @antoniovilela, @rappoccio, @Dr15Jones, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Jun 04 '24 21:06 cmsbuild

assign core

makortel avatar Jun 04 '24 21:06 makortel

@wddgit

makortel avatar Jun 04 '24 21:06 makortel

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Jun 04 '24 21:06 cmsbuild

I'll take a closer look at this tomorrow.

wddgit avatar Jun 04 '24 22:06 wddgit

@pfs

I can reproduce the assert failure. I'll continue working on that. I've only been in this part of the code once to fix a bug 4 years ago, so it may take a while.

I'm assuming this workflow is something new and that it has never worked before (not that this is a situation where something recently changed underneath it and broke it).

One comment about the test. This is not the cause of the assert. I'm still working on that. EDAlias should only be in the configuration with the producer. In that process, you can use either the original module label or the alias to get the product. In the output module, one of the two is kept (you can choose either). In subsequent processes, you cannot use EDAlias again for products from an earlier process. You must use the kept module label to get the product in later processes. So in run_consumer.py that EDAlias parameter is not going to accomplish anything and you must use the kept module label to get the product in the analyzer.

wddgit avatar Jun 05 '24 15:06 wddgit

Thanks a lot! From CMSSW I now realize that this type of producer (from files) have been used by HCAL for test beams - but maybe they haven't been exercised lately and this issue didn't come up.

On "EDAlias should only be in the configuration with the producer": good catch. I just copy pasted to run_consumer.py without thinking too much I can amend it.

pfs avatar Jun 05 '24 15:06 pfs

I am still working to remember or figure out how this is supposed to work, but this avoids the assert if added to the PoolSource in the consuming configuration:

labelRawDataLikeMC = cms.untracked.bool(False)

I'm not sure if that is the correct solution yet.

wddgit avatar Jun 05 '24 19:06 wddgit

Kind of arcane.

The labelRawDataLikeMC is a backward compatibility feature for reading data written with a release dating back to around 2013. If you read an input file made with a modern release and containing a product with type "FEDRawDataCollection" and module label "source" and have the feature turned on (the default), then it will fail with the assert that you encountered. I don't think turning it off is a problem in your use case and it will completely avoid that assert. (In very old data there is a module label conversion that needs to be done, there is a purpose to it).

If someone more familiar with how FEDRawDataCollection is currently created can verify the following, I would appreciate it. I'm not sure about this, but I think people do not commonly run into this problem because "FEDRawDataCollection" is usually written with a module label "rawDataCollector" (not "source"). I think this is convention and I don't know what problems is causes if you don't do that. I don't think the Framework cares, but modules configured with InputTags probably do. If no one confirms this, I'll continue to dig into this. Possibly an EDProducer is producing them or maybe an EDAlias is used and the "rawDataCollector" label is kept. I'll keep digging if no one confirms this.

wddgit avatar Jun 05 '24 21:06 wddgit

labelRawDataLikeMC = cms.untracked.bool(False)

I confirm this could work temporarily as a workaround. That already unblocks a few things on our side while we the cause is being diagnosed. Thanks a lot.

pfs avatar Jun 06 '24 12:06 pfs

Below are two relevant emails from 2011 and 2012 from the hyper news that explain the normal naming scheme for branches containing FEDRawDataCollection for real data. I'm not sure how the main DAQ sources implement this. @smorovic Can you briefly explain this? I also don't know the standard way test sources like yours implement this and if it is normally or "should" be any different.

I'm guessing the code actually producing the FEDRawDataCollection should be an EDProducer and not a source. It's not obvious to me any other way this could work, although possibly I am missing something.

Setting labelRawDataLikeMC to false may be the only alternative that will work if there is some reason that cannot be done. I don't know what problems this may cause in your use case. The option labelRawDataLikeMC was designed to handle data produced before the new convention went into force ~2012.

*** Discussion title: Framework and Edm Development

Hi Jean-Roch, The original request was to allow a job to 'replace' the FEDRawDataCollection read in from a file with one defined in the job without having to reset all the configurations for all modules which were reading that collection. I.e. drop in a module which has the same module label as the 'old' FEDRawDataCollection and have that module read the 'old' collection and create a 'new' collection which all other modules then use. The way to facilitate that was to 1) not use 'source' as the label name for FEDRawDataCollection from data since 'source' is a reserved label and 2) put the read data's FEDRawDataCollection (which essentially comes from the detector) into its own process name so it would not be possible to get a 'label' collision.

All real data FEDRawDataCollections taken from now on will use the 'rawDataCollector' and 'LHC' labels and our code resets the old data to use those labels as well in order to make new and old data completely compatible.

All this was discussed in depth and agreed upon with all the Offline Level 2s.

Hope that sufficiently explains the reasoning, Chris

Christopher Jones Fermi National Accelerator Laboratory [email protected]

*** Discussion title: Framework and Edm Development

Hi everyone, At today's Core software meeting we discussed the proposed change of the FEDRawDataCollection module label from 'source' to 'rawDataCollector' in the 5_0_X series. See the hypernews thread https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/2860/1/1/1/1/1/1/1/1/1/1/1/1.html

Both HLT and RECO L2 agreed that this would be a good idea. HLT requested that they not have to change their configurations until after the 2011 data taking has stopped. To accommodate that, we will provide backward compatibility switch to PoolSource so that HLT menus from 4_4_X and 4_2_X can still work without change in 5_0_X to allow easy comparison. The DaqSource does not need such a backwards compatibility switch in 5_X since all HLT menus would have to be updated anyway to run online in that release. The migration plan for the configurations is as follows. RECO will modify their configurations in time for 5_0_0 in coordination with framework committing the changes to PoolSource. HLT will wait until after pp+HI data taking of 2011 to update their configurations which will go into a 5_0_X release.

Chris

Christopher Jones Fermi National Accelerator Laboratory [email protected]

wddgit avatar Jun 06 '24 15:06 wddgit

The code fails the assert because the implementation of labelRawDataLikeMC uses a vector of ProcessConfiguration's read directly from the input file. This vector exists in the input file for pre 2012 data. So it works OK in that case. In PR #580 (from 2013), this vector was dropped from the file format and no longer exists. Its absence results in the assert failure.

We don't normally see this error in modern files, because the there are not FEDRawDataCollections with the old module label in the data so that code does not execute.

wddgit avatar Jun 06 '24 15:06 wddgit

I was not involved in that back in 2011/12. Currently, FRD input source uses provenance helper to put raw data collection into event as with rawDataCollector. Here are relevant snippets of code:

  daqProvenanceHelper_(edm::TypeID(typeid(FEDRawDataCollection)))

  processHistoryID_ = daqProvenanceHelper_.daqInit(productRegistryUpdate(), processHistoryRegistryForUpdate());


//in FedRawDataInputSource::read(edm::EventPrincipal& eventPrincipal) :

  std::unique_ptr<FEDRawDataCollection> rawData(new FEDRawDataCollection);

  std::unique_ptr<edm::WrapperBase> edp(new edm::Wrapper<FEDRawDataCollection>(std::move(rawData)));
  eventPrincipal.put(daqProvenanceHelper_.branchDescription(), std::move(edp), daqProvenanceHelper_.dummyProvenance());

Link to the read function location: https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/src/FedRawDataInputSource.cc#L679-L681

I don't see anything related to labelRawDataLikeMC flag in the provenance helper. Similar also happens in DAQSource (with extension for data types in scouting).

There is also a second "raw" source that isn't much used except for some tests. This one still produces FRD collection with source label: https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/plugins/FRDStreamSource.cc

smorovic avatar Jun 06 '24 16:06 smorovic

@pfs The solution I recommend is that you follow the pattern in FedRawDataInputSource and use the DaqProvenanceHelper. I've have not done that myself yet. If you have problems, ask for help here and I will take a look.

An alternative would be to set labelRawDataLikeMC false and continue using the "source" label. I don't recommend that and if it causes problems it would your responsibility to deal with them... If it is just for a local test and whatever you need it for works, maybe...

Another alternative would be to separate the part of the code that creates the FEDRawDataCollection and make it an EDProducer and give it a different label.

@Dr15Jones One thing we could do would be to modify the assert to be an exception with a more informative message. Does that sound like a good idea? Modifying the Framework to be able to change the module label in modern files does not seem worth spending time on. Or we could do nothing. Or maybe you have a better idea.

wddgit avatar Jun 06 '24 17:06 wddgit

@wddgit changing the assert to an exception seems like a good choice to me.

Dr15Jones avatar Jun 06 '24 17:06 Dr15Jones

thanks a lot for the pointers and investigation from all . I'll try along the lines suggested

pfs avatar Jun 06 '24 19:06 pfs

I just submitted a PR to change the assert to an exception. See #45159

wddgit avatar Jun 06 '24 19:06 wddgit

Dear @wddgit @smorovic thanks a lot for the guidance. I have moved to a source a la EventFilter/Utilities/interface/FedRawDataInputSource.h adapted for HGCAL needs and I confirm that I don't experience this issue any longer and I'm able to run successfully the output file in other jobs (DQM, NANO, etc.). From my side I'm happy with the solution found so I could close from the hgcal DPG side. thanks again. Pedro

pfs avatar Jun 12 '24 14:06 pfs

+1 Great! My PR was also just merged. I recommend closing this Issue also. I'm not sure my +1 does anything or who actually has authority to close it...

wddgit avatar Jun 12 '24 14:06 wddgit

+core

makortel avatar Jul 10 '24 20:07 makortel

@cmsbuild, please close

makortel avatar Jul 10 '24 20:07 makortel

This issue is fully signed and ready to be closed.

cmsbuild avatar Jul 10 '24 20:07 cmsbuild