
upgrade CMSSW data reading via xroot protocol for packet labeling

Open stlammel opened this issue 9 months ago • 26 comments

During February O&C Week Marian Babik presented the network packet labeling project.

Could we please upgrade data reading in CMSSW to automatically set flow labels

  • 196656 in case of Production Input
  • 196664 in case of Analysis Input
  • 196700 Secondary Input
  • 196704 Simulation Pileup Input

in case of reading via xroot protocol? For xrootd the flow label can be set via CGI parameter, i.e. scitag.flow=<scitag_id> or just adding "?scitag.flow=<scitag_id>" to the URL path. (I believe Tier-0 already adds a "?eos.app=cmst0" to the path right now.)

I would make 196664 and 196700 the default for primary/secondary input and provide a mechanism for WM to override the default (always for primary input and in case of pileup library as secondary input).

Thanks,

  • Stephan

stlammel avatar Mar 05 '25 23:03 stlammel

cms-bot internal usage

cmsbuild avatar Mar 05 '25 23:03 cmsbuild

A new Issue was created by @stlammel.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Mar 05 '25 23:03 cmsbuild

assign core

makortel avatar Mar 05 '25 23:03 makortel

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Mar 05 '25 23:03 cmsbuild

How urgent is this, i.e. when should these tags be used in production?

What exactly does "Secondary Input" mean? Given the distinction from "Simulation Pileup Input", I'd assume the "two-file solution", but then I'm confused by

I would make 196664 and 196700 the default for primary/secondary input and provide a mechanism for WM to override the default (always for primary input and in case of pileup library as secondary input).

because the "secondary input" of two-file solution is orthogonal to "pileup input", and theoretically both could be used in the same job.

makortel avatar Mar 05 '25 23:03 makortel

I suspect we don't have any easy/straightforward way to communicate the nature of "Production Input" / "Secondary Input" / "Simulation Pileup Input" from the PoolSource / EmbeddedRootSource to XrdAdaptor (so that XrdAdaptor would do the modification of URL).

If that is the case, I guess the PoolSource / EmbeddedRootSource would have to add the CGI parameter to the PFN in case of root:// protocol or something. So far we have been able to keep the Sources in general agnostic on the protocols.

makortel avatar Mar 05 '25 23:03 makortel

Hallo Matti, not urgent, packet labeling is an HL-LHC project but of course we would like to have it well before then. I thought we had the option to read/access missing information from the parent dataset. That's what I thought of as secondary input until the pileup library came along. I assume the two use a similar "second input". If the two are separate, then both can be set directly and only an override for the primary input is needed. Yes, I think it could be done after the LFN-to-PFN translation, before/outside of XrdAdaptor, via a PFN manipulation; hence my mention of the Tier-0 query addition. Thanks, cheers, Stephan

stlammel avatar Mar 05 '25 23:03 stlammel

but of course we would like to have it well before then.

What could that mean? End of this year? End of 2026? End of 2027?

I thought we had the option to read/access missing information from the parent dataset.

This corresponds to the "two-file solution" I mentioned above.

I assume the two use a similar "second input". If the two are separate, then both can be set directly and only an override for the primary input is needed.

The technical mechanism is indeed separate, which is also why "secondary input" is on many occasions a rather confusing name :) Note that we envision being able to have even more kinds of inputs.

What if an analysis job uses "Secondary Input"?

makortel avatar Mar 06 '25 00:03 makortel

Well, there is no hard need-by date. The earlier the better, the sooner we can see what goes over the network. It would be great if this were available next year. We could distinguish production and analysis secondary input further, but I assume it's small compared to primary input (except pileup), so I thought one activity for both production and analysis secondary input was enough and that separating pileup was more important. If you feel it's important, we could add an activity.

  • Stephan

stlammel avatar Mar 06 '25 01:03 stlammel

Well, there is no hard need-by date. The earlier the better, the sooner we can see what goes over the network. It would be great if this were available next year.

Ok, thanks. I interpret this as "would be great to have in CMSSW_16_0_0" (the data taking release of next year).

We could distinguish production and analysis secondary input further, but I assume it's small compared to primary input (except pileup), so I thought one activity for both production and analysis secondary input was enough and that separating pileup was more important. If you feel it's important, we could add an activity.

I can't tell if that would be important (I feel this classification should primarily come from elsewhere).

A few more questions:

Does "Simulation Pileup Input" cover both the MinBias files read by classical mixing (and premixing stage1 that creates the premix library), and the Premixed files read by premixing overlay?

What about other users of EmbeddedRootSource? Beyond MixingModules it is presently used in SiStripSpyEventMatcherModule, but I guess it is run in grid, and I don't know if it would read over xrootd.

Should we classify all present and future "secondary files" (via PoolSource's secondaryFileNames, via EmbeddedRootSource, via something we add in the future) that are not about pileup as "Secondary Input" ?

makortel avatar Mar 06 '25 17:03 makortel

Hallo Matti, so, I would make all "Secondary Input" (use of EmbeddedRootSource?) report as activity 196700, except for the Premixed Library files (activity 196704). So, classical mixing would be 196700 in this arrangement. (If the default activity 196700 can be overridden (via config or env variable), then we could separate the min bias dataset input for classical mixing later, if needed. With ongoing HL-LHC discussions of maybe generating the pileup on the fly, I cannot exclude the need to adjust/correct our current strategy before/during HL-LHC.) Thanks, cheers, Stephan

stlammel avatar Mar 06 '25 18:03 stlammel

Just to make sure I understand how this works. The text string "?scitag.flow=<scitag_id>" gets appended to the PFN. Once that is done with the correct value for the ID, then XROOTD will do the rest of the work. Is that correct?

Does it just get ignored if we are not using XROOTD?

wddgit avatar Oct 13 '25 21:10 wddgit

Just to make sure I understand how this works. The text string "?scitag.flow=<scitag_id>" gets appended to the PFN. Once that is done with the correct value for the ID, then XROOTD will do the rest of the work. Is that correct?

Correct (to my understanding).

Does it just get ignored if we are not using XROOTD?

No, generally it won't get ignored (e.g. local file storage plugin would interpret it as a file name), so it should be restricted to root: protocol (maybe better to make the code itself more generic though).
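As a minimal sketch of the restriction described above (plain Python, not CMSSW code): the helper name `add_flow_label` is hypothetical, and joining with `&` when the PFN already carries a query string (such as `?eos.app=cmst0`) is an assumption about how the CGI parameters would be combined.

```python
def add_flow_label(pfn: str, scitag_id: int) -> str:
    """Append scitag.flow=<id> to a root: PFN; return any other PFN unchanged.

    Hypothetical helper for illustration only.
    """
    # Only the root: protocol is handled; e.g. a local file plugin would
    # otherwise treat the appended text as part of the file name.
    if not pfn.startswith("root:"):
        return pfn
    # Do not add the label twice.
    if "scitag.flow=" in pfn:
        return pfn
    # Assumption: an existing query string is extended with '&'.
    separator = "&" if "?" in pfn else "?"
    return f"{pfn}{separator}scitag.flow={scitag_id}"


print(add_flow_label("root://host.example//store/file.root", 196664))
print(add_flow_label("file:/tmp/file.root", 196664))
```

The second call leaves the `file:` path untouched, which is the behavior the restriction to `root:` is meant to guarantee.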

makortel avatar Oct 13 '25 23:10 makortel

Yes, this is also my understanding. I would expect a few other protocol names to be mapped to xrootd like "xroot:", "roots:", etc., no?

  • Stephan

stlammel avatar Oct 14 '25 06:10 stlammel

I would expect a few other protocol names to be mapped to xrootd like "xroot:", "roots:", etc., no?

The protocols recognized by TFileAdaptor are root: and xroot: (other protocols work but the behavior is purely up to ROOT)

Although I doubt that xroot: really works, because StorageFactory appears to use everything in the path up to the first : as the protocol, and asks for the storage plugin registered under that protocol
https://github.com/cms-sw/cmssw/blob/631c4b24f310cf5c17e5b057220692a7028afba9/Utilities/StorageFactory/src/StorageFactory.cc#L120-L131
https://github.com/cms-sw/cmssw/blob/631c4b24f310cf5c17e5b057220692a7028afba9/Utilities/StorageFactory/src/StorageFactory.cc#L113
and only the root protocol is registered for XrdAdaptor
https://github.com/cms-sw/cmssw/blob/631c4b24f310cf5c17e5b057220692a7028afba9/Utilities/XrdAdaptor/plugins/XrdStorageMaker.cc#L207
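The lookup described above can be modeled roughly as follows. This is a plain-Python sketch of the described behavior only, not the actual StorageFactory code; the registry contents and the names `REGISTERED_MAKERS`, `split_protocol`, and `find_maker` are assumptions made for illustration.

```python
# Hypothetical registry: protocol name -> storage plugin name.
# Only "root" maps to the XrdAdaptor plugin, as stated in the text.
REGISTERED_MAKERS = {
    "root": "XrdStorageMaker",
    "file": "LocalStorageMaker",  # illustrative entry
}


def split_protocol(url: str) -> tuple[str, str]:
    """Split a URL at the first ':' into (protocol, rest)."""
    protocol, sep, rest = url.partition(":")
    if not sep:
        # No ':' at all, so there is no protocol part.
        return "", url
    return protocol, rest


def find_maker(url: str):
    """Return the registered plugin name for the URL's protocol, or None."""
    protocol, _ = split_protocol(url)
    return REGISTERED_MAKERS.get(protocol)


print(find_maker("root://host.example//store/file.root"))
print(find_maker("xroot://host.example//store/file.root"))
```

Under this model an `xroot:` URL finds no plugin, which is the reason for the doubt expressed above.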

makortel avatar Oct 14 '25 22:10 makortel

provide a mechanism for WM to override the default (always for primary input and in case of pileup library as secondary input).

@stlammel Would you have any further input on what kind of hooks WM would like to have in order to override the default?

What is the impact of the freeze of WMCore and the development of the new WM system?

makortel avatar Oct 14 '25 22:10 makortel

Thanks Matti, then we are settled on "root:"!

I don't know how the pile-up library/dataset is specified. I would keep it together with the dataset (file list) specification. Maybe for both the primary and the pile-up dataset an additional keyword/parameter, like scitagID = cms.untracked.uint32(196704)?

Thanks, cheers, Stephan

stlammel avatar Oct 15 '25 06:10 stlammel

I'm thinking that maybe we should also allow a configuration parameter value that turns packet labeling off, maybe 0 means don't add the CGI parameter to the PFN. Or do we have enough confidence that packet labeling is sufficiently benign and negligible in terms of performance that we always leave it on and don't need to waste effort building in the ability to turn it off?

Should we restrict the allowed values to ones discussed above and throw an exception if other values are configured?

wddgit avatar Oct 15 '25 14:10 wddgit

Yes, being able to disable sounds like a useful feature. Maybe -1 instead of 0? The CMS SciTag id range is from 196612 to 196860 (inclusive), so anything outside we could/should flag as an error or disregard.
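A minimal sketch of such a check, assuming -1 as the "disabled" value and assuming that out-of-range values are disregarded in favor of the default (rather than flagged as an error). The function and constant names are hypothetical.

```python
CMS_SCITAG_MIN = 196612  # lower bound of the CMS SciTag id range (inclusive)
CMS_SCITAG_MAX = 196860  # upper bound of the CMS SciTag id range (inclusive)
DISABLED = -1            # assumed sentinel meaning "do not label packets"


def effective_flow_label(requested: int, default: int):
    """Return the flow label to use, or None if labeling is switched off.

    Assumption: a value outside the CMS range is disregarded and the
    default is used instead.
    """
    if requested == DISABLED:
        return None
    if CMS_SCITAG_MIN <= requested <= CMS_SCITAG_MAX:
        return requested
    return default


print(effective_flow_label(196656, default=196664))  # in range: used as given
print(effective_flow_label(12345, default=196664))   # out of range: default
print(effective_flow_label(-1, default=196664))      # disabled: None
```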

stlammel avatar Oct 15 '25 15:10 stlammel

@stlammel Rereading this thread again I see CMSSW would be expected to use default values for some of the cases. Just to double check if I understood the request correctly

  • PoolSource would use 196664 by default ("analysis input")
    • For both primary and secondary files in case of 2-file solution
    • Setting the value to 196656 ("production input") would be up to WM
  • EmbeddedRootSource via PreMixingModule would use 196704 by default ("Simulation Pileup Input")
  • EmbeddedRootSource via other modules would use 196700 by default ("Secondary Input")
    • Or maybe just for MixingModule for classical mixing? And leave other uses of EmbeddedRootSource to "not set" state?

Is this a correct interpretation of the request?

I would not do any validation of these parameters inside CMSSW, i.e. CMSSW passes on whatever is given. This follows from our usual approach towards computing monitoring, i.e. we do not abort the application just because something would be fishy towards monitoring.

makortel avatar Oct 20 '25 17:10 makortel

Hallo Matti, yes, sounds good. If a value outside the CMS assigned SciTag id range is provided, we should ignore the attempted override. I would not limit the 196700 default setting to only classical mixing but any secondary input. It's hopefully a small WM enhancement the current operations team can make. Thanks, cheers, Stephan

stlammel avatar Oct 20 '25 19:10 stlammel

@stlammel Below is a draft of an interface that Matti, Chris, and I came up with; we are still discussing its merits. We wanted your opinion (and the opinion of anyone else you think should be involved in this discussion). It allows some flexibility for change in the future and puts the interface in the configuration of the SiteLocalConfigService. WM would only need to deal with the one-line second part, unless they wanted to change something other than turning the production option on and off. We would take care of the defaults.

process.SiteLocalConfigService.urlAppendValues = cms.untracked.PSet(
    allowed = cms.untracked.vstring("196656", "196664", "196700", "196704"),
    cases = cms.untracked.PSet(
        primary = cms.untracked.string("196664"),
        embedded = cms.untracked.string("196700"),
        premixedPileup = cms.untracked.string("196704"),
    ),
    protocols = cms.untracked.VPSet(
        cms.untracked.PSet(
            protocol = cms.untracked.string("^x?root:"),
            argument = cms.untracked.string("scitag.flow"),
        )
    )
)

with WM customization

process.SiteLocalConfigService.urlAppendValues.cases.primary = "196656"

being independent of the protocol.
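To illustrate how such a configuration might be applied, here is a plain-Python model of the draft (ordinary dicts instead of `cms.untracked.PSet`, so it runs without CMSSW). The function `append_value` is hypothetical, and two points are assumptions rather than part of the draft: that a value missing from `allowed` leaves the PFN unchanged, and that an existing query string is extended with `&`.

```python
import re

# Plain-dict stand-in for the draft urlAppendValues PSet.
url_append_values = {
    "allowed": ["196656", "196664", "196700", "196704"],
    "cases": {
        "primary": "196664",
        "embedded": "196700",
        "premixedPileup": "196704",
    },
    "protocols": [
        {"protocol": "^x?root:", "argument": "scitag.flow"},
    ],
}


def append_value(pfn: str, case: str, config: dict = url_append_values) -> str:
    """Append '<argument>=<value>' to the PFN if its protocol matches."""
    value = config["cases"].get(case)
    # Assumption: unknown cases or values not in 'allowed' leave the PFN as is.
    if value is None or value not in config["allowed"]:
        return pfn
    for entry in config["protocols"]:
        if re.search(entry["protocol"], pfn):
            separator = "&" if "?" in pfn else "?"
            return f"{pfn}{separator}{entry['argument']}={value}"
    return pfn


print(append_value("root://host.example//store/a.root", "premixedPileup"))

# Equivalent of the one-line WM customization for production input.
url_append_values["cases"]["primary"] = "196656"
print(append_value("root://host.example//store/a.root", "primary"))
```

In this model the override touches only the `cases` entry, so the protocol matching and the argument name stay under the control of the defaults.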

wddgit avatar Nov 21 '25 21:11 wddgit

Hallo David, I like the config specifying the defaults and then having an override. Is SiteLocalConfigService the right place, though? This is not site-specific but a CMS-wide config, overridden not per site but per workflow/workflow class. Thanks, cheers, Stephan

stlammel avatar Nov 22 '25 07:11 stlammel

Is SiteLocalConfigService the right place, though? This is not site-specific but a CMS-wide config, overridden not per site but per workflow/workflow class.

Not really, but it is the component that presently contains the full definitions of LFN-to-PFN conversion.

In addition, I could argue that site-local-config.xml already contains configuration options that really should not be site-specific (anymore), or at least should not apply globally to a given site (e.g. the knobs for CMSSW caching behavior should at minimum depend on the access protocol).

makortel avatar Nov 24 '25 15:11 makortel

Agreed, we already have general config information duplicated in each site-local-config.xml. I am still hoping we can clean this up, maybe for HL-LHC? This could be the opportunity to do so. (We could check whether Duong is interested in this and has time, if you/the core team support this.)

  • Stephan

stlammel avatar Nov 24 '25 15:11 stlammel

Agreed, we already have general config information duplicated in each site-local-config.xml. I am still hoping we can clean this up, maybe for HL-LHC? This could be the opportunity to do so.

I think it would be great if we could clean up unnecessary information from site-local-config.xml. I'm a bit worried about the old software, though.

makortel avatar Dec 04 '25 23:12 makortel