DIRAC
Proposal for input data resolution format
A file like dirac_job_input.json should be written by the InputDataResolution module containing:
{
  # Zero or more LFNs from the JDL
  "LFN:/vo/first/lfn": {
    # List of replicas, mixture of protocols and sites, ordered by priority
    "replicas": [
      {
        "url": "/current/working/directory/xxx.dst",
        "se": "LOCAL-USER"
      },
      {
        "url": "https://host1.invalid/vo/first/lfn",
        "se": "EXAMPLE-USER"
      },
      {
        "url": "root://host2.invalid//vo/first/lfn",
        "se": "EXAMPLE-USER"
      },
      {
        "url": "https://host3.invalid/vo/first/lfn",
        "se": "OTHER-USER"
      },
      {
        "url": "root://host4.invalid//vo/first/lfn",
        "se": "OTHER-USER"
      }
    ],
    # We can extend this to contain other file-level metadata
    # For now this seems reasonable to me
    "size_bytes": 123456,
    "checksum": {
      "adler32": "ffffff"
    }
  }
}
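To make the producer side concrete, here is a minimal sketch of how an entry with this shape could be built and written. It is not the InputDataResolution implementation: `adler32_hex`, `build_entry` and `write_job_input` are hypothetical helpers, and zero-padding the checksum to 8 hex digits is an assumption. The `#` comments in the example above are annotations only; the file on disk would have to be plain JSON.

```python
import json
import os
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def build_entry(local_path, remote_replicas):
    """Build one per-LFN entry; the local copy goes first (highest priority)."""
    # "LOCAL-USER" mirrors the SE name used in the example above.
    replicas = [{"url": os.path.abspath(local_path), "se": "LOCAL-USER"}]
    replicas.extend(remote_replicas)
    return {
        "replicas": replicas,
        "size_bytes": os.path.getsize(local_path),
        "checksum": {"adler32": adler32_hex(local_path)},
    }


def write_job_input(entries, path="dirac_job_input.json"):
    """Write the LFN -> entry mapping as plain JSON (no comments)."""
    with open(path, "w") as handle:
        json.dump(entries, handle, indent=2)
```

A caller would pass something like `{"LFN:/vo/first/lfn": build_entry("xxx.dst", remote_replicas)}` to `write_job_input`, with `remote_replicas` already ordered by priority.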
Some additional context and info:
- LHCb uses pool_xml_catalog.xml, which contains similar information, but that is a format "for Gaudi" and so, in practice, for LHCb only
- the format above is "just a format" that does not pretend to be a "standard"
- the actual applications may use this file as-is, or not
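For an application that does use the file as-is, consumption could look roughly like the sketch below. It assumes the file is plain JSON with the structure proposed above; `load_job_input`, `select_replicas` and the default scheme list are hypothetical names and choices, not part of DIRAC.

```python
import json
from urllib.parse import urlparse


def load_job_input(path="dirac_job_input.json"):
    """Load the LFN -> entry mapping written by the input data resolution step."""
    with open(path) as handle:
        return json.load(handle)


def select_replicas(job_input, supported_schemes=("", "file", "root", "https")):
    """Map each LFN to the URL of its first replica with a supported scheme.

    An empty scheme means a plain local path such as /current/working/directory/xxx.dst.
    """
    selected = {}
    for lfn, entry in job_input.items():
        for replica in entry["replicas"]:  # already ordered by priority
            if urlparse(replica["url"]).scheme in supported_schemes:
                selected[lfn] = replica["url"]
                break
        else:
            raise LookupError(f"No usable replica for {lfn}")
    return selected
```

Because the replicas are already ordered by priority, the application only has to filter by what it can open; an application that only speaks XRootD would call `select_replicas(job_input, supported_schemes=("root",))`.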
JobWrapper also prints messages like "GUIDs not found from POOL XML Catalogue (and were generated) for: LFN:/ctao.dpps.test/t_20251001_075710_be33/test.dl1.h5" which are at a minimum misleading.
A bit more context:
In the JobWrapper, when it is time to upload the outputs, failoverTransfer.transferAndRegisterFile() passes a GUID to DataManager. At the moment this GUID is either found in pool_xml_catalog.xml or (effectively for anyone but LHCb) generated on the fly with makeGUID(). So the current tooling can't be "just removed", because that would break LHCb uploads. Instead, the JobWrapper should find the POOL XML handling in the extension.
So long as we can extend makeGUID, we could avoid needing to do anything else. It's cheap to extract the GUID from the file these days.
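The idea could be sketched as follows. This is only an illustration under stated assumptions: `default_make_guid` is a stand-in that generates a random UUID and is not DIRAC's makeGUID() implementation, and `resolve_guid`, `catalog_guids` and the `make_guid` hook are hypothetical names showing where an extension could plug in its own logic.

```python
import uuid


def default_make_guid(file_path=None):
    """Stand-in for makeGUID(): return a freshly generated GUID string."""
    # Assumption: a random UUID is good enough when no other GUID source exists.
    return str(uuid.uuid4()).upper()


def resolve_guid(lfn, file_path, catalog_guids=None, make_guid=default_make_guid):
    """Return the GUID to register for an output file.

    catalog_guids is an optional LFN -> GUID mapping (for example, one parsed
    from a POOL XML catalogue by an extension). make_guid is the fallback and
    can be replaced by an extension that extracts the GUID from the file.
    """
    if catalog_guids and lfn in catalog_guids:
        return catalog_guids[lfn]
    return make_guid(file_path)
```

With a hook like this, the default path never needs to mention the POOL XML catalogue, while an extension can pass either its own mapping or its own `make_guid` that reads the GUID directly from the file.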
@aldbr and I are looking at implementing a prototype of this in dirac-cwl-proto.