WMCore
Keep track of job classads requested by SI for monitoring and enhanced job scheduling
Related to #8958
Impact of the new feature: Report classads needed for monitoring and enhanced job scheduling at the SI level. Keep a list of what has been implemented already and what is missing.
Additional context: List based on SI reference from here
List of classads:
Classad | Implemented already? | Description |
---|---|---|
CMS_Type | Yes | Job type: analysis, production, tier0, test |
AccountingGroup | Yes | |
AcctGroup | Yes | analysis, highprio, production, tier0 |
CMS_JobType | Yes | production: Processing, Production, Merge, LogCollect,Cleanup. tier0: Express, Merge, Repack |
CMSSW_Versions | Yes | E.g.: "CMSSW_10_2_13" |
DESIRED_Archs | Yes | E.g.: "INTEL,X86_64" |
REQUIRED_OS | Yes | E.g.: rhel6, rhel7 |
WMAgent_RequestName | Yes | Indicates the production workflow to which a job belongs, such as: wmagent_TC_PreMix_khurtado_TC_PreMix_190828_221230_7707 |
WMAgent_SubTaskName | Yes | String example: /wmagent_TC_PreMix_khurtado_TC_PreMix_190828_221230_7707/EXO_RunIIFall18wmLHEGS_00129_0/EXO_RunIIAutumn18DRPremix_00492_0/EXO_RunIIAutumn18DRPremix_00492_1/EXO_RunIIAutumn18DRPremix_00492_1MergeAODSIMoutput/EXO_RunIIAutumn18MiniAOD_00502_0/EXO_RunIIAutumn18MiniAOD_00502_0CleanupUnmergedMINIAODSIMoutput |
CMSGroups | Yes | Groups from FQAN, E.g.: B2G, EXO, HIG, JME, SMP, SUS, TOP |
DESIRED_CMSDataset | Yes | Primary input dataset, example: /ZJetToEE_Pt-120to170_TuneZ2star_8TeV_pythia6/Summer12_DR53X-PU_S10_START53_V7A-v1/AODSIM |
DESIRED_CMSDataLocations | Yes | Description |
DESIRED_CMSPileups | Yes | Secondary input dataset, example: `/Neutrino_E-10_gun/RunIISummer19ULPrePremix-UL17_106X_mc2017_realistic_v6-v1/PREMIX` |
MaxWallTimeMins | Yes | Job run time request, related to EstimatedSingleCoreMins, and number of cores on which the job runs |
MinCores, MaxCores, OriginalCpus, RequestCpus | Yes | Requested CPU cores, also considering that the job can be resized within the range defined by the min and max values. The maximum is usually limited to 15 cores. We could think of dynamically selecting the max bound based on the total CPU cores available in the slot. Also, we could introduce some rank to the slots? Maybe just 2x initially |
RequestDisk | Yes | Disk requested |
RequestMemory | Yes | Memory requested |
TransferInput | Yes | Input sandbox files |
TransferInputSizeMB | Yes | Sandbox Size |
TransferOutput | Yes | Output sandbox |
Estimated_InputRate | No | IO requirements for job |
CMS_WMTool | Yes | Output: "WMAgent" |
CMS_SubmissionTool | Yes | Output: "WMAgent" |
CMS_CampaignName | Yes | Comma separated string with the job campaign names |
CMS_extendedTaskType | Yes | Refer to #10604 |
CMS_PrimaryInputLocation | No | Refer to #8958. Values are: (Onsite, Offsite, Mixed) |
CMS_SecondaryInputLocation | No | Refer to #8958. Values are: (Onsite, Offsite, Mixed) |
GlobalTag | No | GlobalTag defined in Request |
GLIDEIN_CMSSubsiteName | No | Subsite / resource name within the same site |
GLIDEIN_CMSSite | Yes | CMSSite name from machine classad attributes |
GLIDEIN_Gatekeeper | Yes | Gatekeeper name from machine classad attributes |
Others? Maybe some coming from request (to double check/confirm with Antonio)
Classad | Implemented already? | Description |
---|---|---|
SizePerEvent | No | From request description |
PrimaryDataset | No | From request description |
Thanks, Kenyi. For the purpose of monitoring, I'd also add to the job classad the subsite tag (CMSSubsiteName) that some of our pilots are starting to use (for site extensions, opportunistic resources, etc). Not sure how that needs to be implemented though. Something like "MATCH_GLIDEIN_CMSSubsiteName"?
@aperezca So, it seems this was introduced 4 months ago: https://gitlab.cern.ch/CMSSI/CMSglideinWMSValidation/commit/2e297e5e014a9bcb0366f5c86e4b4975efabc1af#9a79982274b6f9e57134f8d013f535564fae0131
If we create `GLIDEIN_CMSSubsiteName`, would that be enough? If so, what should be the value for that variable? I'm not sure I understand what a "subsite" is.
@khurtado This variable should take strings as values, and the content serves as a tag to indicate on which resources, within a certain site, the pilots, and therefore the payload jobs, were executed. This is for monitoring and debugging purposes. A site may aggregate diverse CPU resources, such as the usual pledged WN farm, but also expand into cloud, opportunistic slots in the university campus, etc. Each one of those fractions of the total can be considered a subsite for the purposes of classifying where jobs run. Also, we can consider a federated "CMS site", where a certain T1 or T2 is actually the union of a number of geographically separated computing centers acting in a coordinated way. Each one of the centers would be a subsite. As another example, consider the CINECA HPC center, which we are using via the CNAF T1. We assign jobs and send pilots to T1_IT_CNAF, but the site can decide to expand opportunistically into HPC slots. CNAF-CINECA would be a subsite of T1_IT_CNAF.
@aperezca Thanks! So, we just need to make sure this is propagated in the job classads from the machine attribute classads, like we do with `GLIDEIN_CMSSite` or `GLIDEIN_Gatekeeper`, right?
If so, we can just add the new attribute here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L639
I will add the parameter to the table
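For illustration, here is a minimal sketch of how such a propagation could be expressed, assuming the job ad relies on HTCondor's `$$([...])` match-time substitution to copy a machine ad attribute into the job ad. The helper names, the `JOB_` prefix and the `"Unknown"` default are hypothetical and are not taken from `SimpleCondorPlugin.py`; the actual plugin may name and quote these attributes differently.

```python
# Hypothetical sketch: build job-ad expressions that copy machine ad
# attributes into the job ad at match time. The "JOB_" prefix and the
# "Unknown" fallback are assumptions made for this example only.

MACHINE_ATTRS = ["GLIDEIN_CMSSite", "GLIDEIN_Gatekeeper", "GLIDEIN_CMSSubsiteName"]


def passthrough_expr(machine_attr, default="Unknown"):
    """Return a $$([...]) expression that resolves to the machine ad value,
    or to `default` when the matched machine does not define the attribute."""
    return '$$([ifThenElse(%s is undefined, "%s", %s)])' % (
        machine_attr, default, machine_attr)


def build_passthrough_ads(machine_attrs, prefix="JOB_"):
    """Map a (hypothetical) job attribute name to its pass-through expression."""
    return {prefix + name: passthrough_expr(name) for name in machine_attrs}


ads = build_passthrough_ads(MACHINE_ATTRS)
for key in sorted(ads):
    print(key, "=", ads[key])
```

With this approach, adding the subsite tag would amount to one more entry in the attribute list, and pilots that do not advertise `GLIDEIN_CMSSubsiteName` would fall back to the default value instead of leaving the expression unresolved.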
Yes! By the way, does this apply both to CRAB and WMAgent jobs? Do we need any extra step?
Coming back to this old topic (4 years!), I'd propose to also consider importing all or most of the relevant parameters created by jobs in the FJR, as this helps to debug, understand and optimize CPU efficiency and "badput" results. Apparently, as per recent discussions on this matter, there are certain internal job timing metrics that seem to be available only there (how much time is spent in initialization versus event loop, remote reads, etc). Therefore, parsing that FJR info and pushing it to the classad of completed jobs (which can then be retrieved by the monitoring scripts) would be very helpful.
@aperezca Antonio, apologies for the belated reply. FJR based metrics are already provided through the WMArchive monitoring, and we are also extending the metrics to pretty much cover all of the performance metrics available in the FJR. That might be done in the next week or so.
It looks like we have covered almost everything that is listed in the initial description (@khurtado is finalizing the campaign and task name, which we can then update in the initial table).
That said, I wonder how we should rank the remaining metrics. Some of them have a different source, or perhaps no reliable source at all, which is why we have been spawning sub-tasks (tickets) to deal with each specific case. Please let us know what should be considered next so that we can properly plan it for Q4/2023.
@aperezca Coming back to this topic, out of the 34 in the original table, we have 7 pending. If you could rank the implementation preference of the remaining metrics, that would be great! That way, we can plan ahead in the future quarters for the corresponding implementation. At the same time, if there are any that you feel are not relevant anymore (or new ones we should be aware of), please let us know.
Classad | Implemented already? | Description |
---|---|---|
Estimated_InputRate | No | IO requirements for job |
CMS_PrimaryInputLocation | No | Refer to #8958. Values are: (Onsite, Offsite, Mixed) |
CMS_SecondaryInputLocation | No | Refer to #8958. Values are: (Onsite, Offsite, Mixed) |
GlobalTag | No | GlobalTag defined in Request |
GLIDEIN_CMSSubsiteName | No | Subsite / resource name within the same site |
Others? Maybe some coming from request (to double check/confirm with Antonio)
Classad | Implemented already? | Description |
---|---|---|
SizePerEvent | No | From request description |
PrimaryDataset | No | From request description |
From the monitoring side, I can comment on the features most requested by users:
- CMS_PrimaryInputLocation - Refer to https://github.com/dmwm/WMCore/issues/8958. Values are: (Onsite, Offsite, Mixed)
- CMS_SecondaryInputLocation - Refer to https://github.com/dmwm/WMCore/issues/8958. Values are: (Onsite, Offsite, Mixed)
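
As a rough illustration of how the three values could be derived, the sketch below compares the sites where a job may run with the sites holding its input data. The exact semantics are defined in #8958 and are not restated in this thread, so the rule used here (Onsite when every candidate site hosts the data, Offsite when none does, Mixed otherwise) is only an assumption, and the function name and the site lists in the example are hypothetical.

```python
# Hypothetical sketch: derive an input-location tag from two site lists.
# The Onsite/Offsite/Mixed rule below is an assumed reading, not the
# definition agreed in #8958.

def classify_input_location(candidate_sites, data_locations):
    """Return "Onsite", "Offsite" or "Mixed", or None if either list is empty."""
    sites = set(candidate_sites)
    data = set(data_locations)
    if not sites or not data:
        return None  # nothing to classify (e.g. no input dataset)
    overlap = sites & data
    if overlap == sites:
        return "Onsite"   # every candidate site hosts the input data
    if not overlap:
        return "Offsite"  # no candidate site hosts the input data
    return "Mixed"        # only some candidate sites host the input data


# Illustrative site names only
print(classify_input_location(["T1_IT_CNAF"], ["T1_IT_CNAF", "T2_CH_CERN"]))
print(classify_input_location(["T2_CH_CERN"], ["T1_IT_CNAF"]))
print(classify_input_location(["T1_IT_CNAF", "T2_CH_CERN"], ["T1_IT_CNAF"]))
```

The same function could be called once with the primary input locations and once with the pileup locations to fill `CMS_PrimaryInputLocation` and `CMS_SecondaryInputLocation` respectively.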