WMCore
Report input files and lumi range for failed jobs in (T0) wmstats
Slava requested that we change this splitting to include the run/lumi information in the job name to help with debugging. He also requested an easier way to display run/lumi information for jobs in WMStats in general, such as having a lumi number or range in the failed job IDs. This has been discussed in the JIRA ticket https://its.cern.ch/jira/browse/CMSTZ-248 .
Andres, you basically have two requests in this GH issue: a) change the job ID to include run/lumi information. This is not going to work, the reason being that we don't control how many lumi sections and/or lumi ranges each job can process, so it would make the job ID length variable on a scale that we do not control. b) display the run/lumi information in WMStats (or via a WMStats REST API): this one looks more reasonable and I actually thought we already had this information; however, I see that information empty in the production WMStats.
Do we have such information in the T0 wmstats? Can you point me to a workflow with paused jobs in a replay instance?
BTW, Repack jobs don't have job Mask information, so we don't know which lumi sections and/or events will come out of those jobs:
'mask': {'LastRun': None, 'LastLumi': None, 'FirstRun': None, 'inclusivemask': True, 'runAndLumis': {}, 'LastEvent': None, 'FirstEvent': None, 'jobID': 1051, 'FirstLumi': None}
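For comparison, here is a minimal sketch (purely illustrative values, not taken from a real job) of what a populated mask would look like for a job type that does carry run/lumi information, e.g. PromptReco, where runAndLumis maps a run number to a list of [first, last] lumi ranges:
# Illustrative only: a hypothetical PromptReco job processing lumis 1-50 and
# 75-80 of run 349840. The 'runAndLumis' dictionary maps a run number to a
# list of [first, last] lumi ranges; that is exactly the piece missing from
# the Repack mask quoted above.
promptreco_mask = {
    'FirstRun': 349840, 'LastRun': 349840,
    'FirstLumi': 1, 'LastLumi': 80,
    'FirstEvent': None, 'LastEvent': None,
    'inclusivemask': True,
    'jobID': 1052,  # hypothetical job id
    'runAndLumis': {349840: [[1, 50], [75, 80]]},
}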
WMStats has both Input files and Lumis fields to report such information. We have to investigate why it's not displaying those details and fix it. Maybe that would be enough to start with.
Option a) would only be considered as a workaround if option b) is not feasible or takes too long. We could probably devise a "uuid-lumiinfo" naming scheme that would work for option a) (who cares how long the job name is and whether it's variable length or fixed...), but if option b) is in principle supposed to be available, I'd rather go in that direction.
We mostly care about PromptReco jobs here, but we can also fix the Repack (and maybe Express) jobs to add a lumi mask if that makes the monitoring more consistent.
(who cares how long the job name is and whether it's variable length or fixed...)
Don't forget we still use a relational database as a backend and it defines a schema for the wmbs_job table.
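To make that constraint concrete, here is a small sketch of the problem; the 255-character limit and the "uuid-lumiinfo" naming scheme are assumptions for illustration, the real limit is whatever the wmbs_job schema defines:
import uuid

# Assumed maximum length of the wmbs_job name column; check the actual WMBS
# schema (the CREATE TABLE statements shipped with WMCore) before relying on it.
MAX_NAME_LENGTH = 255

def job_name_with_lumis(run, lumi_ranges):
    """Hypothetical 'uuid-lumiinfo' naming scheme from option a)."""
    suffix = "_".join("%d-%d-%d" % (run, first, last) for first, last in lumi_ranges)
    return "%s-%s" % (uuid.uuid4(), suffix)

# A job with many disjoint lumi ranges quickly outgrows the column:
name = job_name_with_lumis(349840, [(i, i) for i in range(1, 60)])
print(len(name), len(name) > MAX_NAME_LENGTH)  # well above 255 -> True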
BTW, when would it be desirable to have this feature in the system? Is it only for Run3?
Slava asked for it and it would be to help debugging Tier0 jobs. So yes, mostly Run3. Could also be useful for debugging ReReco jobs though, which would mean later this year. But maybe this already works in WMStats and for some reason just not in the Tier0 WMStats?
Naah, I see the same problem in the production wmstats. However, I'm pretty sure there are cases where that information gets properly displayed too. Thanks, Dirk. I'm setting its milestone to somewhere around the middle of this year.
Dirk and Andres, I'm updating the subject of this issue to reflect what was discussed here.
For the record, this request amaltaro_TaskChain_PUMCRecyc_HG1805_Validation_180426_130328_6844 in testbed has "valid" content in the Input files and Lumis fields in WMStats. The only possible problem is that it belongs to a successful job (after a retry). Check that for further info...
And this workflow amaltaro_StepChain_ReDigi3_HG1903_Validation_190304_090531_6089 also has the correct data in there, but it prints the PU input files as well. Maybe we could somehow drop those from the WMStats job report.
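If we ever want to drop those, here is a possible sketch (how pileup inputs are flagged is an assumption here; a real fix should rely on whatever marker the framework job report provides, so the predicate is left to the caller):
def drop_pileup_inputs(input_files, is_pileup):
    """
    Sketch: 'input_files' is assumed to be a flat list of LFN strings;
    'is_pileup' is a caller-provided predicate (e.g. matching the PU dataset's
    LFN base path), since the exact marker for PU inputs is not settled here.
    """
    return [lfn for lfn in input_files if not is_pileup(lfn)]

# Example usage with a hypothetical PU LFN base:
# cleaned = drop_pileup_inputs(job['inputfiles'],
#                              lambda lfn: lfn.startswith('/store/mc/SomePremixPU/'))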
Hi @germanfgv @jhonatanamado @amaltaro,
Let me see if I can grasp the goal of this issue correctly. Here is one T0 PromptReco request, PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703, which has failed jobs in it. One may look at the CouchDB record for its failed jobs at this link:
https://cmsweb-testbed.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703
And as far as I can see, every failed job has a lumis field left blank, in which you'd want to have the information for all lumis this job was working on. Is my understanding correct so far?
yes, that's correct @todor-ivanov
Hi German, while working on that and trying to observe the issue with an agent in production, I found that this feature kind of works well in the production system. E.g.: https://cmsweb.cern.ch/wmstatsserver/data/jobdetail/cmsunified_task_EGM-Run3Winter23Digi-00057__v1_T_230511_075100_3810
It seems the lumi lists for the failed jobs are present in this case.
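For the record, the same jobdetail endpoint can also be queried outside the browser; a minimal sketch with Python requests, assuming a valid grid proxy/certificate for cmsweb authentication (the proxy path below is a placeholder):
import requests

workflow = "cmsunified_task_EGM-Run3Winter23Digi-00057__v1_T_230511_075100_3810"
url = "https://cmsweb.cern.ch/wmstatsserver/data/jobdetail/%s" % workflow

# Placeholder: point this at your X509 proxy (e.g. the output of voms-proxy-init).
proxy = "/tmp/x509up_u12345"

resp = requests.get(url, cert=(proxy, proxy),
                    verify=False,  # skip CA verification only for quick interactive checks
                    headers={"Accept": "application/json"})
resp.raise_for_status()
payload = resp.json()
print(list(payload.keys()))  # the job details are wrapped in the response, typically under 'result'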
I just noticed this hasn't been considered in Q2, so we should probably pause this investigation for now and re-evaluate it for Q3. @todor-ivanov
Hi @germanfgv, while working with the above-mentioned workflow, it is indeed missing the lumi lists for broken jobs in t0_reqmon: [1]
But it seems to have them all listed in the workflow summary here: [2]
Wouldn't that suffice?
FYI: @amaltaro
[1] https://cmsweb-testbed.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703
PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703:
/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco:
jobfailed:
8020:
T2_CH_CERN:
errorCount: 2024
samples:
_id: "124e73e3-1c2f-48f3-8947-8352367bf54e-0"
_rev: "19-9e43904494b761cfd799a1d893253270"
wmbsid: 22748
type: "jobsummary"
retrycount: 3
workflow: "PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703"
task: "/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco"
jobtype: "Processing"
state: "jobfailed"
...
lumis:
outputdataset:
inputfiles:
[2] https://cmsweb-testbed.cern.ch/couchdb/t0_workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703
PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703 Summary
No Output
Histogram :
Errors:
/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco
cmsRun1
exit code: 8020
details:
An exception of category 'FileOpenError' occurred while
[0] Constructing the EventProcessor
[1] Constructing input source of type PoolSource
[2] Calling RootInputFileSequence::initTheFile()
[3] Calling StorageFactory::open()
[4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root?eos.app=cmst0'
Additional Info:
[a] Input file root://eoscms.cern.ch//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root?eos.app=cmst0 could not be opened.
[b] XrdCl::File::Open(name='root://eoscms.cern.ch//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root?eos.app=cmst0', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] Unable to open file /eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root; No such file or directory
' (errno=3011, code=400). No additional data servers were found.
[c] Last URL tried: root://eoscms.cern.ch:1094//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root?eos.app=cmst0&tried=&xrdcl.requuid=ed0fe7eb-03a1-4548-92c7-19268716c3b1
[d] Problematic data server: eoscms.cern.ch:1094
[e] Disabled source: eoscms.cern.ch:1094
type:
Fatal Exception
jobs: 12221
run and lumi range
349840
lumi range: [1,3399 - 1,3399]
input
: /store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root
: /store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/f1877c7b-c345-45bf-8cc8-c47bdc76715a.root
: /store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/d69380e2-fc23-4df6-9893-80b6475cee8e.root
...
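Side note: once those fields are filled, extracting them from the jobdetail payload is simple; here is a rough sketch that walks the nesting shown in [1] (workflow -> task -> jobfailed -> exit code -> site -> samples), assuming the document layout stays exactly as quoted above:
def failed_job_samples(jobdetail_doc):
    """
    Walk a jobdetail document shaped like [1] and yield
    (task, exit_code, site, sample) for every failed-job sample, where each
    sample carries the lumis / inputfiles / errors fields shown above.
    """
    for tasks in jobdetail_doc.values():
        for task, states in tasks.items():
            for exit_code, sites in states.get('jobfailed', {}).items():
                for site, info in sites.items():
                    for sample in info.get('samples', []):
                        yield task, exit_code, site, sample

# e.g. print the (currently empty) lumis and inputfiles for each failed job:
# for task, code, site, sample in failed_job_samples(doc):
#     print(task, code, site, sample.get('lumis'), sample.get('inputfiles'))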
Here follow a few more observations and one helpful document added to the WMCore troubleshooting wiki: [1]
While working with the T0 workflows I also checked the Production Validation and made an interesting discovery:
- Some of the failed jobs had their lumi lists properly recorded in wmstatsserver: [2]
- While at the same time the equivalent JavaScript visualization, which is supposed to display the failed job summary in wmstats, was giving some inadequate lists of [0] (see the screenshot attached)
[1] https://github.com/dmwm/WMCore/wiki/trouble-shooting#unpikling-a-failed-job-pset-file-from-logreports
[2] https://cmsweb-testbed.cern.ch/wmstatsserver/data/jobdetail/tivanov_SC_LumiMask_Rules_June2023_Val_230705_172547_6951
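Related to [1]: the PSet shipped with a failed job is a plain Python pickle, so a quick way to peek at it (a sketch, assuming the file pulled out of the log archive is named PSet.pkl and that a CMSSW python environment is set up so the configuration classes can be imported) is:
import pickle

# Assumption: PSet.pkl was extracted from the failed job's logArchive tarball
# and we are running inside a CMSSW environment (cmsenv); otherwise the
# FWCore configuration classes referenced by the pickle cannot be loaded.
with open("PSet.pkl", "rb") as handle:
    process = pickle.load(handle)

# e.g. inspect which input files the failed job was configured to read
print(process.source.fileNames)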
But it seems to have them all listed in the workflow summary here: [2] Wouldn't that suffice?
That's exactly the information we need. How can we get that info visualized in WMStats?
- While at the same time the equivalent JavaScript visualization, which is supposed to display the failed job summary in wmstats, was giving some inadequate lists of [0] (see the screenshot attached)
Exactly, most of the time we get no lumis info in WMStats, but sometimes we get these lists of [0] that don't offer much info.
I am starting to suspect that the way this module behaves strongly depends on the type of failure and the job stage at which it happens.
Just for logging purposes:
I have double-checked all couch views and couchapps in order to prove there is no problem with how we fetch the job-detail information from central CouchDB. And I can tell for sure now: the lumis list is simply not uploaded to central couch, neither for failed jobs nor for successful ones. At least not until the workflow is completed and the workload summary is generated.
For that purpose I have instantiated a WMStatsReader pointing at cmsweb-testbed:
In [1]: from WMCore.Services.WMStats.WMStatsReader import WMStatsReader
In [2]: reqdb_url = 'https://cmsweb-testbed.cern.ch/couchdb/t0_request'
In [3]: wmstats_url = 'https://cmsweb-testbed.cern.ch/couchdb/tier0_wmstats'
In [4]: wmstats = WMStatsReader(wmstats_url, reqdbURL=reqdb_url, reqdbCouchApp="T0Request")
And then I directly called for the job info with a slight modification to the couch view options, such that I print the full view for every job, with no aggregation by error etc.:
In [5]: requestName = 'PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703'
In [6]: options = {'include_docs': True, 'reduce': False, 'startkey': [requestName], 'endkey': [requestName, {}]}
In [7]: results = wmstats._getCouchView("jobsByStatusWorkflow", options)
Out[7]:
{'offset': 336299,
'rows': [{'doc': {'_id': '124e73e3-1c2f-48f3-8947-8352367bf54e-0',
'_rev': '19-9e43904494b761cfd799a1d893253270',
'acdc_url': 'http://localhost:5984/acdcserver',
'agent_name': 'vocms0500.cern.ch',
'cms_location': 'T2_CH_CERN',
'eos_log_url': 'https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco/vocms0500.cern.ch-22748-3-log.tar.gz',
'errors': {'cmsRun1': [{'details': 'An exception of '
'category '
"'FileOpenError' "
...
'exitCode': 8020,
'type': 'Fatal Exception'}],
'logArch1': [],
'stageOut1': []},
'exitcode': 8020,
'inputfiles': [],
'jobtype': 'Processing',
'lumis': [],
'output': [{'checksums': {'adler32': '2b344c72',
'cksum': '884341033'},
'lfn': '/store/unmerged/data/logs/prod/2022/5/11/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco/0000/3/124e73e3-1c2f-48f3-8947-8352367bf54e-0-3-logArchive.tar.gz',
'location': 'T0_CH_CERN_Disk',
'size': 0,
'type': 'logArchive'}],
'outputdataset': {},
'retrycount': 3,
'site': 'T2_CH_CERN',
'state': 'jobfailed',
'state_history': [{'location': 'T2_CH_CERN',
'newstate': 'jobcooloff',
'oldstate': 'jobfailed',
'timestamp': 1652288000},
{'location': 'T2_CH_CERN',
'newstate': 'jobcooloff',
...