Debugging KilledWorker exceptions appearing at scale

Open • alexander-held opened this issue 2 years ago • 1 comment

When running this CMS Open Data ttbar analysis on the UChicago coffea-casa instance over the full set of input files with a pure coffea setup, RuntimeError exceptions caused by KilledWorker typically start appearing roughly halfway through the pre-processing stage:

KilledWorker: ('automatic_retries-5a5ee0b8-ee99-4ff5-8534-7e2373a71078-19743', <WorkerState 'tls://c006.af.uchicago.edu:36499', name: htcondor--194595.0--, status: closed, memory: 0, processing: 74>)

RuntimeError: Work item FileMeta(https://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/WJetsToLNu_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext2-v1/60002/1C9F04FB-72DB-E511-AFF7-0CC47A4D9A70.root:events) caused a KilledWorker exception (likely a segfault or out-of-memory issue)

The filename changes between repeated runs, so it does not seem to be related to a specific problematic input. Here is another example:

RuntimeError: Work item FileMeta(https://xrootd-local.unl.edu:1094//store/user/AGC/datasets/RunIIFall15MiniAODv2/ST_tW_top_5f_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/70000/8E03C8E8-C7B8-E511-8A04-00259029E84C.root:events) caused a KilledWorker exception (likely a segfault or out-of-memory issue)
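To check more systematically whether any file recurs across runs, the failing paths can be tallied from saved logs. This is a minimal sketch, assuming the output of each run has been captured as plain text. The helper names `failed_files` and `tally_failures` are hypothetical, and the regular expression is written only against the RuntimeError message format quoted above.

```python
import re
from collections import Counter

# Matches the "Work item FileMeta(<path>.root:<tree>)" part of the
# RuntimeError message quoted above (format assumed from those examples).
WORK_ITEM = re.compile(r"Work item FileMeta\((?P<path>\S+?\.root):(?P<tree>\w+)\)")


def failed_files(log_text):
    """Return the file paths named in KilledWorker-related RuntimeError lines."""
    paths = []
    for line in log_text.splitlines():
        if "KilledWorker" not in line:
            continue
        match = WORK_ITEM.search(line)
        if match:
            paths.append(match.group("path"))
    return paths


def tally_failures(run_logs):
    """Count, per file, in how many runs it triggered a failure."""
    counts = Counter()
    for log_text in run_logs:
        # Use a set so a file is counted at most once per run.
        counts.update(set(failed_files(log_text)))
    return counts
```

Calling `tally_failures` on a list of log strings, one per run, returns a `Counter`. If every file has a count of 1 after several runs, that supports the observation that the failures are not tied to a specific input.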

I am not sure how to best debug this further and would be happy to try out some suggestions.

alexander-held, Jul 11 '22 10:07