opentelemetry-plugin
opentelemetry-plugin copied to clipboard
Potential deadlock from GraphListenerAdapterToPipelineListener
Jenkins and plugins versions report
Environment
* core: 2.387.3
* opentelemetry:2.11.0
* workflow-api:1208.v0cc7c6e0da_9e
* workflow-cps:3653.v07ea_433c90b_4
* workflow-job:1289.1291.vb_7c188e7e7df
* workflow-support:839.v35e2736cfd5c
Issue
Jenkins shows a deadlock between open-telemetry GraphListenerAdapterToPipelineListener
and the Jenkins initialization thread:
==============
Deadlock Found
==============
"Jenkins initialization thread" id=50 (0x32) state=BLOCKED cpu=95%
- waiting to lock <0x23e558fe> (a org.jenkinsci.plugins.workflow.cps.CpsFlowExecution)
owned by "Running CpsFlowExecution[Owner[myfolder/myjob/17:myfolder/myjob #17]]" id=1246 (0x4de)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.isComplete(CpsFlowExecution.java:1277)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getCurrentExecutions(CpsFlowExecution.java:1019)
at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ResumeStepExecutionListener.onResumed(FlowExecutionList.java:267)
at org.jenkinsci.plugins.workflow.flow.FlowExecutionListener.fireResumed(FlowExecutionListener.java:85)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:589)
at hudson.model.RunMap.retrieve(RunMap.java:233)
at hudson.model.RunMap.retrieve(RunMap.java:61)
at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:650)
at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:632)
at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:530)
at hudson.model.RunMap.getById(RunMap.java:213)
at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:965)
at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:977)
at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:76)
at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:68)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:197)
at jenkins.model.Jenkins.<init>(Jenkins.java:1040)
at hudson.model.Hudson.<init>(Hudson.java:86)
at hudson.model.Hudson.<init>(Hudson.java:82)
at hudson.WebAppMain$3.run(WebAppMain.java:247)
"Running CpsFlowExecution[Owner[myfolder/myjob/17:myfolder/myjob #17]]" id=1246 (0x4de) state=BLOCKED cpu=57%
- waiting to lock <0x76460394> (a hudson.model.RunMap)
owned by "Jenkins initialization thread" id=50 (0x32)
at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:519)
at jenkins.model.lazy.LazyBuildMixIn.getBuildByNumber(LazyBuildMixIn.java:237)
at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:234)
at org.jenkinsci.plugins.workflow.job.WorkflowJob.getBuildByNumber(WorkflowJob.java:105)
at hudson.model.Run.fromExternalizableId(Run.java:2498)
at org.jenkinsci.plugins.workflow.support.steps.build.DownstreamFailureCause.getDownstreamBuild(DownstreamFailureCause.java:47)
at org.jenkinsci.plugins.workflow.support.steps.build.DownstreamFailureCause.getShortDescription(DownstreamFailureCause.java:63)
at io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener.lambda$endCurrentSpan$11(MonitoringPipelineListener.java:308)
at io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener$$Lambda$1490/0x00000008425bd440.apply(Unknown Source)
at [email protected]/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at [email protected]/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at [email protected]/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at [email protected]/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at [email protected]/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at [email protected]/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at [email protected]/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener.endCurrentSpan(MonitoringPipelineListener.java:308)
at io.jenkins.plugins.opentelemetry.job.MonitoringPipelineListener.onEndParallelStepBranch(MonitoringPipelineListener.java:291)
at io.jenkins.plugins.opentelemetry.job.jenkins.GraphListenerAdapterToPipelineListener.fireOnAfterEndParallelStepBranch(GraphListenerAdapterToPipelineListener.java:129)
at io.jenkins.plugins.opentelemetry.job.jenkins.GraphListenerAdapterToPipelineListener.onNewHead(GraphListenerAdapterToPipelineListener.java:64)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1588)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.notifyNewHead(CpsThreadGroup.java:492)
at org.jenkinsci.plugins.workflow.cps.FlowHead.setNewHead(FlowHead.java:158)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1297)
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.croak(CpsFlowExecution.java:882)
at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.reportProblem(CpsVmExecutorService.java:57)
at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:69)
at [email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
at jenkins.util.ErrorLoggingExecutorService$$Lambda$1086/0x0000000841d64840.run(Unknown Source)
at [email protected]/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at [email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at [email protected]/java.lang.Thread.run(Thread.java:829)
Could be due to the usage of a synchronous GraphListener:
- https://github.com/jenkinsci/opentelemetry-plugin/blob/opentelemetry-2.11.0/src/main/java/io/jenkins/plugins/opentelemetry/job/jenkins/GraphListenerAdapterToPipelineListener.java
- https://github.com/jenkinsci/workflow-api-plugin/blob/1208.v0cc7c6e0da_9e/src/main/java/org/jenkinsci/plugins/workflow/flow/GraphListener.java#L48
What Operating System are you using (both controller, and any agents involved in the problem)?
-
JVM Implementation
- Name: OpenJDK 64-Bit Server VM
- Vendor: Red Hat, Inc.
- Version: 11.0.19+7-LTS
-
Operating system
- Name: Linux
- Architecture: amd64
- Version: 3.10.0-1160.92.1.el7.x86_64
Reproduction steps
No consistently reproducible. But basically:
- launch many many resumable pipelines
- restart Jenkins
Expected Results
No deadlocks
Actual Results
deadlock
Anything else?
No response
Did you mean to show a thread dump excerpt here?
@jglick it's in the "Issue" <details>
in the description.
Sorry about that. I tried to follow the issue template of the project. I can make it more visible.
@Dohbedoh @dwnusbaum do you have recommendations on how we can fix this? I looked at the logic of the code and it sounded reasonable to me.
@cyrille-leclerc GraphListenerAdapterToPipelineListener
should probably not extend GraphListener.Synchronous
, just GraphListener
, see https://github.com/jenkinsci/workflow-api-plugin/blob/ca5fddb3fcebcc6acf823c80600ec7ab9deb0d05/src/main/java/org/jenkinsci/plugins/workflow/flow/GraphListener.java#L48. Or perhaps you could modify it to only perform a small subset of the current logic synchronously if necessary and then offload all of the complex behavior that acquires locks to a separate thread.
Thanks @dwnusbaum .
Some of it definitively needs to be synchronous so that we can associate a span with the FlowNode
. I'll brainstorm to see if something can be made async and how we would orchestrate the sync and async parts.