workflow-cps-plugin icon indicating copy to clipboard operation
workflow-cps-plugin copied to clipboard

Pause executions when terminating even when not quieting down

Open jglick opened this issue 1 year ago • 3 comments

I observed a large stream of errors from a controller shutting down. This was part of a Kubernetes restart so the pod would have gotten a termination signal yet Jenkins is not “quieting down”.

Redacted examples of errors
…	WARNING	o.j.p.w.cps.CpsVmExecutorService#reportProblem: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[…]]
org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1@25af3d30. The executor service is shutting down
	at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:116)
	at jenkins.util.InterceptingExecutorService.execute(InterceptingExecutorService.java:71)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.notifyNewHead(CpsThreadGroup.java:502)
	at org.jenkinsci.plugins.workflow.cps.FlowHead.setNewHead(FlowHead.java:158)
	at org.jenkinsci.plugins.workflow.cps.CpsBodyExecution.addBodyEndFlowNode(CpsBodyExecution.java:414)
	at org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$FailureAdapter.receive(CpsBodyExecution.java:345)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.throwException(ContinuationGroup.java:211)
	at com.cloudbees.groovy.cps.impl.CastBlock$ContinuationImpl.cast(CastBlock.java:67)
	at jdk.internal.reflect.GeneratedMethodAccessor677.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.dispatch(CollectionLiteralBlock.java:55)
	at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.item(CollectionLiteralBlock.java:45)
	at jdk.internal.reflect.GeneratedMethodAccessor636.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:152)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:146)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:146)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:187)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:423)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:331)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:295)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:97)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
…	WARNING	o.j.p.w.cps.CpsBodyExecution$2#onFailure: could not cancel CpsStepContext[313:parallel]:Owner[…]
Also:   org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1@25af3d30. The executor service is shutting down
		at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:116)
		at jenkins.util.InterceptingExecutorService.execute(InterceptingExecutorService.java:71)
		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.notifyNewHead(CpsThreadGroup.java:502)
		at org.jenkinsci.plugins.workflow.cps.FlowHead.setNewHead(FlowHead.java:158)
		at org.jenkinsci.plugins.workflow.cps.CpsBodyExecution.addBodyEndFlowNode(CpsBodyExecution.java:414)
		at org.jenkinsci.plugins.workflow.cps.CpsBodyExecution$FailureAdapter.receive(CpsBodyExecution.java:345)
		at com.cloudbees.groovy.cps.impl.ContinuationGroup.throwException(ContinuationGroup.java:211)
		at com.cloudbees.groovy.cps.impl.CastBlock$ContinuationImpl.cast(CastBlock.java:67)
		at jdk.internal.reflect.GeneratedMethodAccessor677.invoke(Unknown Source)
		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.base/java.lang.reflect.Method.invoke(Method.java:568)
		at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
		at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.dispatch(CollectionLiteralBlock.java:55)
		at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.item(CollectionLiteralBlock.java:45)
		at jdk.internal.reflect.GeneratedMethodAccessor636.invoke(Unknown Source)
		at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.base/java.lang.reflect.Method.invoke(Method.java:568)
		at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
		at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
		at com.cloudbees.groovy.cps.Next.step(Next.java:83)
		at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:152)
		at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:146)
		at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
		at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
		at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:146)
		at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
		at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
		at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:187)
		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:423)
		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:331)
		at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:295)
		at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:97)
…	WARNING	o.j.p.w.cps.CpsStepContext$2#onFailure: Failed to proceed after CpsStepContext[319:podTemplate]:Owner[…]
org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command java.util.concurrent.FutureTask@3f381a40[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@6367fef2[Wrapped task = org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1@1e5feb9]]. The executor service is shutting down
	at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:116)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
	at jenkins.util.InterceptingExecutorService.submit(InterceptingExecutorService.java:46)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:970)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:965)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:147)
	at org.jenkinsci.plugins.workflow.support.concurrent.DirectExecutor.execute(DirectExecutor.java:33)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1298)
	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:762)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:133)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:157)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:97)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.runInCpsVmThread(CpsFlowExecution.java:965)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.scheduleNextRun(CpsStepContext.java:381)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.completed(CpsStepContext.java:326)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onFailure(CpsStepContext.java:314)
	at org.jenkinsci.plugins.workflow.steps.StepExecution.stop(StepExecution.java:109)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService.reportProblem(CpsVmExecutorService.java:72)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:99)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

From earlier log messages I know all this occurred after suspendAll() completed. Yet by looking at the line numbers you can see https://github.com/jenkinsci/workflow-cps-plugin/blob/b3ff2ae3eeafbdabeca528451bf873ee99638884/plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsThreadGroup.java#L331 being run, which is after the checks for a paused program. This seems wrong: if Jenkins is not just quieting down but actively terminating, and has in fact attained the EXECUTIONS_SUSPENDED milestone (#766), we would not expect CPS chunks to still be run. Apparently they do not run well (the executor service has already been shut down, hence these errors), but perhaps the process is still writing other metadata files even after checkpoint has been asked to save everything for the last time.

I could not figure out how to reproduce this situation in a test. The lack of pausing is reproduced easily in JenkinsSessionRule. RealJenkinsRule normally enters quiet-down mode which activates the Pausing (Preparing for shutdown) message, but even without https://github.com/jenkinsci/jenkins-test-harness/pull/544 you can simply replace a call to stopJenkins with runRemotely on

/**
 * Calls {@link Jenkins#doExit} which is more like {@link WebAppMain#contextDestroyed} than {@link RealJenkinsRule#stopJenkins}
 * in that it does not call {@link Jenkins#doSafeExit} and thus {@link Jenkins#doQuietDown()}.
 */
private static void exit(JenkinsRule r) throws Throwable {
    r.jenkins.doExit(null, null);
}

What I do not know how to reproduce is the FatalRejectedExecutionExceptions shown above. What causes the program to trigger work during this time? Maybe I need to artificially slow down shutdown and then have some sh steps finish?

jglick avatar Mar 06 '24 18:03 jglick

Managed to reproduce a stack trace, though not in fact involving CPS code:

Not really successful attempt at reproduction
WARNING	o.j.p.w.cps.CpsStepContext$2#onFailure: Failed to proceed after CpsStepContext[3:semaphore]:Owner[p/1:p #1]
org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command java.util.concurrent.FutureTask@68ab1a80[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@7fb9a08c[Wrapped task = org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1@4f086a9f]]. The executor service is shutting down
	at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:116)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
	at jenkins.util.InterceptingExecutorService.submit(InterceptingExecutorService.java:46)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:975)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:970)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:147)
	at org.jenkinsci.plugins.workflow.support.concurrent.DirectExecutor.execute(DirectExecutor.java:33)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:760)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:134)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:157)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:97)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.runInCpsVmThread(CpsFlowExecution.java:970)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.scheduleNextRun(CpsStepContext.java:381)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.completed(CpsStepContext.java:326)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onSuccess(CpsStepContext.java:318)
	at org.jenkinsci.plugins.workflow.test.steps.SemaphoreStep.success(SemaphoreStep.java:123)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecutionTest.resumeStepDuringShutdown(CpsFlowExecutionTest.java:435)

jglick avatar Mar 06 '24 19:03 jglick

While restarting a controller running

while (true) {
    sleep 1
}

I did see

Stack trace
WARNING	o.j.p.w.cps.CpsStepContext$2#onFailure: Failed to proceed after CpsStepContext[3312:sleep]:Owner[sleepy/1:sleepy #1]
org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command java.util.concurrent.FutureTask@5e80b055[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@770436cf[Wrapped task = org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$$Lambda$2874/0x00007f3109567980@223047cf]]. The executor service is shutting down
	at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:116)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
	at jenkins.util.InterceptingExecutorService.submit(InterceptingExecutorService.java:46)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:975)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:970)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:147)
	at org.jenkinsci.plugins.workflow.support.concurrent.DirectExecutor.execute(DirectExecutor.java:33)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1298)
	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:762)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:133)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:157)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:97)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.runInCpsVmThread(CpsFlowExecution.java:970)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.scheduleNextRun(CpsStepContext.java:381)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.completed(CpsStepContext.java:326)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onSuccess(CpsStepContext.java:318)
	at org.jenkinsci.plugins.workflow.steps.SleepStep$Execution.lambda$setupTimer$0(SleepStep.java:100)
	at …

but that seems expected.

jglick avatar May 15 '24 19:05 jglick

Had a job running like

checkpoint 'x' // CloudBees CI
build job: JOB_NAME, wait: false

and pressed Ctrl-C on java -jar. Got

…
INFO	jenkins.model.Jenkins$16#onAttained: Started termination
…
WARNING	o.j.p.w.cps.CpsFlowExecution$8#onFailure
org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command java.util.concurrent.FutureTask@1a496d0f[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@406f73a7[Wrapped task = org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$$Lambda$1032/0x00007fc07cece900@10cd21b1]]. The executor service is shutting down
	at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:116)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
	at jenkins.util.InterceptingExecutorService.submit(InterceptingExecutorService.java:46)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:975)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:970)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:147)
	at org.jenkinsci.plugins.workflow.support.concurrent.DirectExecutor.execute(DirectExecutor.java:33)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1298)
	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:762)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:133)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:157)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:97)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.runInCpsVmThread(CpsFlowExecution.java:970)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.suspendAll(CpsFlowExecution.java:1697)
	at …
WARNING	o.j.p.workflow.job.WorkflowRun#getListener
java.lang.IllegalStateException: trying to open a build log on x #746 after it has completed
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getListener(WorkflowRun.java:232)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.getListener(WorkflowRun.java:1024)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.suspendAll(CpsFlowExecution.java:1708)
	at …
INFO	jenkins.model.Jenkins$16#onAttained: Completed termination
…

jglick avatar May 23 '24 12:05 jglick

This seems to fix a problem affecting an admittedly artificial test pipeline involving running hundreds or even thousands of echo steps (synchronous blocking) in a tight Groovy loop. Before, suspendAll would hang on this build, continuing to run more echo steps and doing a lot of FlowNode I/O work, with the whole JVM eventually getting killed off due to exceeding its K8s termination grace period, leading to a corrupt build (CloudBees-internal issue). With this patch, the build suspends smoothly and promptly and can be resumed later. However it improperly resumes from the last checkpoint, not after the last echo step, which I still need to understand (another CloudBees-internal issue, not apparently solved by #923).

jglick avatar Jun 26 '25 19:06 jglick

Ah I was just missing a save call in the new path.

jglick avatar Jun 26 '25 20:06 jglick