workflow-api-plugin
workflow-api-plugin copied to clipboard
Ensure log streams are flushed before garbage collection when using `GCFlushedOutputStream` by switching to `Cleaner`
See https://github.com/jenkinsci/workflow-api-plugin/pull/83#discussion_r2018964003. Alternative to https://github.com/jenkinsci/workflow-api-plugin/pull/387.
Testing done
See new automated tests. I ran them manually against pipeline-cloudwatch-logs, which does not use the classes changed by this PR and so is unaffected (the new tests are currently skipped, but they pass even if not skipped), as well as CloudBees Pipeline Explorer, which passes with the new tests.
Submitter checklist
- [x] Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
- [x] Ensure that the pull request title represents the desired changelog entry
- [x] Please describe what you did
- [ ] Link to relevant issues in GitHub or Jira
- [x] Link to relevant pull requests, esp. upstream and downstream changes
- [x] Ensure you have provided tests - that demonstrates feature works or fixes the issue
run it through the PCT
and pipeline-cloudwatch-logs (need an AWS account to run and some vars like AWS_PROFILE, AWS_REGION, AWS_CHAINED_ROLE, CLOUDWATCH_LOG_GROUP_NAME)
I finally had a bit of time to test this. As far as I can see, pipeline-cloudwatch-logs does not care either way, because it does not use BufferedBuildListener, GCFlushedOutputStream, or DelayBufferedOutputStream. I did run its tests though and they passed with this PR as well as #387. Tests in workflow-durable-task-step also passed with both PRs.
The only way I could figure out to make LogStorageTestBase#remoting require GCFlushedOutputStream to pass with this PR was as follows:
- Replace all uses of
DelayBufferedOutputStreamwithBufferedOutputStreamFileLogStorageTest#remotingwas still passing
- Remove the call to
flushinLogStorageTestBase$RemotePrintFileLogStorageTest#remotingwas failing consistently because the remote log output was never received by the controller
- Adjust
LogStorageTestBase$GCto callSystem#gcrepeatedly in a loop for around 5 seconds so that the cleanup task would run (I tried to investigate why this was needed, but acquiring a heap dump consistently triggered the cleanup task immediately. CallingSystem#gconce and sleeping for a long time did not consistently trigger cleanup)FileLogStorageTest#remotingwas passing, but flaky, because the listeners in the twoRemotePrintcalls were being flushed in random order
- Add a call to
GCafter the firstRemotePrintcall, to ensure that it gets flushed before the secondRemotePrintcallFileLogStorageTest#remotingwas passing consistently
I then went back and tested the same test changes without this PR.
- Initially,
FileLogStorageTest#remotingwas flaky, because the listeners in the twoRemotePrintcalls were being flushed in random order - I added a 10 second sleep call after the two
RemotePrintcalls (to give the timer task inFlushReftime to runFileLogStorageTest#remotingwas now passing consistently
- I then removed all calls to
GC, so that the test no longer calledSystem#gcat allFileLogStorageTest#remotingwas still passing consistently (!)
- I then deleted the 10 second sleep after the two
RemotePrintcalls, and instead added a 10 second sleep inside of the twoRemotePrintcalls, at the end of which I printed a log message that referenced thelistener, to ensure that it could not be GC'd in that time frameFileLogStorageTest#remotingwas still passing consistently, and the log messages showed thatFlushRefwas running the cleanup task while the stream was still reachable (!)
So, here are my key takeaways:
- My suspicions in https://github.com/jenkinsci/workflow-api-plugin/pull/83#discussion_r2018964003 seem to be correct: The current behavior of
GCFlushedOutputStreamis a 1-time flush within 10 seconds of the the stream's creation, regardless of the stream's reachability. I suspect the reason that #83 fixed JENKINS-54566 was just that it deleted the problematicfinalizeoverride. - I can create a test that specifically requires
GCFlushedOutputStreamforFileLogStorageby usingDelayBufferedOutputStream.Tuningto disable the auto-flushing and adding an option toRemotePrintto not flush after writing, but it will be a pretty artificial testpipeline-cloudwatch-logswould pass this test despite having no equivalent toGCFlushedOutputStream, since the only buffer that plugin uses is aLineTransformationOutputStream, so we'd have to write a line with no line ending to see a difference in behavior, but that starts to feel extremely artificialcloudbees-pipeline-explorershould pass this test due to reuse ofBufferedBuildListener- IDK about
opentelemetry
- It is not clear to me whether there is a real-world case where
GCFlushedOutputStreamactually matters. Given that we haven't run into any known problems since #83, I am still somewhat inclined to just delete this class.
I am still somewhat inclined to just delete this class
Go ahead if you think nothing would be harmed. I only dimly remember the original justification. These days I am mostly working in USE_WATCHING mode in workflow-durable-task-step (I should probably just go ahead and make that the default), which can lose a few lines across controller restarts but I never found a way to prevent that without degrading performance.