workflow-api-plugin icon indicating copy to clipboard operation
workflow-api-plugin copied to clipboard

Ensure log streams are flushed before garbage collection when using `GCFlushedOutputStream` by switching to `Cleaner`

Open dwnusbaum opened this issue 8 months ago • 3 comments

See https://github.com/jenkinsci/workflow-api-plugin/pull/83#discussion_r2018964003. Alternative to https://github.com/jenkinsci/workflow-api-plugin/pull/387.

Testing done

See new automated tests. I ran them manually against pipeline-cloudwatch-logs, which does not use the classes changed by this PR and so is unaffected (the new tests are currently skipped, but they pass even if not skipped), as well as CloudBees Pipeline Explorer, which passes with the new tests.

Submitter checklist

  • [x] Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • [x] Ensure that the pull request title represents the desired changelog entry
  • [x] Please describe what you did
  • [ ] Link to relevant issues in GitHub or Jira
  • [x] Link to relevant pull requests, esp. upstream and downstream changes
  • [x] Ensure you have provided tests - that demonstrates feature works or fixes the issue

dwnusbaum avatar Mar 28 '25 18:03 dwnusbaum

run it through the PCT

and pipeline-cloudwatch-logs (need an AWS account to run and some vars like AWS_PROFILE, AWS_REGION, AWS_CHAINED_ROLE, CLOUDWATCH_LOG_GROUP_NAME)

jglick avatar Apr 08 '25 17:04 jglick

I finally had a bit of time to test this. As far as I can see, pipeline-cloudwatch-logs does not care either way, because it does not use BufferedBuildListener, GCFlushedOutputStream, or DelayBufferedOutputStream. I did run its tests though and they passed with this PR as well as #387. Tests in workflow-durable-task-step also passed with both PRs.

The only way I could figure out to make LogStorageTestBase#remoting require GCFlushedOutputStream to pass with this PR was as follows:

  • Replace all uses of DelayBufferedOutputStream with BufferedOutputStream
    • FileLogStorageTest#remoting was still passing
  • Remove the call to flush in LogStorageTestBase$RemotePrint
    • FileLogStorageTest#remoting was failing consistently because the remote log output was never received by the controller
  • Adjust LogStorageTestBase$GC to call System#gc repeatedly in a loop for around 5 seconds so that the cleanup task would run (I tried to investigate why this was needed, but acquiring a heap dump consistently triggered the cleanup task immediately. Calling System#gc once and sleeping for a long time did not consistently trigger cleanup)
    • FileLogStorageTest#remoting was passing, but flaky, because the listeners in the two RemotePrint calls were being flushed in random order
  • Add a call to GC after the first RemotePrint call, to ensure that it gets flushed before the second RemotePrint call
    • FileLogStorageTest#remoting was passing consistently

I then went back and tested the same test changes without this PR.

  • Initially, FileLogStorageTest#remoting was flaky, because the listeners in the two RemotePrint calls were being flushed in random order
  • I added a 10 second sleep call after the two RemotePrint calls (to give the timer task in FlushRef time to run
    • FileLogStorageTest#remoting was now passing consistently
  • I then removed all calls to GC, so that the test no longer called System#gc at all
    • FileLogStorageTest#remoting was still passing consistently (!)
  • I then deleted the 10 second sleep after the two RemotePrint calls, and instead added a 10 second sleep inside of the two RemotePrint calls, at the end of which I printed a log message that referenced the listener, to ensure that it could not be GC'd in that time frame
    • FileLogStorageTest#remoting was still passing consistently, and the log messages showed that FlushRef was running the cleanup task while the stream was still reachable (!)

So, here are my key takeaways:

  • My suspicions in https://github.com/jenkinsci/workflow-api-plugin/pull/83#discussion_r2018964003 seem to be correct: The current behavior of GCFlushedOutputStream is a 1-time flush within 10 seconds of the the stream's creation, regardless of the stream's reachability. I suspect the reason that #83 fixed JENKINS-54566 was just that it deleted the problematic finalize override.
  • I can create a test that specifically requires GCFlushedOutputStream for FileLogStorage by using DelayBufferedOutputStream.Tuning to disable the auto-flushing and adding an option to RemotePrint to not flush after writing, but it will be a pretty artificial test
    • pipeline-cloudwatch-logs would pass this test despite having no equivalent to GCFlushedOutputStream, since the only buffer that plugin uses is a LineTransformationOutputStream, so we'd have to write a line with no line ending to see a difference in behavior, but that starts to feel extremely artificial
    • cloudbees-pipeline-explorer should pass this test due to reuse of BufferedBuildListener
    • IDK about opentelemetry
  • It is not clear to me whether there is a real-world case where GCFlushedOutputStream actually matters. Given that we haven't run into any known problems since #83, I am still somewhat inclined to just delete this class.

dwnusbaum avatar Jul 03 '25 23:07 dwnusbaum

I am still somewhat inclined to just delete this class

Go ahead if you think nothing would be harmed. I only dimly remember the original justification. These days I am mostly working in USE_WATCHING mode in workflow-durable-task-step (I should probably just go ahead and make that the default), which can lose a few lines across controller restarts but I never found a way to prevent that without degrading performance.

jglick avatar Jul 14 '25 19:07 jglick