opentelemetry-ruby-contrib
feat: Add :force_flush option to Resque instrumentation
Summary
When a Resque worker picks up a job from the queue, it forks off a child process to perform the workload. The child then exits after it has finished performing the job and executed any defined job hooks.
The problem with this is that the child process exits before the OTel span processor (in my case, the exporter's BatchSpanProcessor) can flush its spans. As a result, the send span is exported from the parent process (the worker), but no process span is ever exported from the worker's forked child.
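To make the failure mode concrete, here is a minimal, dependency-free sketch (no OpenTelemetry gems involved; all names are illustrative): a forked child buffers data, standing in for a BatchSpanProcessor's span queue, and must explicitly flush it before exiting, or the parent never sees it.

```ruby
# Simulate a Resque-style fork: the child "records a span" into an
# in-memory buffer and must flush it over a pipe before exiting,
# just as the child process must force_flush its span processor.
reader, writer = IO.pipe

pid = fork do
  reader.close
  buffer = []                     # stands in for the batched span queue
  buffer << 'process-span'        # span recorded while performing the job
  writer.puts(buffer.join(','))   # the "force_flush": export before exit
  writer.close
end

writer.close
Process.wait(pid)
result = reader.read
puts result                       # the parent only sees what was flushed
```

Without the explicit write before the child exits, the buffered "span" would die with the process, which is exactly why this PR flushes the tracer provider in the forked child.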
The implementation has been adapted from the ActiveJob instrumentation. Code Link Here
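For reference, enabling the option would presumably look something like the following. This is a hypothetical usage sketch: the :force_flush key is taken from this PR's title, and the c.use configuration pattern mirrors the other contrib instrumentations.

```ruby
# Hypothetical configuration once this PR is merged; the option name
# :force_flush comes from the PR title, not from released documentation.
require 'opentelemetry/sdk'
require 'opentelemetry-instrumentation-resque'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'resque-worker'
  # Ask the instrumentation to flush the tracer provider in the forked
  # child before it exits, so the process span is actually exported.
  c.use 'OpenTelemetry::Instrumentation::Resque', { force_flush: true }
end
```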
Associated PR: https://github.com/open-telemetry/opentelemetry-ruby/pull/726
Related Conversation
https://cloud-native.slack.com/archives/C01NWKKMKMY/p1660818697973079
Just checking why Ruby 3.1 is sad with me:
1) Error:
OpenTelemetry::Instrumentation::Resque::Patches::ResqueJob::#perform::force_flush::true#test_0001_does forcibly flush the tracer:
MockExpectationError: No more expects available for :force_flush: [] {}
/home/runner/work/opentelemetry-ruby-contrib/opentelemetry-ruby-contrib/instrumentation/resque/lib/opentelemetry/instrumentation/resque/patches/resque_job.rb:60:in `perform'
/home/runner/work/opentelemetry-ruby-contrib/opentelemetry-ruby-contrib/instrumentation/resque/test/opentelemetry/instrumentation/resque/patches/resque_job_test.rb:229:in `work_off_jobs'
/home/runner/work/opentelemetry-ruby-contrib/opentelemetry-ruby-contrib/instrumentation/resque/test/opentelemetry/instrumentation/resque/patches/resque_job_test.rb:216:in `block (6 levels) in <top (required)>'
/home/runner/work/opentelemetry-ruby-contrib/opentelemetry-ruby-contrib/instrumentation/resque/test/opentelemetry/instrumentation/resque/patches/resque_job_test.rb:215:in `block (5 levels) in <top (required)>'
23 runs, 57 assertions, 0 failures, 1 errors, 0 skips
https://github.com/open-telemetry/opentelemetry-ruby-contrib/runs/7973702409?check_suite_focus=true
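For anyone unfamiliar with this error: Minitest::Mock raises "No more expects available" when a mocked method is invoked more times than it was told to expect, which is consistent with the flakiness theory below (an extra job getting worked off calls :force_flush a second time). A small standalone reproduction:

```ruby
require 'minitest/mock'

mock = Minitest::Mock.new
mock.expect(:force_flush, true)   # exactly one call expected

mock.force_flush                  # first call: satisfied
begin
  mock.force_flush                # second call: no expectation left
rescue MockExpectationError => e
  msg = e.message
  puts msg                        # "No more expects available for :force_flush: [] {}"
end
```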
@fbogsany, @ahayworth would either of you mind approving the running of the workflow? 🙏
As per the commit comment (b8eb97e), I believe this has resolved the flakiness. I ran it several times locally, on Ruby 3.1, and it has stopped periodically failing (touch wood 🌳).
CI runs approved (I thought I did that already, weird).
My new commit is likely the culprit. Perhaps you have to approve CI each time?
Maybe... 🤷
Oooof. Minitest::Mock is still not working out for me on Ruby 3. I tried giving each scenario its own mock to combat this flakiness, but at this point I'm unsure.
Some input on this would be appreciated!
@nvolker would you mind taking a look at this again? It seems the test suite is non-deterministic and fails occasionally:
e.g. https://github.com/open-telemetry/opentelemetry-ruby-contrib/actions/runs/3790399834/jobs/6444991776#step:4:357
hey @nvolker, @arielvalentin I'm quite interested in this PR, so I took the liberty to check why the tests are flaky.
It seems that sometimes the Resque jobs are not processed fast enough, so calling work_off_jobs returns a different job than expected.
I've altered the before block to always delete all current jobs first:
before do
  ::Resque::Job.destroy(:super_urgent, ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper)
  ::Resque::Job.destroy(:super_urgent, DummyJob)
  ::Resque::Job.destroy(:super_urgent, BaggageTestingJob)
  ::Resque::Job.destroy(:super_urgent, ExceptionTestingJob)
  instrumentation.install(config)
  exporter.reset
end
That way, I cannot reproduce the flakiness anymore. It doesn't look very nice, so hopefully it can be cleaned up a bit, but I hope my $0.02 helps.
@arielvalentin kind request to rerun CI for this PR. I believe https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/302 should have fixed the issues.
@open-telemetry/ruby-contrib-maintainers kind request to rebase and trigger CI. I believe tests are fixed through #302 and this should be ready to merge.
@scbjans My apologies. I missed the request and allowed this PR to become stale. I have merged main and am waiting on a rebuild.
@nvolker Looks like we have linter errors on this now with latest Rubocop https://github.com/open-telemetry/opentelemetry-ruby-contrib/actions/runs/4346678153/jobs/7593036414