opentelemetry-ruby-contrib
feat: Add :force_flush option to Resque instrumentation
Summary
When a Resque worker picks up a job from the queue, it forks off a child process to perform the workload. The child then exits after it has finished performing the job and executed any defined job hooks.
The problem with this is that the child process exits before the OTel span processor (in my case, the exporter's BatchSpanProcessor) can flush its spans. As a result, the send span is exported from the parent process (the worker), but no process span is ever exported from the worker's forked child.
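To make the failure mode concrete, here is a minimal, dependency-free sketch (no OpenTelemetry gems involved; all names are illustrative): a forked child buffers data, standing in for a BatchSpanProcessor's span queue, and must explicitly flush it before exiting, or the parent never sees it.

```ruby
# Simulate a Resque-style fork: the child "records a span" into an
# in-memory buffer and must flush it over a pipe before exiting,
# just as the child process must force_flush its span processor.
reader, writer = IO.pipe

pid = fork do
  reader.close
  buffer = []                     # stands in for the batched span queue
  buffer << 'process-span'        # span recorded while performing the job
  writer.puts(buffer.join(','))   # the "force_flush": export before exit
  writer.close
end

writer.close
Process.wait(pid)
result = reader.read
puts result                       # the parent only sees what was flushed
```

Without the explicit write before the child exits, the buffered "span" would die with the process, which is exactly why this PR flushes the tracer provider in the forked child.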
The implementation has been adapted from the ActiveJob instrumentation. Code Link Here
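For reference, enabling the option would presumably look something like the following. This is a hypothetical usage sketch: the :force_flush key is taken from this PR's title, and the c.use configuration pattern mirrors the other contrib instrumentations.

```ruby
# Hypothetical configuration once this PR is merged; the option name
# :force_flush comes from the PR title, not from released documentation.
require 'opentelemetry/sdk'
require 'opentelemetry-instrumentation-resque'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'resque-worker'
  # Ask the instrumentation to flush the tracer provider in the forked
  # child before it exits, so the process span is actually exported.
  c.use 'OpenTelemetry::Instrumentation::Resque', { force_flush: true }
end
```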
Associated PR: https://github.com/open-telemetry/opentelemetry-ruby/pull/726
Related Conversation
https://cloud-native.slack.com/archives/C01NWKKMKMY/p1660818697973079
Just checking why Ruby 3.1 is sad with me:
1) Error:
OpenTelemetry::Instrumentation::Resque::Patches::ResqueJob::#perform::force_flush::true#test_0001_does forcibly flush the tracer:
MockExpectationError: No more expects available for :force_flush: [] {}
/home/runner/work/opentelemetry-ruby-contrib/opentelemetry-ruby-contrib/instrumentation/resque/lib/opentelemetry/instrumentation/resque/patches/resque_job.rb:60:in `perform'
/home/runner/work/opentelemetry-ruby-contrib/opentelemetry-ruby-contrib/instrumentation/resque/test/opentelemetry/instrumentation/resque/patches/resque_job_test.rb:229:in `work_off_jobs'
/home/runner/work/opentelemetry-ruby-contrib/opentelemetry-ruby-contrib/instrumentation/resque/test/opentelemetry/instrumentation/resque/patches/resque_job_test.rb:216:in `block (6 levels) in <top (required)>'
/home/runner/work/opentelemetry-ruby-contrib/opentelemetry-ruby-contrib/instrumentation/resque/test/opentelemetry/instrumentation/resque/patches/resque_job_test.rb:215:in `block (5 levels) in <top (required)>'
23 runs, 57 assertions, 0 failures, 1 errors, 0 skips
https://github.com/open-telemetry/opentelemetry-ruby-contrib/runs/7973702409?check_suite_focus=true
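For anyone unfamiliar with this error: Minitest::Mock raises "No more expects available" when a mocked method is invoked more times than it was told to expect, which is consistent with the flakiness theory below (an extra job getting worked off calls :force_flush a second time). A small standalone reproduction:

```ruby
require 'minitest/mock'

mock = Minitest::Mock.new
mock.expect(:force_flush, true)   # exactly one call expected

mock.force_flush                  # first call: satisfied
begin
  mock.force_flush                # second call: no expectation left
rescue MockExpectationError => e
  msg = e.message
  puts msg                        # "No more expects available for :force_flush: [] {}"
end
```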
@fbogsany, @ahayworth would either of you mind approving the running of the workflow? 🙏
As per the commit comment (b8eb97e), I believe this has resolved the flakiness. I ran it several times locally, on Ruby 3.1, and it has stopped periodically failing (touch wood 🌳).
CI runs approved (I thought I did that already, weird).
My new commit is likely the culprit. Perhaps you have to approve CI each time?
Maybe... 🤷
Oooof. Minitest::Mock is still not working out for me on Ruby 3. I tried giving each scenario its own mock to combat this flakiness, but at this point I'm unsure.
Some input on this would be appreciated!
@nvolker would you mind taking a look at this again? It seems the test suite is non-deterministic and fails occasionally:
e.g. https://github.com/open-telemetry/opentelemetry-ruby-contrib/actions/runs/3790399834/jobs/6444991776#step:4:357
hey @nvolker, @arielvalentin I'm quite interested in this PR, so I took the liberty to check why the tests are flaky.
It seems that sometimes the Resque jobs are not processed fast enough, so calling work_off_jobs returns a different job than expected.
I've altered the before block to always delete all current jobs first:
before do
  ::Resque::Job.destroy(:super_urgent, ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper)
  ::Resque::Job.destroy(:super_urgent, DummyJob)
  ::Resque::Job.destroy(:super_urgent, BaggageTestingJob)
  ::Resque::Job.destroy(:super_urgent, ExceptionTestingJob)
  instrumentation.install(config)
  exporter.reset
end
That way, I cannot reproduce the flakiness anymore. It doesn't look very nice, so hopefully it can be cleaned up a bit, but I hope my $0.02 helps.
@arielvalentin kind request to rerun CI for this PR. I believe https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/302 should have fixed the issues.
@open-telemetry/ruby-contrib-maintainers kind request to rebase and trigger CI. I believe tests are fixed through #302 and this should be ready to merge.
@scbjans My apologies. I missed the request and allowed this PR to become stale. I have merged main and am waiting on a rebuild.
@nvolker Looks like we have linter errors on this now with latest Rubocop https://github.com/open-telemetry/opentelemetry-ruby-contrib/actions/runs/4346678153/jobs/7593036414