dd-trace-rb

Report Linux native thread ids in profiles for Ruby < 3.1

Open ivoanjo opened this issue 1 year ago • 1 comments

What does this PR do?:

This PR adds support for reading the actual native thread id (tid) for Ruby threads in Ruby < 3.1.

Starting in Ruby 3.1, Ruby stores the tid for every thread it creates, and even exposes it via the Thread#native_thread_id method.

The tid is relevant because it's what shows up in OS-level tools, such as task managers, or even the Datadog native profiler, thus enabling the correlation of what a thread was doing from both inside and outside the VM.

Up until now, the profiler was only able to report this tid on Ruby 3.1+. For older Rubies, we used a fallback id, which was just as unique as the tid, but was not useful for correlating what a thread was doing with other tools.
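To make the difference between the two ids concrete, here is a minimal sketch of the "real tid if available, otherwise fallback" pattern. The helper name `thread_id_for` is hypothetical, and using `object_id` as the fallback is an assumption for illustration only; it is not necessarily what the profiler uses.

```ruby
# Hypothetical helper for illustration; not part of dd-trace-rb.
# Returns [kind, id], where kind is :native or :fallback.
def thread_id_for(thread)
  # Thread#native_thread_id exists on Ruby 3.1+ (on supported platforms).
  if thread.respond_to?(:native_thread_id)
    tid = thread.native_thread_id
    # It can be nil (for example, for a thread that already finished).
    return [:native, tid] if tid
  end

  # Assumption: object_id stands in for "a unique id that is not the tid".
  # It is unique per live thread, but OS-level tools know nothing about it.
  [:fallback, thread.object_id]
end

kind, id = thread_id_for(Thread.current)
puts "#{kind} id for the current thread: #{id}"
```

Only the `:native` value can be matched against what tools outside the VM report, which is the gap this PR is trying to close for older Rubies.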

This PR introduces a new LinuxTidFallback class, which, together with some dark magic implemented in linux_tid_from_pthread.c (see comments on that file for details), is able to retrieve the tid for threads, even though the Linux libc developers really don't want us to do that.

Motivation:

Having a tid, as I mentioned above, allows us to correlate what happens inside the Ruby VM with what external tools can observe as well.

The tid is especially visible in the upcoming profiler timeline feature, which is why I decided to finally take a stab at improving this as an R&D week project.

Additional Notes:

This new feature is on by default, but can be disabled via configuration.

I made it configurable only as a "just-in-case" lever for support; I do not recommend disabling it.

Although the profiler only officially supports Linux, we like to keep it running on macOS and others for ease of development and experimentation. This feature appropriately degrades on those OSs and disables itself.

There are a few situations, Linux-permissions-wise, where this approach may not work. In those cases (running in CircleCI is one of them), we also degrade gracefully back to the previous behavior.

How to test the change?:

This change includes test coverage. You can also see it in action by running the profiler on Linux + Ruby < 3.1 and checking that the tids reported are actual Linux tids, matching what you see in any task manager tool such as htop (do remember to turn on the option to show individual threads, since most tools don't show them by default).
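If you'd rather not eyeball htop, a rough way to get the list of tids to compare against is to read them from /proc from inside the Ruby process. This is a minimal sketch and not part of this PR; `linux_thread_ids` is a hypothetical helper name, and it assumes /proc is mounted (it returns an empty list otherwise, for example on macOS).

```ruby
# Hypothetical helper for illustration; not part of dd-trace-rb.
# On Linux, /proc/self/task has one directory per native thread of the
# current process, and each directory is named after that thread's tid.
def linux_thread_ids
  task_dir = "/proc/self/task"
  return [] unless File.directory?(task_dir)

  Dir.entries(task_dir)
     .grep(/\A\d+\z/) # skips "." and ".."
     .map(&:to_i)
     .sort
end

puts "Native thread ids for pid #{Process.pid}: #{linux_thread_ids.inspect}"
```

The tids reported in the profile should all show up in this list. Note that the list also includes native threads that are not Ruby threads (for instance, ones created by native extensions), so it can be longer than Thread.list.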

ivoanjo, Aug 17 '23 13:08

Thanks for the patience!

I've replied to all the notes + applied a few suggested changes. I've also gotten a :+1: from @nsavoire on the weird C bits, so if y'all are happy with the Ruby bits, I think it's time to get this show on the road.

ivoanjo, Aug 23 '23 14:08

It's been almost a year since I opened this. This was an R&D week thing, and I ended up not getting back to it, and then it kinda fell into the background.

In the meantime, two things happened:

  • Ruby added M:N threads (https://bugs.ruby-lang.org/issues/19842), which will make native thread ids way less relevant going forward (at least if they get adopted by the community -- it's unclear when that'll happen)
  • I've learned from @sanchda that there are actually alternative ways of doing what process_vm_readv does, which we could use as a fallback when process_vm_readv is not available/disabled.

So, rather than leave this PR in an eternal zombie maybe-I'll-get-back-to-it-maybe-I-won't state, I'll close it for now. It'll stay behind as documentation of the idea, and in the future we may consider resurrecting it in some form or other.

Thanks everyone for the amazing feedback, btw! I learned a lot going through the discussion above.

ivoanjo, Jul 02 '24 09:07