dd-trace-rb icon indicating copy to clipboard operation
dd-trace-rb copied to clipboard

Formalize Barrier behavior during waiting

Open p-datadog opened this issue 1 year ago • 1 comments

What does this PR do?

This PR modifies the Barrier class to have, in my opinion, more understandable/easier to reason about timeout behavior during the waits.

The new behavior is as follows:

  • A timeout can be defined for the barrier. Any wait will wait up to this timeout from the moment the barrier is created (and its corresponding operation, presumably, started).
  • If a timeout is provided for the individual wait operation, the wait will be no longer than this timeout (but would not end later than barrier creation + barrier timeout).
  • If multiple waits are done, they will continue waiting until the barrier timeout elapses from barrier construction (or until the individual wait timeouts, whichever is shorter).

Previous behavior:

  • A timeout can be defined for the barrier, but this timeout is applied to the wait operations, not to the barrier operation.
  • When waiting, the wait timeout replaces the barrier timeout, even if the wait timeout is longer.
  • The wait is always waiting up to the full selected timeout, even if the barrier was created (and thus barrier operation started) a long time ago.
  • Upon completion of (any) wait, the barrier is marked as having been lifted.

Examples

  1. Start operation with 2 second timeout, after 1 second, wait for it.

New behavior: after 1 second of waiting, control returns to caller. Previous behavior: after 2 seconds of waiting, control returns to caller.

  1. Start operation with 2 second timeout, after 3 seconds, wait for it.

New behavior: control returns instantly. Previous behavior: after 2 seconds of waiting, control returns to caller.

  1. Start operation with 2 second timeout, wait for it with 1 second timeout, after 1 second second thread waits with 1 second timeout.

New behavior: Each waiting thread waits 1 second, then continues. Previous behavior: First thread waits 1 second, second thread continues immediately.

Motivation:

The existing behavior is I think confusing and, although probably does not present issues given the class is used in a very limited way by dd-trace-rb, may cause issues if the usage of Barrier was expanded.

Additional Notes:

This PR is part of the fix for thread leaks which will be in a follow-up PR.

How to test the change?

This PR adds several unit tests which check elapsed time of wait operations. The shortest timeouts are set to 0.25 seconds, which is hopefully sufficient to not produce flakiness in CI while keeping the test runs quick.

The entire barrier_spec.rb takes 7 seconds to execute currently.

p-datadog avatar Feb 16 '24 18:02 p-datadog

Thank you for the review @ivoanjo , I discussed this with @lloeki and this PR definitely needs more work.

At a minimum, my assumption of when the "work" process starts was incorrect; to continue in the spirit of this PR I would need to add a method that would start the "timeout from work start". Construction time isn't at all correct to take for the moment when work starts.

Second issue that @lloeki raised was that in the existing implementation, waiting threads wouldn't all be unblocked together, but in the implementation proposed in this PR they would be (as soon as the work timeout expires). Unblocking all threads at the same time was not an appealing idea.

And generally, whether the overall approach taken by this PR was the desired one (i.e. whether timeouts should count from when the work started or from when the waits started), is up for debate.

I believe this PR is orthogonal to the thread leak issue. I opened it because I thought it would make the code easier to reason about but given the feedback perhaps the barrier can be left as is for now and I'll be focusing on the start/stop calls right now.

p-datadog avatar Feb 22 '24 17:02 p-datadog