opentelemetry-python

feat(metrics): Add fork-safety to SynchronousMeasurementConsumer

Open dshivashankar1994 opened this issue 2 months ago • 4 comments

Description

Implement post-fork reinitialization of threading locks in the metrics measurement consumer to prevent deadlocks and data duplication in forked child processes.

This change adds fork-safety mechanisms to SynchronousMeasurementConsumer by:

  • Registering fork callbacks using os.register_at_fork() to detect process forks
  • Reinitializing threading locks in child processes after fork
  • Implementing lazy storage reinitialization to prevent data duplication
  • Clearing stale async instrument references

This addresses the deadlock reported in Flask/Gunicorn applications with gevent workers, where threads get stuck trying to acquire locks that were held at fork time, causing request timeouts and memory leaks.

Closes #4345

Type of change

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

How Has This Been Tested?

The fork-safety implementation has been tested with:

  • [x] ProcessPoolExecutor integration: Tested with concurrent.futures.ProcessPoolExecutor to ensure no deadlocks
  • [x] Backward compatibility: Ensured single-process applications remain unaffected
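A fork-based check along the same lines can be sketched without ProcessPoolExecutor. This is an illustration only, not the test used for this PR: LockHolder and child_can_acquire_after_fork are hypothetical names, the lock is replaced with a fresh threading.Lock() in the child, and the code assumes a platform where os.fork and os.register_at_fork are available.

```python
import os
import threading
import warnings


class LockHolder:
    """Hypothetical object whose lock is replaced in forked children."""

    def __init__(self):
        self.lock = threading.Lock()
        os.register_at_fork(after_in_child=self._reinit)

    def _reinit(self):
        # Runs in the child only: drop the inherited lock, start with a new one.
        self.lock = threading.Lock()


def child_can_acquire_after_fork(holder, timeout=2.0):
    """Fork while another thread holds holder.lock; True if the child can take it."""
    acquired = threading.Event()
    release = threading.Event()

    def hold():
        with holder.lock:
            acquired.set()
            release.wait()

    thread = threading.Thread(target=hold, daemon=True)
    thread.start()
    acquired.wait()

    with warnings.catch_warnings():
        # Newer Python versions warn when forking a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: the fork callback has already run, so this should not block.
        ok = holder.lock.acquire(timeout=timeout)
        os._exit(0 if ok else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

The child reports its result through its exit code, which keeps the check independent of pickling and of any executor machinery.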

Does This PR Require a Contrib Repo Change?

  • [ ] Yes. - Link to PR:
  • [x] No.

Checklist:

  • [x] Followed the style guidelines of this project
  • [ ] Changelogs have been updated
  • [ ] Unit tests have been added
  • [ ] Documentation has been updated

Technical Implementation Details

Root Cause Analysis: The deadlock occurred because a forked child process inherits a copy of the parent's memory, including any locks that were held at fork time, but not the threads that held them. In gevent environments, this caused threads in the child to wait indefinitely for locks that would never be released, as shown in the stack trace from issue #4345.
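The failure mode can be reproduced in isolation. The following is a minimal sketch, not code from the SDK: lock_held_in_child is a hypothetical helper, and it assumes a platform where os.fork is available. A helper thread holds a lock while the main thread forks; the child then finds the lock still locked even though the holding thread does not exist there.

```python
import os
import threading
import warnings


def lock_held_in_child() -> bool:
    """Return True if a lock held by another thread at fork stays locked in the child."""
    lock = threading.Lock()
    holding = threading.Event()
    done = threading.Event()

    def holder():
        with lock:
            holding.set()
            done.wait()

    thread = threading.Thread(target=holder, daemon=True)
    thread.start()
    holding.wait()  # make sure the lock is held before forking

    with warnings.catch_warnings():
        # Newer Python versions warn when forking a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: only the forking thread survives, but the lock state was copied.
        still_locked = not lock.acquire(blocking=False)
        os._exit(1 if still_locked else 0)

    _, status = os.waitpid(pid, 0)
    done.set()
    thread.join()
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 1
```

A blocking acquire() in the child would hang at this point, which matches the symptom described above.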

Solution Approach:

  1. Fork Detection: Uses os.register_at_fork(after_in_child=...) to register cleanup callbacks
  2. Lock Reinitialization: Calls _at_fork_reinit() on threading.Lock instances to reset their state (note that this is a private CPython method, added in Python 3.9)
  3. Lazy Storage Cleanup: Implements _needs_storage_reinit flag to defer expensive operations until first use
  4. Data Integrity: Clears _instrument_view_instrument_matches cache to prevent duplicate metrics
  5. Async Cleanup: Resets async instruments list to avoid stale references
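The steps above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: ForkSafeConsumer, _storage, _async_instruments, and consume() are hypothetical stand-ins for the real SynchronousMeasurementConsumer internals, and the sketch swaps in a fresh threading.Lock() instead of calling the private _at_fork_reinit().

```python
import os
import threading


class ForkSafeConsumer:
    """Hypothetical stand-in for SynchronousMeasurementConsumer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._storage = {}  # stands in for the view/instrument match cache
        self._async_instruments = []
        self._needs_storage_reinit = False
        # os.register_at_fork is absent on some platforms (e.g. Windows).
        if hasattr(os, "register_at_fork"):
            os.register_at_fork(after_in_child=self._after_fork_in_child)

    def _after_fork_in_child(self):
        # Runs only in the child. The inherited lock may be stuck in the
        # locked state, so swap in a fresh one instead of releasing it.
        self._lock = threading.Lock()
        self._async_instruments = []
        # Defer clearing storage until the child first records something.
        self._needs_storage_reinit = True

    def consume(self, name, value):
        with self._lock:
            if self._needs_storage_reinit:
                # First use after fork: drop data inherited from the parent.
                self._storage.clear()
                self._needs_storage_reinit = False
            self._storage[name] = self._storage.get(name, 0) + value
```

Registering a bound method keeps the instance alive for the lifetime of the process; a weak reference to the method would avoid that, at the cost of a little extra code.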

Performance Considerations:

  • Only registers fork handler if os.register_at_fork exists (Python 3.7+)
  • Uses lazy reinitialization to minimize fork overhead
  • Gracefully handles exceptions during reinitialization
  • No behavior change for applications that never fork, since the reinitialization callback runs only in forked child processes

This fix is intended to make OpenTelemetry metrics work reliably in production environments that use pre-fork server models, resolving the deadlock reported in #4345 that caused request timeouts and memory leaks in Flask/Gunicorn deployments.

dshivashankar1994, Oct 09 '25 12:10

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: dshivashankar1994 / name: dshivashankar (039942bada0afbe1220f2d506c9e233a254adee1)

@dshivashankar1994 Thanks for the PR but you need to sign the CLA in order to contribute to OpenTelemetry.

xrmx, Oct 09 '25 12:10

@xrmx Can you take a look at the PR? I've signed the CLA now.

dshivashankar1994, Nov 04 '25 16:11

@aabmass @srikanthccv @ocelotl @codeboten Can you opine on this PR?

dshivashankar1994, Nov 21 '25 11:11