feat(metrics): Add fork-safety to SynchronousMeasurementConsumer
Description
Implement post-fork reinitialization of threading locks in the metrics measurement consumer to prevent deadlocks and data duplication in forked child processes.
This change adds fork-safety mechanisms to SynchronousMeasurementConsumer by:
- Registering fork callbacks using `os.register_at_fork()` to detect process forks
- Reinitializing threading locks in child processes after fork
- Implementing lazy storage reinitialization to prevent data duplication
- Clearing stale async instrument references
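The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ForkSafeConsumer` and `_reinit_after_fork` are hypothetical names, and the sketch replaces the lock with a fresh `threading.Lock()` instead of resetting the existing one in place.

```python
import os
import threading


class ForkSafeConsumer:
    """Hypothetical stand-in for SynchronousMeasurementConsumer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._async_instruments = []
        self._needs_storage_reinit = False
        # os.register_at_fork is only available on platforms with fork().
        if hasattr(os, "register_at_fork"):
            os.register_at_fork(after_in_child=self._reinit_after_fork)

    def _reinit_after_fork(self):
        # Runs in the child right after fork(). The inherited lock may be
        # held by a thread that does not exist in the child, so swap it
        # for a fresh, unlocked one (assumption: no in-place reset here).
        self._lock = threading.Lock()
        # Drop references that were only meaningful in the parent.
        self._async_instruments.clear()
        # Defer the expensive storage cleanup until first use.
        self._needs_storage_reinit = True
```

One design caveat of this sketch: registering a bound method keeps the instance alive for the lifetime of the process, since fork callbacks cannot be unregistered.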
This addresses the deadlock issue reported in Flask/Gunicorn applications with gevent workers where threads get stuck trying to acquire locks that were held during fork, causing request timeouts and memory leaks.
Closes #4345
Type of change
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update
How Has This Been Tested?
The fork-safety implementation has been tested with:
- [x] ProcessPoolExecutor integration: Tested with concurrent.futures.ProcessPoolExecutor to ensure no deadlocks
- [x] Backward compatibility: Ensured single-process applications remain unaffected
Does This PR Require a Contrib Repo Change?
- [ ] Yes. - Link to PR:
- [x] No.
Checklist:
- [x] Followed the style guidelines of this project
- [ ] Changelogs have been updated
- [ ] Unit tests have been added
- [ ] Documentation has been updated
Technical Implementation Details
Root Cause Analysis: The deadlock occurs because a forked child process inherits a copy of the parent's memory, including any locks that were held at fork time, but only the thread that called fork() survives in the child. A lock held by another thread at that moment therefore stays locked in the child with no owner left to release it. In gevent environments, this caused threads to wait indefinitely on such locks, as described in the stack trace from issue #4345.
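The root cause can be reproduced in isolation with a small POSIX-only sketch. The function and variable names are illustrative and not taken from the SDK. A background thread holds a lock while the main thread forks, and the child then tries to acquire its inherited copy with a timeout so the demo cannot hang. Recent Python versions may emit a `DeprecationWarning` when forking a multi-threaded process.

```python
import os
import threading

lock = threading.Lock()


def hold_lock(ready, release):
    # Keep the lock held until the parent tells us to let go.
    with lock:
        ready.set()
        release.wait()


def child_can_acquire_inherited_lock(timeout=0.2):
    ready = threading.Event()
    release = threading.Event()
    holder = threading.Thread(
        target=hold_lock, args=(ready, release), daemon=True
    )
    holder.start()
    ready.wait()  # the lock is now held by the other thread

    pid = os.fork()
    if pid == 0:
        # Child: the holder thread was not copied, but the lock's
        # "locked" state was, so nothing here can ever release it.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    # Parent: collect the child's verdict, then clean up the holder.
    _, status = os.waitpid(pid, 0)
    release.set()
    holder.join()
    return os.WEXITSTATUS(status) == 0
```

Under these assumptions the child times out instead of acquiring the lock, which is the same condition that blocks forever when no timeout is used.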
Solution Approach:
- Fork Detection: Uses `os.register_at_fork(after_in_child=...)` to register cleanup callbacks
- Lock Reinitialization: Calls `_at_fork_reinit()` on threading.Lock instances to reset their state
- Lazy Storage Cleanup: Implements `_needs_storage_reinit` flag to defer expensive operations until first use
- Data Integrity: Clears `_instrument_view_instrument_matches` cache to prevent duplicate metrics
- Async Cleanup: Resets async instruments list to avoid stale references
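The lazy cleanup idea might look something like the sketch below. The attribute names `_needs_storage_reinit` and `_instrument_view_instrument_matches` come from this description. The class `LazyStorage`, its methods, and the dict-of-lists storage layout are assumptions made only for illustration.

```python
import threading


class LazyStorage:
    """Hypothetical storage that defers post-fork cleanup to first use."""

    def __init__(self):
        self._lock = threading.Lock()
        self._needs_storage_reinit = False
        self._instrument_view_instrument_matches = {}

    def mark_stale(self):
        # Cheap enough to run inside an after-fork callback.
        self._needs_storage_reinit = True

    def record(self, instrument, value):
        with self._lock:
            if self._needs_storage_reinit:
                # Discard data inherited from the parent so the child
                # does not export the same measurements again.
                self._instrument_view_instrument_matches.clear()
                self._needs_storage_reinit = False
            matches = self._instrument_view_instrument_matches
            matches.setdefault(instrument, []).append(value)

    def snapshot(self):
        with self._lock:
            return {
                name: list(values)
                for name, values in
                self._instrument_view_instrument_matches.items()
            }
```

Setting a flag in the fork callback keeps the callback itself trivial. The cost of clearing the cache is paid only by children that actually record a measurement.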
Performance Considerations:
- Only registers fork handler if `os.register_at_fork` exists (Python 3.7+)
- Uses lazy reinitialization to minimize fork overhead
- Gracefully handles exceptions during reinitialization
- Zero impact on single-process applications
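A guarded registration helper along these lines would cover the first and third points. `register_after_fork` is a hypothetical name. The PR may structure this differently.

```python
import logging
import os

_logger = logging.getLogger(__name__)


def register_after_fork(callback):
    """Register callback to run in forked children, if supported.

    Returns the wrapped callback, or None when the platform has no
    os.register_at_fork (for example on Windows).
    """
    if not hasattr(os, "register_at_fork"):
        return None

    def _safe_callback():
        try:
            callback()
        except Exception:
            # A failed reinitialization should not take the child down.
            _logger.exception("Post-fork reinitialization failed")

    os.register_at_fork(after_in_child=_safe_callback)
    return _safe_callback
```

Processes that never fork pay only for the one-time registration, since the callback is never invoked.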
This fix ensures that OpenTelemetry metrics work reliably in production environments using pre-fork server models, resolving the critical deadlock issue that was causing request timeouts and memory leaks in Flask/Gunicorn deployments.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: dshivashankar1994 / name: dshivashankar (039942bada0afbe1220f2d506c9e233a254adee1)
@dshivashankar1994 Thanks for the PR but you need to sign the CLA in order to contribute to OpenTelemetry.
@xrmx Can you take a look at the PR? I've signed the CLA now.
@aabmass @srikanthccv @ocelotl @codeboten Can you opine on this PR?