envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Fix SIGABRT when destroying OdCDS handles on worker threads

Open wdauchy opened this issue 1 month ago • 4 comments

Commit Message: The OdCdsApiHandleImpl destructor can be invoked on worker threads when filters are removed (e.g., during VHDS filter updates). When the last reference to an OdCdsApiSharedPtr is destroyed on a worker thread, the subscription cleanup code attempts to run on the wrong thread, causing an assertion failure or SIGABRT since subscription operations must execute on the main thread.

This fix adds a destructor to OdCdsApiHandleImpl that checks if destruction is happening on a worker thread. If so, it dispatches the OdCdsApiSharedPtr destruction to the main thread dispatcher via a lambda, ensuring the subscription is properly cleaned up on the thread where it was created.

The fix follows the same pattern used elsewhere in Envoy (e.g., ThreadLocal::SlotImpl destructor) for cross-thread cleanup operations.

Additional Description:

Note: A longer-term optimization could cache and reuse OdCDS handles in ClusterManager with reference counting and deferred cleanup, avoiding unnecessary subscription teardown/recreation during filter rotations. This is tracked by the existing TODO in allocateOdCdsApi().

Risk Level: Testing: Docs Changes: Release Notes: Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional API Considerations:]

wdauchy avatar Nov 25 '25 16:11 wdauchy

As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/42253 was opened by wdauchy.

see: more, trace.

@adisuissa as our xDS expert

botengyao avatar Nov 26 '25 15:11 botengyao

@adisuissa i think this is awaiting your review

phlax avatar Dec 09 '25 17:12 phlax

@adisuissa i think this is awaiting your review

we discussed some aspects offline. The last approach is fixing the sigabort (no longer accessing subscription from a worker) but it now keeps the object until envoy is restarted. I had in mind we could merge this fix and then followup with an implementation to add refcount on the object so it can be cleaned up along the way. I will see what Adi think about it.

wdauchy avatar Dec 09 '25 19:12 wdauchy