
Recover Metrics Sum HashMap from Mutex poisoning during lock.

Open lalitb opened this issue 1 year ago • 0 comments

In this failure path, does the measurement just get dropped? The only reason a mutex lock would fail is that the mutex was poisoned. From the documentation on mutex poisoning:

Once a mutex is poisoned, all other threads are unable to access the data by default as it is likely tainted (some invariant is not being upheld).

We should probably account for this error state and try to recover from the failure. A mutex is usually poisoned because another thread panicked while holding the lock. Since the underlying data this mutex protects is a map of measurements, I don't think any external invariants would be affected if we recovered the guard from the poisoned mutex. This can be done using the into_inner method on the PoisonError type. That gives us access to the underlying data so we can still update the measurement, but the mutex stays poisoned, so we would still need to re-instantiate the bucket with a new mutex.
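As a rough illustration of the recovery described above, here is a minimal sketch using only the standard library. The `Bucket` alias and the `record` and `poison` helpers are hypothetical names made up for this example, not the actual opentelemetry-rust types, and the map's key and value types are simplified assumptions.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, PoisonError};
use std::thread;

/// Hypothetical stand-in for one Sum bucket: attribute key -> running total.
type Bucket = Mutex<HashMap<String, u64>>;

/// Adds `value` to the entry for `key`, even if the mutex is poisoned.
fn record(bucket: &Bucket, key: &str, value: u64) {
    // `lock` only returns Err when the mutex is poisoned. Calling
    // `into_inner` on the PoisonError hands back the guard, so the
    // measurement is recorded instead of being dropped.
    let mut guard = bucket.lock().unwrap_or_else(PoisonError::into_inner);
    *guard.entry(key.to_string()).or_insert(0) += value;
}

/// Poisons the bucket by panicking on another thread while it holds the lock.
/// Only here to reproduce the failure for the example.
fn poison(bucket: &Arc<Bucket>) {
    let b = Arc::clone(bucket);
    let _ = thread::spawn(move || {
        let _guard = b.lock().unwrap();
        panic!("simulated panic while holding the lock");
    })
    .join();
}
```

Note that this alone does not clear the poisoned state: `is_poisoned()` keeps returning true, and every later `lock` call goes through the error path again.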

Since into_inner gives us access to the inner hashmap, we can probably mem::replace the specific bucket with a fresh mutex with the existing data in the error path.
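A sketch of what that replacement could look like, again with the standard library only. The `Bucket` alias and the `heal` function are hypothetical, and the sketch assumes the error path can get exclusive (`&mut`) access to the bucket slot, which may not hold for how the buckets are actually shared in the SDK.

```rust
use std::collections::HashMap;
use std::mem;
use std::sync::{Mutex, PoisonError};

/// Hypothetical stand-in for one Sum bucket: attribute key -> running total.
type Bucket = Mutex<HashMap<String, u64>>;

/// Replaces a poisoned bucket with a fresh mutex holding the same data.
/// Assumes the caller has exclusive access to the slot.
fn heal(slot: &mut Bucket) {
    if slot.is_poisoned() {
        // Swap an empty mutex into the slot so the old one can be consumed.
        let old = mem::replace(slot, Mutex::new(HashMap::new()));
        // `Mutex::into_inner` returns Err for a poisoned mutex; the
        // PoisonError still carries the map, so nothing is lost.
        let data = old.into_inner().unwrap_or_else(PoisonError::into_inner);
        // Put the recovered map behind a new, unpoisoned mutex.
        *slot = Mutex::new(data);
    }
}
```

After `heal` runs, later `lock` calls on the slot succeed normally, so the extra cost is only paid once per poisoning rather than on every measurement.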

This would definitely be a performance hit, but I think it's worth it because of the data integrity it affords.

Originally posted by @bIgBV in https://github.com/open-telemetry/opentelemetry-rust/pull/1564#discussion_r1509814519

lalitb avatar Mar 02 '24 08:03 lalitb