Fix OpenTelemetry
Currently, the metric exporter fails on every export, which prevents metrics from showing up in the dashboard and spams the logs on the device.
An upgrade of the opentelemetry library to 0.28 or later might fix this, as that release reworks the blocking/async behavior.
Migration guide: https://github.com/open-telemetry/opentelemetry-rust/blob/main/docs/migration_0.28.md
Example logs:
```
May 05 09:01:56 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 1s.
May 05 09:01:57 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 2s.
May 05 09:01:59 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 4s.
May 05 09:02:03 opendut-pi opendut-edgar[21753]: Failed to acquire lock on the Confidential Client definitively. The following telemetry request will not be transmitted.
May 05 09:02:03 opendut-pi opendut-edgar[21753]: Failed request: Request { metadata: MetadataMap { headers: {"user-agent": "OTel OTLP Exporter Rust/0.27.0"} }, message: (), extensions: Extensions }
May 05 09:02:03 opendut-pi opendut-edgar[21753]: 2025-05-05T07:02:03.090667Z ERROR opentelemetry_sdk: name="PeriodicReader.ExportFailed" Failed to export metrics reason="Metrics error: unexpected status while exporting Status { code: Unauthenticated, message: \"FailedToUpdateToken: Unable to acquire lock on the Confidential Client cause: operation would block.\", source: None }"
May 05 09:02:03 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 1s.
May 05 09:02:03 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 1s.
May 05 09:02:04 opendut-pi opendut-edgar[21753]: Acquired lock on confidential client after <1> retries and <1> seconds.
May 05 09:02:04 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 2s.
May 05 09:02:06 opendut-pi opendut-edgar[21753]: Acquired lock on confidential client after <2> retries and <3> seconds.
May 05 09:02:26 opendut-pi opendut-edgar[21753]: 2025-05-05T07:02:26.090662Z ERROR opentelemetry_sdk: name="PeriodicReader.ExportFailed" Failed to export metrics reason="Metrics error: export timed out"
May 05 09:02:56 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 1s.
May 05 09:02:57 opendut-pi opendut-edgar[21753]: Acquired lock on confidential client after <1> retries and <1> seconds.
May 05 09:03:26 opendut-pi opendut-edgar[21753]: 2025-05-05T07:03:26.093084Z ERROR opentelemetry_sdk: name="PeriodicReader.ExportFailed" Failed to export metrics reason="Metrics error: export timed out"
May 05 09:04:26 opendut-pi opendut-edgar[21753]: 2025-05-05T07:04:26.089264Z ERROR opentelemetry_sdk: name="PeriodicReader.ExportFailed" Failed to export metrics reason="Metrics error: export timed out"
May 05 09:05:26 opendut-pi opendut-edgar[21753]: 2025-05-05T07:05:26.095474Z ERROR opentelemetry_sdk: name="PeriodicReader.ExportFailed" Failed to export metrics reason="Metrics error: export timed out"
May 05 09:06:26 opendut-pi opendut-edgar[21753]: 2025-05-05T07:06:26.094864Z ERROR opentelemetry_sdk: name="PeriodicReader.ExportFailed" Failed to export metrics reason="Metrics error: export timed out"
May 05 09:06:56 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 1s.
May 05 09:06:57 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 2s.
May 05 09:06:59 opendut-pi opendut-edgar[21753]: Failed to acquire lock on confidential client in telemetry request interceptor. Retrying to get access token after 4s.
May 05 09:07:03 opendut-pi opendut-edgar[21753]: Failed to acquire lock on the Confidential Client definitively. The following telemetry request will not be transmitted.
May 05 09:07:03 opendut-pi opendut-edgar[21753]: Failed request: Request { metadata: MetadataMap { headers: {"user-agent": "OTel OTLP Exporter Rust/0.27.0"} }, message: (), extensions: Extensions }
May 05 09:07:03 opendut-pi opendut-edgar[21753]: 2025-05-05T07:07:03.090253Z ERROR opentelemetry_sdk: name="PeriodicReader.ExportFailed" Failed to export metrics reason="Metrics error: unexpected status while exporting Status { code: Unauthenticated, message: \"FailedToUpdateToken: Unable to acquire lock on the Confidential Client cause: operation would block.\", source: None }"
```
The error occurs here: https://github.com/eclipse-opendut/opendut/blob/aac72c96144b11305b108802295a4e14a88d8895/opendut-util/opendut-auth/src/confidential/client.rs#L204, possibly because two clients try to send metrics at the same time.
The root cause is that we mixed async and sync code (to avoid code duplication), combined with the fact that opentelemetry does not currently support an async interceptor. Similar issue here.
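For illustration, the conflict has roughly the following shape. This is a sketch with hypothetical stub types, not the actual opendut code: a tonic interceptor is synchronous, so without a runtime handle the only non-blocking way to reach the async client is `try_lock()`, which fails with "operation would block" whenever another task holds the lock.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;
use tonic::{service::Interceptor, Request, Status};

// Hypothetical stub; the real async client lives in opendut-auth.
struct ConfidentialClient;
impl ConfidentialClient {
    async fn get_token(&self) -> Result<String, Status> {
        Ok("token".into()) // refreshes via the identity provider when expired
    }
}

struct TelemetryInterceptor {
    client: Arc<Mutex<ConfidentialClient>>,
}

impl Interceptor for TelemetryInterceptor {
    // Interceptor::call is synchronous, but get_token() is async. Without a
    // runtime handle, the only non-blocking way in is try_lock(), which
    // fails with "operation would block" whenever another task holds the
    // lock, e.g. a concurrent metric export refreshing the token.
    fn call(&mut self, request: Request<()>) -> Result<Request<()>, Status> {
        let _client = self.client.try_lock()
            .map_err(|cause| Status::unauthenticated(format!("FailedToUpdateToken: {cause}")))?;
        // ...and we would still need to block on _client.get_token() somehow...
        Ok(request)
    }
}
```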
This worked as long as we had the blocking version of the confidential client. Compare revision bf5eb295b81e5082f1e504e86b628198b4f31782, particularly the client.
I think the problem only occurs when the token needs refreshing. It may be particularly bad in this situation due to poor network connectivity.
We could avoid this situation by refreshing the token proactively (see the sketch after this list):
- add a background task to the client that refreshes the token in time (30 seconds before it expires)
- increase the token validity in Keycloak to 30 minutes
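A minimal sketch of the background-refresh idea, assuming a hypothetical `refresh_token` method that reports the new token's validity (the real client API differs):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;
use tokio::time::{sleep, Duration};

// Hypothetical shape; the real ConfidentialClient in opendut-auth differs.
struct ConfidentialClient;
impl ConfidentialClient {
    /// Refreshes the access token and returns its remaining validity.
    async fn refresh_token(&mut self) -> Duration {
        Duration::from_secs(30 * 60) // e.g. 30 minutes, as configured in Keycloak
    }
}

/// Spawns a background task that keeps the token fresh, so that no
/// telemetry request ever has to refresh it on the hot path.
fn spawn_token_refresh(client: Arc<Mutex<ConfidentialClient>>) {
    tokio::spawn(async move {
        loop {
            let validity = {
                // Hold the lock only for the duration of the refresh itself.
                let mut client = client.lock().await;
                client.refresh_token().await
            };
            // Wake up again 30 seconds before the token expires.
            sleep(validity.saturating_sub(Duration::from_secs(30))).await;
        }
    });
}
```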
The original code mixed async and sync:
https://github.com/eclipse-opendut/opendut/blob/a9ec115ac7487fc7c8e4e0072e9d5c02dea4bc34/opendut-util/opendut-auth/src/confidential/client.rs#L218-L231
This caused a deadlock in some situations when the identity provider did not respond fast enough.
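The problematic pattern looked roughly like the following sketch (stub types; not the verbatim code from the linked revision):

```rust
use std::sync::Arc;
use tokio::runtime::Handle;
use tokio::sync::Mutex;
use tonic::Status;

// Stub standing in for the real async client in opendut-auth.
struct ConfidentialClient;
impl ConfidentialClient {
    async fn get_token(&self) -> Result<String, Status> {
        Ok("token".into())
    }
}

// Rough shape of the old sync-over-async bridge; illustrative only.
fn get_token_blocking(client: Arc<Mutex<ConfidentialClient>>) -> Result<String, Status> {
    // block_in_place keeps the scheduler responsive, but the inner block_on
    // still parks this worker thread until the identity provider answers.
    // If another export on the same runtime holds the mutex while awaiting
    // a slow identity provider, callers pile up here and the export stalls.
    tokio::task::block_in_place(|| {
        Handle::current().block_on(async {
            let client = client.lock().await;
            client.get_token().await
        })
    })
}
```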
## Solution
- Update opentelemetry from 0.27 -> 0.30.
- Use the fact that the async runtime requirement for opentelemetry was lifted:
  > These components have been re-written to no longer require an async runtime. Instead they operate by spawning dedicated background thread, and making blocking calls from the same.
- Provide a tokio runtime handle to the interceptor:
  ```rust
  let handle = Handle::current();
  let confidential_client = ConfClientArcMutex { mutex: Arc::new(Mutex::new(confidential_client)), handle };
  ```
- Lift the requirement to create a runtime within an existing runtime and instead use a handle to the runtime on which the client was created:
  ```rust
  let token = self.handle.block_on(async move { confidential_client.get_token().await });
  ```
- Remove `tokio::task::block_in_place`, since it is no longer applicable with the current opentelemetry version.
- Move the `confidential_client.get_token()` call into the operation retried by backon, so that the mutex lock is released as soon as possible.
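Put together, the new interceptor ends up roughly like the sketch below. This is illustrative rather than the verbatim opendut code: the `ConfidentialClient` stub, the authorization-header handling, and the backon 1.x blocking API are assumptions.

```rust
use std::sync::Arc;

use backon::{BlockingRetryable, ExponentialBuilder};
use tokio::runtime::Handle;
use tokio::sync::Mutex;
use tonic::{metadata::MetadataValue, service::Interceptor, Request, Status};

// Stub standing in for the real async client from opendut-auth.
struct ConfidentialClient;
impl ConfidentialClient {
    async fn get_token(&self) -> Result<String, Status> {
        Ok("token".into())
    }
}

struct ConfClientArcMutex {
    mutex: Arc<Mutex<ConfidentialClient>>,
    // Handle to the runtime on which the client was created.
    handle: Handle,
}

impl Interceptor for ConfClientArcMutex {
    fn call(&mut self, mut request: Request<()>) -> Result<Request<()>, Status> {
        // With opentelemetry >= 0.28 this runs on the exporter's dedicated
        // background thread, so block_on cannot starve a runtime worker.
        // Each retry attempt locks, fetches, and releases the mutex, so the
        // lock is never held across a backoff sleep.
        let token = (|| {
            self.handle.block_on(async {
                let client = self.mutex.lock().await;
                client.get_token().await
            })
        })
        .retry(ExponentialBuilder::default()) // backon 1.x blocking API
        .call()?;

        let header: MetadataValue<_> = format!("Bearer {token}")
            .parse()
            .map_err(|_| Status::internal("invalid token header"))?;
        request.metadata_mut().insert("authorization", header);
        Ok(request)
    }
}
```

Blocking on the stored handle is only safe because, from opentelemetry 0.28 on, the exporter calls the interceptor from its dedicated background thread rather than from a runtime worker; the same `block_on` issued from inside the runtime would panic or stall.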