azure-sdk-for-java
[FEATURE REQ] Expose metrics from AMQP SDKs
More context in #25604 #25603
In this issue, we'd need to:
- Define a few generally useful metrics for AMQP libraries, e.g.
  - number of events sent/received/processed/checkpointed
  - connection/link event counters
  - offset lag for received/checkpointed events
  - dimensions for all metrics (e.g. partitionId)
- Implement them in the Event Hubs and/or Service Bus SDKs.
- Document them and rely on them in troubleshooting guides (TSG).
This gives customers signals for investigating configuration and transient network issues, and gives us more context on potential SDK issues.
cc: @stliu , @saragluna .
FYI: maybe we can support this in our layer once the SDK supports this feature.
@lmolkova Would you mind describing how offset lag for received/checkpointed events would be computed?
A few months ago, I filed https://github.com/Azure/azure-sdk-for-java/issues/19391 and ended up implementing some code to track the lag, in offset and in seconds, of received events. Basically, it's something like `lastEnqueuedProperties.getOffset().doubleValue() - eventContext.getEventData().getOffset().doubleValue()`, which is likely similar to the "received" part of this issue.
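To make the computation above concrete, here is a minimal sketch of the two lag values (offset and time). `ReceiveLag` and its method names are hypothetical, not SDK API. The inputs are assumed to come from the last-enqueued event properties and the received event's own offset and enqueued time, and clamping negative values to zero is my own assumption to guard against a stale last-enqueued snapshot.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not part of the Azure SDK.
final class ReceiveLag {
    private ReceiveLag() { }

    // Offset distance between the newest event known in the partition
    // and the event that was just received. Clamped at zero (assumption).
    static long offsetLag(long lastEnqueuedOffset, long receivedOffset) {
        return Math.max(0L, lastEnqueuedOffset - receivedOffset);
    }

    // Time distance between the enqueued time of the newest event in the
    // partition and the enqueued time of the received event. Clamped at zero.
    static Duration timeLag(Instant lastEnqueuedTime, Instant receivedEnqueuedTime) {
        Duration lag = Duration.between(receivedEnqueuedTime, lastEnqueuedTime);
        return lag.isNegative() ? Duration.ZERO : lag;
    }
}
```

Both values would naturally be reported per partition, so partitionId is the obvious dimension for them.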
This works quite well for some classes of problems. However, it doesn't cover the case where, for some reason, the application stops consuming messages from a partition. Would the "checkpointed events" part solve this shortcoming? Or does it boil down to updating a metric when the checkpoint of a specific partition is updated by its lease owner? The latter option would share the same limitation as my current implementation: not being able to detect that consumption has (partially) stopped for whatever reason.
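To illustrate the gap described above: a metric that is only updated when an event is received or checkpointed goes silent when consumption stops, so detecting a stall needs a separate check that runs on its own schedule. Below is a rough sketch of such a check. `PartitionStallDetector` is a hypothetical class, not something the SDK provides, and the threshold is an arbitrary assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical watchdog for illustration only; not part of the Azure SDK.
final class PartitionStallDetector {
    private final Map<String, Instant> lastReceived = new ConcurrentHashMap<>();
    private final Duration threshold;

    PartitionStallDetector(Duration threshold) {
        this.threshold = threshold;
    }

    // Call from the event handler each time an event arrives for a partition.
    void onEventReceived(String partitionId, Instant now) {
        lastReceived.put(partitionId, now);
    }

    // Call periodically (e.g. from a scheduled task), independently of receives.
    // Returns the sorted ids of partitions whose last receive is older than the threshold.
    List<String> stalledPartitions(Instant now) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastReceived.entrySet()) {
            if (Duration.between(entry.getValue(), now).compareTo(threshold) > 0) {
                stalled.add(entry.getKey());
            }
        }
        Collections.sort(stalled);
        return stalled;
    }
}
```

This sketch only knows about partitions it has seen at least once, and a partition that is simply idle (no new events enqueued) looks the same as a stalled one, so it would have to be combined with service-side information such as the last enqueued offset to be reliable.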
Done in #31024, #31283, #30583.
Metrics specs: https://gist.github.com/lmolkova/489a2b280b8fa68e4c3780c2afaa3b39
Documentation is tracked in #30562