FluidFramework
Azure client telemetry
Adding some basic azure-client perf telemetry. When azure-client users look at FF telemetry (https://fluidframework.com/docs/testing/telemetry/), they should be able to quickly understand and filter telemetry around the concepts and API surface they are familiar with: azure-client, fluid container, audience, etc. With this PR we are adding azure-client telemetry under the "AzureClient" namespace. No explicit opt-in is required.

Follow-up items:
- Documenting useful telemetry (error types, etc.) and linking it to specific actions.
- FluidContainer telemetry (we also need to surface errors on IFluidContainer).

We still need to explore how to make telemetry more effective and useful for partners, service developers, etc. For example, service partners may want to run stress/perf scenarios and look at our telemetry for service-level indicators. For now, we can run those "usefulness" explorations through custom loggers that know how to make sense of existing telemetry and deliver relevant, friendly data points. After that round of research, we will have more clarity on what feature work is needed for further telemetry improvements.
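As a rough illustration of the "custom loggers that make sense of existing telemetry" idea, here is a minimal sketch of a logger that forwards only events in the "AzureClient" namespace. The interface shapes below mirror the general contract of Fluid's base telemetry logger but are declared locally for illustration; the `AzureClient:` event-name prefix and the `AzureClientFilterLogger` class are hypothetical, not actual Fluid Framework identifiers.

```typescript
// Locally declared stand-ins for the telemetry contract (illustrative only).
interface TelemetryBaseEvent {
  category: string;   // e.g. "generic", "performance", "error"
  eventName: string;  // namespaced, e.g. "AzureClient:ContainerAttach" (assumed format)
  [key: string]: string | number | boolean | undefined;
}

interface TelemetryBaseLogger {
  send(event: TelemetryBaseEvent): void;
}

// Captures only AzureClient-namespaced events, so a partner logger can
// surface just the API-level telemetry this PR adds and ignore the rest.
class AzureClientFilterLogger implements TelemetryBaseLogger {
  public readonly captured: TelemetryBaseEvent[] = [];

  public send(event: TelemetryBaseEvent): void {
    if (event.eventName.startsWith("AzureClient:")) {
      this.captured.push(event);
    }
  }
}

const logger = new AzureClientFilterLogger();
logger.send({ category: "performance", eventName: "AzureClient:ContainerAttach", duration: 120 });
logger.send({ category: "generic", eventName: "SomeOtherLayer:Op" });
// logger.captured now holds only the AzureClient event.
```

A real consumer would pass such a logger into the client's configuration rather than calling `send` directly; the point here is just the namespace-based filtering.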
I worry about testability and maintainability here. It has been our stance that telemetry is for diagnostics only, and that there are no guarantees on its stability. I'd worry that if we make guarantees on some telemetry and not others, it will not be clear what's supported and what's not. I think we need a fundamentally different mechanism that emits strong events in some form we test and support, if we want customers to consume them.
To start, I would take an outside-in approach rather than an inside-out approach. By that I mean I would leverage our existing observability points (IContainer events, or the azure-client equivalent) and build something that tracks and logs based on those. This also ensures that the logging is directly correlatable to public APIs, so customers could easily and directly take action based on the data rather than having to figure out an indirect mapping. With the outside-in approach you can still emit logged events, but I would send the logs from this layer under a new category and a new namespace, and not reuse any existing category, to make it clear that these logs are different.
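The outside-in idea above could look roughly like this: subscribe to an existing public observability point and derive logs from it, rather than instrumenting internals. The `ContainerLike` emitter below is a tiny stand-in for a container's event surface, and the `"publicApiUsage"` category, `"AzureClientPublicApi:"` namespace, and `trackConnection` helper are all illustrative assumptions, not real Fluid Framework identifiers.

```typescript
type Listener = () => void;

// Minimal stand-in for a container's public event emitter (illustrative only).
class ContainerLike {
  private readonly listeners = new Map<string, Listener[]>();

  public on(event: string, fn: Listener): void {
    const list = this.listeners.get(event) ?? [];
    list.push(fn);
    this.listeners.set(event, list);
  }

  public emit(event: string): void {
    for (const fn of this.listeners.get(event) ?? []) fn();
  }
}

interface PublicApiLog {
  category: string;
  eventName: string;
  durationMs?: number;
}

// Tracks connection transitions purely from the outside and emits logs under
// a new category/namespace, so each log maps one-to-one onto a public event
// the customer can already observe and act on.
function trackConnection(container: ContainerLike, sink: PublicApiLog[]): void {
  let disconnectedAt: number | undefined;

  container.on("disconnected", () => {
    disconnectedAt = Date.now();
    sink.push({ category: "publicApiUsage", eventName: "AzureClientPublicApi:Disconnected" });
  });

  container.on("connected", () => {
    const durationMs = disconnectedAt === undefined ? undefined : Date.now() - disconnectedAt;
    sink.push({ category: "publicApiUsage", eventName: "AzureClientPublicApi:Connected", durationMs });
  });
}

const container = new ContainerLike();
const logs: PublicApiLog[] = [];
trackConnection(container, logs);
container.emit("disconnected");
container.emit("connected");
```

Because the tracker only consumes public events, it can live in its own layer and be tested against the same event surface customers see.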
Thanks for the feedback here @anthony-murphy. We will pivot this through https://dev.azure.com/fluidframework/internal/_workitems/edit/1827 to capture the broader goals. The overall goal is for us to have a category of logs directly correlatable to public APIs. I'll chat with you about this separately; I know you have a bunch of thoughts and ideas on this topic.