Add goals for an OpenTelemetry Desktop Viewer/Development Tool

Open austinlparker opened this issue 1 year ago • 29 comments

Per the discussion in https://github.com/open-telemetry/community/issues/1515, I have created this OTEP as a way to gather requirements and build agreement towards what a 'desktop viewer' for OpenTelemetry should be.

Jun 06 '23 15:06 austinlparker

There is a prototype that may be close to what this OTEP looks for: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/19634

Jun 12 '23 16:06 tigrannajaryan

This might change the charter and scope of OpenTelemetry (although "tools" can be extended to cover almost everything):

https://opentelemetry.io/

What do we do if later users ask for a query language support in the desktop viewer?

Jun 19 '23 19:06 reyang

This might change the charter and scope of OpenTelemetry (although "tools" can be extended to cover almost everything):

As you point out, 'tools' is deliberately broad by intention. The collector is a foundational component for a wide variety of tooling, such as tail-based sampling. Providing an open source utility for this purpose doesn't invalidate the existing and new commercial/proprietary solutions that exist; Similarly, a diagnostic tool for local query and display of telemetry doesn't prevent ecosystem development of the same. As an example, all databases have some sort of command-line utility (or even a GUI-based one), but that doesn't prevent the development of alternatives.

What do we do if later users ask for a query language support in the desktop viewer?

I think it would depend on exactly what they were asking for. I personally don't see how this utility could be created without some sort of simple query dialect (either SQL or a variant, or basic predicate matching like "attribute.name = 'foo' && latency > 500ms"). Perhaps this dialect could resemble the telemetry transform language in its syntax and semantics?
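To make the "basic predicate matching" option concrete, here is a minimal sketch of such a filter. It is illustrative only and not part of the OTEP: the names `compile_filter` and `parse_value` are hypothetical, records are assumed to be flat dictionaries, durations are assumed to be in milliseconds, and only `&&` conjunctions are handled.

```python
import operator
import re

# Supported comparison operators; "=" means equality, as in the example predicate.
OPS = {
    "=": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
}

# One clause looks like: <field> <op> <value>. Two-character operators come first.
CLAUSE = re.compile(r"^\s*([\w.]+)\s*(>=|<=|!=|=|>|<)\s*(.+?)\s*$")


def parse_value(text):
    """Turn 'foo' into the string foo, 500ms into 500.0, and 42 into 42.0."""
    if len(text) >= 2 and text[0] == text[-1] == "'":
        return text[1:-1]
    if text.endswith("ms"):
        return float(text[:-2])
    return float(text)


def compile_filter(expr):
    """Compile an '&&'-joined predicate into a function over flat dict records.

    Limitation of this sketch: a quoted value containing '&&' is not supported.
    """
    clauses = []
    for part in expr.split("&&"):
        match = CLAUSE.match(part)
        if match is None:
            raise ValueError(f"cannot parse clause: {part!r}")
        field, op, raw = match.groups()
        clauses.append((field, OPS[op], parse_value(raw)))

    def matches(record):
        # A record that lacks the field never matches that clause.
        return all(
            field in record and op(record[field], value)
            for field, op, value in clauses
        )

    return matches


slow_foo = compile_filter("attribute.name = 'foo' && latency > 500ms")
```

Calling `slow_foo({"attribute.name": "foo", "latency": 750.0})` returns `True`, while a record with a latency of 100 or a different name is rejected. Anything beyond this (grouping, `||`, aggregation) is where a real query language would start.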

That said, the goal of this OTEP is to narrowly scope the use case for this viewing tool, which also necessarily scopes the query functionality of it to what is needed to fulfill the purpose of the OTEP.

Jun 20 '23 02:06 austinlparker

I'm really sceptical about this topic. Maybe I don't exactly see what you want to achieve, because OpenTelemetry is a set of concepts. I could imagine your tool as a kind of ephemeral database to explore fresh data received during the development phase? But performance and long-term storage should not be a priority. For a generic tool, maybe the very new websocket processor in otel collector contrib could help. That could avoid having to implement an OTLP receiver layer in the application.

But creating a "basic" logs/traces/metrics explorer seems very challenging... As for the language, OTTL seems a good candidate to extract data and experiment with transformations.

Jun 21 '23 21:06 gillg

Following up on this issue, since it became relevant in a discussion for docs (see https://github.com/open-telemetry/opentelemetry.io/issues/3266 & https://github.com/open-telemetry/opentelemetry.io/pull/3144#discussion_r1323498276): I want to emphasize that such a tool would be greatly beneficial for the quality of our documentation and how people can get started with OpenTelemetry:

We want to promote the use of OTLP throughout our documentation, but right now we can only suggest the visualization of traces easily, as Jaeger is the only pure-OSS tool available that supports OTLP directly. So for logs & metrics we are stuck with either dumping them to stdout or to translate to prometheus, etc to have them visualized
Feedback I get a lot from end-users is that if they do not have an observability backend at hand, setting one up is rather painful if ALL you want is to look at a handful of logs, metrics & traces for your "let me try out otel" experience
Having an integration with the collector would help us to create a round story in the documentation: Set up otel in your app with console exporters, switch to otlp exporters, send data to the collector, look at telemetry with OTel desktop viewer, update collector config to send telemetry to your backend of choice, everyone is happy!

Sep 13 '23 14:09 svrnm

We want to promote the use of OTLP throughout our documentation, but right now we can only suggest the visualization of traces easily, as Jaeger is the only pure-OSS tool available that supports OTLP directly. So for logs & metrics we are stuck with either dumping them to stdout or to translate to prometheus, etc to have them visualized

so here's a radical idea - why not focus the effort that would be needed to develop these extra capabilities by building them within Jaeger, by extending its scope from "just traces"?

Sep 13 '23 16:09 yurishkuro

We want to promote the use of OTLP throughout our documentation, but right now we can only suggest the visualization of traces easily, as Jaeger is the only pure-OSS tool available that supports OTLP directly. So for logs & metrics we are stuck with either dumping them to stdout or to translate to prometheus, etc to have them visualized

so here's a radical idea - why not focus the effort that would be needed to develop these extra capabilities by building them within Jaeger, by extending its scope from "just traces"?

It's more obvious than radical, looks like I didn't see the forest for the trees ... for the docs use case I outlined above this would definitely be a solution we can work with 👍

Sep 13 '23 18:09 svrnm

I started building a tool very similar to what is being talked about here, to aid .NET developers who are adopting OTEL with diagnosing what is happening. I was pointed to this thread by @reyang, and would be happy to contribute based on what I found during this investigation.

The tool I created uses the OTEL proto files to create an endpoint, listen to the data being sent, and provide basic visualizations. I happened to write this using Blazor as that was simplest for me, but it could be done using almost any stack.

The ideal usage experience is as an exe and/or container that can be run locally, and uses a browser as the UI. It provides a default OTLP endpoint on https://localhost:5317 that you can configure your app to point to via the standard env variables. Ideally it's one app - it should not require installing a web server, database etc. It shouldn't need a database; it can keep a buffer in memory of the most recent telemetry it's captured, and dispose of older data as needed.
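The "buffer in memory of the most recent telemetry" can be sketched with a bounded queue. This is an illustration rather than the tool's actual code: the class name `RecentTelemetry` and the capacity value are assumptions made for the example.

```python
from collections import deque


class RecentTelemetry:
    """Keeps only the newest `capacity` records; older ones are discarded."""

    def __init__(self, capacity=1000):  # capacity is an arbitrary example value
        # A deque with maxlen silently drops the oldest item when full.
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # how many records have been evicted so far

    def add(self, record):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(record)

    def latest(self, n=None):
        """Return the newest n records in arrival order (all of them if n is None)."""
        items = list(self._items)
        if n is None:
            return items
        return items[max(len(items) - n, 0):]


buffer = RecentTelemetry(capacity=3)
for i in range(5):
    buffer.add({"seq": i})
```

After the loop the buffer holds only records 2, 3 and 4, and `buffer.dropped` is 2. On the application side, the standard variable for pointing an SDK at such an endpoint is `OTEL_EXPORTER_OTLP_ENDPOINT`.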

Somebody I showed it to described it as WireShark for OTLP which I think is an apt description.

Here are a couple of screenshots of the UI - it's ugly as I am not a designer - but I hope it gets the point across:

Logging

[Screenshots: logs, logs2]

This shows the log messages that have been collected via OTLP. It has basic filtering capability. Clicking an entry will show all the parameters. (I originally showed them all in the table, but the columns got out of hand.)

Metrics

[Screenshot: metrics]

This shows the metrics that have been sent to the OTLP endpoint. You don't need to fish for them; they are shown in a list. When you select a metric, you see the dimension combinations that have been emitted, a list of recent values, and a basic graph. Nobody should confuse this with a dashboard system, but if you want to know what platform metrics you are getting, and/or verify your own metrics, it's ideal.

[Screenshot: metrics2] This second view is for a metric with multiple dimension combinations, so there is a graph for each combination.

[Screenshot: metrics3] This third view is for a histogram metric, again with multiple dimension combinations. Histograms are shown with a bar chart based on the buckets.
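As a rough illustration of "a bar chart based on the buckets", the sketch below renders one histogram data point as text bars. It assumes the OTLP explicit-bucket layout, where there is one more bucket count than there are bounds; the function names and the sample numbers are made up for the example and are not taken from the tool.

```python
def bucket_labels(explicit_bounds):
    """Label each bucket as (lower, upper]; the last bucket is open-ended."""
    labels = []
    lower = "-inf"
    for bound in explicit_bounds:
        labels.append(f"({lower}, {bound}]")
        lower = bound
    labels.append(f"({lower}, +inf)")
    return labels


def render_bars(explicit_bounds, bucket_counts, width=30):
    """Return one text line per bucket, scaled so the fullest bucket is `width` wide."""
    if len(bucket_counts) != len(explicit_bounds) + 1:
        raise ValueError("expected exactly one more bucket count than bounds")
    labels = bucket_labels(explicit_bounds)
    peak = max(bucket_counts) or 1  # avoid dividing by zero when all counts are 0
    pad = max(len(label) for label in labels)
    lines = []
    for label, count in zip(labels, bucket_counts):
        bar = "#" * round(width * count / peak)
        lines.append(f"{label:<{pad}} {bar} {count}")
    return lines


# Hypothetical request-duration histogram (milliseconds).
for line in render_bars([10, 100, 1000], [4, 12, 3, 1]):
    print(line)
```

A viewer with multiple dimension combinations would call something like `render_bars` once per combination, which matches the one-chart-per-combination layout described above.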

Tracing

[Screenshot: traces]

There is no significant new ground trod here - you get a Gantt chart view similar to Jaeger or Zipkin. The difference is that all traces are listed as they are seen - no need to query for them. Trace properties are shown for the selected span.

In this case the color choice is a bit odd as the test app is calling back to itself. The idea is to pick a color for each process, and use a gradient for each operation that occurs. The test app shown is recursive, which is why those particular bands got used.

Some spans include little diamonds, those represent the events that occurred during the span - the details are shown in the pane on the right.

Sep 13 '23 21:09 samsp-msft

What do we do if later users ask for a query language support in the desktop viewer?

Say no - a query language probably isn't required - simple filtering based on existing values is probably sufficient. Depending on implementation tech, something like https://dynamic-linq.net/ for .NET can do that for you, but it should be a low-priority feature.

Sep 13 '23 22:09 samsp-msft

Wanted to resurrect this OTEP a bit after some discussions at KubeCon. I'm going to briefly summarize the top-level changes/thoughts in this comment for the sake of people subscribed to this thread.

In general, I still think this is a good idea, and I also think it's a good idea to have this be a separate component owned by OpenTelemetry rather than trying to get Jaeger or some other project to build it for us. Existing tools are, more or less, good at what they do today. I wouldn't necessarily want to mix the functionality of what this can be with what other tools already are. Moreover, I don't want an OpenTelemetry component to be managed by a different project. Finally, I think one of the reasons OpenTelemetry works as well as it does is because we very explicitly do not prefer any data store, query language, etc. Having Jaeger become 'the default' would change this, even if it was just for a local-only dev experience.

The best analogue to what I have in mind is from the past, specifically, the spate of Installers and Dashboards that popped up in the early days of k8s. There were multiple competing installer scripts and dashboards that, by and large, simply don't exist any more. The Kubernetes Dashboard still exists, but all of its functionality has been absorbed into other ecosystem tools (e.g. console viewers like k9s or UIs in managed k8s environments), same with installation (such as Cluster API and kubeadm).

I view this OTEP as, conceptually, having a similar path. I want us to be able to say as a project, "hey, here is a starting point and a good default for developers and operators who are end-users to be able to see what's going on in a nice UI". You should be able to use it to get real-time feedback on OTTL transforms, on changes to environment variables, to new attributes you're adding to code. It doesn't need persistence, it doesn't need a query language, it really should just be a filterable stream. You should be able to also use this component with OpAMP, to view/read/write changes to configs. We can't rely solely on the community or vendors to create tools here -- if we do, those tools will almost certainly not be licensed in favorable ways, or perhaps will not be as vendor-agnostic as we'd like.

Anyway, please review the updated OTEP and let me know what you think.

Mar 25 '24 16:03 austinlparker

are there any production use cases you are thinking of, or can we explicitly say it's "not for production"?

By itself, I can't really think of a production use case, but I think it's worthwhile to consider that 'production' means a lot of things. Like, if I'm running a homelab with k8s that serves some public services (so it's production to me) then I could see using this to manage OpAMP configs, for example. Would I run a business on it? No.

Could other vendors or community partners come along and make this (or parts of it) into a 'production' system? Sure, in the same way that (conceptually) people took the k8s dashboard and incorporated it into their managed k8s solutions.

Mar 25 '24 17:03 austinlparker

I would say that it's not designed for production use.

Mar 25 '24 17:03 austinlparker

FYI - We added something very similar to .NET Aspire in the form of a developer dashboard for exactly these scenarios. https://devblogs.microsoft.com/dotnet/introducing-dotnet-aspire-simplifying-cloud-native-development-with-dotnet-8/#dashboard-your-central-hub-for-app-monitoring-and-inspection This has proved to be a very popular part of Aspire.

Mar 25 '24 17:03 samsp-msft

FYI - We added something very similar to .NET Aspire in the form of a developer dashboard for exactly these scenarios. https://devblogs.microsoft.com/dotnet/introducing-dotnet-aspire-simplifying-cloud-native-development-with-dotnet-8/#dashboard-your-central-hub-for-app-monitoring-and-inspection This has proved to be a very popular part of Aspire.

I believe I call out Aspire explicitly in the updated OTEP :) It's good stuff.

Mar 26 '24 13:03 austinlparker

I would, very explicitly, say this is not something we should promote for production. To the point that we explicitly say it's not for production.

The in-memory element of this makes it practically impossible, and very resource/cost intensive to run at a level that provides real benefit to production.

I've spent time with the Aspire Dashboard (the thing that @samsp-msft mentioned), specifically looking at the production use-cases, and although they're promoting that use-case, I can say with some confidence that a decent sized site isn't going to get use out of it. I've run it with the Otel demo and it was unusable, purely down to the size of telemetry generated by even such a small site, with a small amount of load.

To go further, the majority of installations of the collector, based on the survey and my own experience with customers and developers, involve multiple instances of the collector, which means that without a distributed datastore it isn't going to work. Given that we don't want to push for a datastore connector for it, that wouldn't make sense. At best it won't be useful, at worst it will end up causing people to think that there's a problem with their traces and logs.

I 100% support the idea behind this OTEP, I think it will be a great addition to the toolkit for Local Development use-cases for debugging, and also for thinking about how to debug telemetry in general.

Mar 26 '24 15:03 martinjt

I would, very explicitly, say this is not something we should promote for production. To the point that we explicitly say it's not for production.

I mean, I think it's useful to spell out some specific use cases here and see if we agree on what 'production' means.

In-Scope:

I'm a solo/hobbyist developer with a home lab. I have a k8s cluster with a few applications deployed. I want to set up the Collector with some data transformations, so I deploy this viewer and connect it to my Collectors in order to view the log stream and transformed data.
I'm a professional developer writing code to add OpenTelemetry instrumentation to an existing or new service. I'm running a local Collector that I'm sending metrics/logs/traces to from my service. I want to quickly see adjustments to the span attributes and new spans that I'm creating, excluding other telemetry that my service may be sending.
I'm an operator that has a self-managed production cluster or deployment of Collectors. I want a drop-in tool that can show me the data stream on an individual collector via a UI.

Out of Scope:

I'm a developer trying to monitor the performance of an application by analyzing telemetry through a dashboard.
I'm an operator trying to manage a fleet of Collector configurations via OpAMP or get their health on a long-term basis.
I'm an OpenTelemetry user trying to record data from my service and visualize it over a long period of time and make this available to other users in my organization.

I would suggest that the out of scope actions are clearly 'production' use cases, but I just want to make sure that we're ok with the in scope items being in scope and being "non-production".

Mar 26 '24 20:03 austinlparker

Phrasing it as production / non-production may not be the best way to talk about it, as it's not about the type of workload that it's used with, but about the purpose and type of analysis that the tool will perform:

It is for instantaneous sniffing of the data to aid developers to see what is being sent.
It is not for doing any kind of analysis that involves history - such as trends, search, alerting, comparison, auditing.

The work that we (Microsoft) are doing with the Aspire dashboard is not intended to replace Azure Monitor/Application Insights as the Azure APM solution. The "production scenarios" for using the Aspire dashboard are to aid developers in diagnosing post-deployment teething problems, or bug repro scenarios. For day-to-day monitoring, alerting, problem detection, trend detection etc., you should use an APM such as Application Insights, Grafana etc.

Mar 26 '24 20:03 samsp-msft

the Aspire dashboard are to aid developers in diagnosing post-deployment teething problems, or bug repro scenarios

That's the issue, with a real production site, a 10k circular buffer is just unusable. For a low volume hobby site, it's probably fine, but anything more than that and it's basically not useful for those scenarios. I suppose unless your site literally stops and doesn't actually serve the traffic, but there are better ways to solve that.
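The concern can be put into rough numbers: a fixed-size buffer retains only capacity divided by ingest rate seconds of history. The sketch below assumes the 10k figure above counts individual records; the ingest rates are hypothetical and chosen only to show the order of magnitude.

```python
def retention_seconds(buffer_capacity, records_per_second):
    """How many seconds of history a full circular buffer covers at a steady rate."""
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return buffer_capacity / records_per_second


# Hypothetical steady ingest rates: a hobby site, a small service, a busy service.
for rate in (5, 500, 5000):
    seconds = retention_seconds(10_000, rate)
    print(f"{rate:>5} records/s -> {seconds:g} s of history")
```

At 5 records per second the buffer spans over half an hour, but at 500 per second it spans 20 seconds, which leaves very little time to spot anything before it scrolls away.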

I say that as someone who loves the dashboard, and has spent a lot of time using it so far. Saying it's for a production deployment is a mistake in my opinion, and it's going to have more of a negative impact than positive on the whole telemetry movement.

I'm a solo/hobbyist developer with a home lab. I have a k8s cluster with a few applications deployed. I want to set up the Collector with some data transformations, so I deploy this viewer and connect it to my Collectors in order to view the log stream and transformed data.

I can see that being an ok usecase, but in that scenario, I don't think that should be a goal, or something that is actively catered for. Those people will likely do it anyway.

I'm an operator that has a self-managed production cluster or deployment of Collectors. I want a drop-in tool that can show me the data stream on an individual collector via a UI.

My issue here is that it's a stream, a fast stream, that circular buffer won't be enough to actually catch anything as you can't scroll back. The same issue as with the Aspire production scenario. The narrowness of when that usecase is valid, and when it's unusable/not useful is so small that I don't think it's a usecase that should be a goal for the project.

Mar 26 '24 20:03 martinjt

Those people will likely do it anyway.

👍

Mar 26 '24 21:03 trask

I think you're underestimating how many values can be stored in memory, but whatever, the lack of persistent storage makes it "not for production use" by default. I'd rather we focus on things that we can definitively state rather than what the definition of is, is, vis a vis "what is production"

Mar 26 '24 22:03 austinlparker

I love the idea, it's a great problem to solve. I'd join in helping with the user flows and overall UX.

Relations with other CNCF projects

For Dashboards, one could consider Perses, in candidate status. It's currently a few widgets short of what this OTEP seems to need (like heatmaps and timelines), but IMO it would be a net benefit for the community if we ended up contributing widgets to round up the usual observability visualizations.

On how to communicate about (lack of) production-readiness

As for how to make sure that experienced people will not use it in production, NOT supporting the following usually does the trick:

Persistence of data (already covered as of bcf3fd55b0c6e6fb2ca1abf00651165bc36574e4)
Alerting (currently not spelled out as out-of-scope AFAICT)

On the storage

About the data storage, I am wondering if we could not also store the data browser-side: the browser would retain a part of the data (a moving window with memory cap?) as they are streamed by the collector, reducing the amount of buffering needed in the collector itself.

The main side-effect would be that different user sessions would see different subsets of data, but that seems to me like an acceptable tradeoff, and it likely moves most of the complexity into the frontend, where it's (in my experience) cheaper and faster to develop and iterate.
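A "moving window with memory cap" could look roughly like the sketch below. It is written in Python purely for illustration (a browser client would do the same in JavaScript), the class name `CappedWindow` is hypothetical, and the JSON-encoded length is used as a stand-in for real memory usage.

```python
import json
from collections import deque


class CappedWindow:
    """Holds the newest records whose combined encoded size fits under max_bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self._items = deque()  # (encoded_size, record) pairs, oldest first

    def push(self, record):
        # Approximation: size of the JSON encoding, not actual heap usage.
        size = len(json.dumps(record).encode("utf-8"))
        self._items.append((size, record))
        self.total_bytes += size
        # Evict from the old end until the window fits again. A single record
        # larger than the cap is evicted too, leaving the window empty.
        while self.total_bytes > self.max_bytes and self._items:
            old_size, _ = self._items.popleft()
            self.total_bytes -= old_size

    def records(self):
        return [record for _, record in self._items]


window = CappedWindow(max_bytes=20)
for i in range(3):
    window.push({"i": i})  # each record encodes to 8 bytes
```

With a 20-byte cap only the two newest 8-byte records survive. Because every client evicts on its own schedule, two sessions that connect at different times end up with different windows, which is the tradeoff described above.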

Apr 09 '24 10:04 mmanciop

Alright so I've been hacking on a project for a while that I'm finally ready to release a bit more broadly. It fits the following criteria from Austin's comments:

gives developers and operators a way to view what's going on in the collector in real time
does some minimal filtering on resource and attributes on a live stream of telemetry (works for metrics, traces and logs)
optionally connects to a collector via OpAMP (right now it just lets you view config, view identifying/non identifying attrs)
wholly vendor neutral!
very minimal server overhead, I have it running as a sidecar to running collectors and passively uses around 25-50m of memory!
no storage needed, everything is streamed directly to the client!

You can view the repo here and my blog post that goes over a quick demo of finding some logs missing an attribute, updating the collector config, and viewing it worked!

Here are some screenshots: [home, clicked, filters, config]

I'd love to answer any questions or take any feedback anyone has to make it suit this issue.

Apr 19 '24 03:04 jaronoff97

@jaronoff97 Thanks! This is pretty much exactly what I had in mind, yeah.

Apr 19 '24 12:04 austinlparker

Not trying to sell anything here, I got informed about this topic from the OpenTelemetry Slack channel. I've just released my own tool for the very same purpose here: https://tracelens.io/

The focus is on visualization and on helping developers better understand what is going on in a distributed system, e.g. what is actually happening in an IoT solution or in a game.

Maybe of value to someone else here, or delete my comment if too unrelated.

Apr 29 '24 08:04 rogeralsing
