Aspire Dashboard - Persist telemetry data

Open sharpSteff opened this issue 1 year ago • 8 comments

Currently the telemetry data is held in a circular buffer in memory. It would be great to have the option to persist the data.

sharpSteff avatar May 22 '24 14:05 sharpSteff

@sharpSteff can you elaborate a little more on the scenario here? Are you focussed on the local dev loop, or a hosted dashboard? What about data that's missed while the dashboard is offline/restarting?

drewnoakes avatar May 23 '24 13:05 drewnoakes

@drewnoakes I would like to elaborate on this feature request, as it would be very beneficial for my use case.

Using the Standalone Aspire Dashboard, and having multiple microservices use OTEL to send data to this dashboard is vital for log aggregation and metrics.

However, server maintenance often requires patching and rebooting, which results in the loss of all data stored in memory by the Standalone Aspire Dashboard as mentioned here. This can be problematic for production scenarios.

This feature would help prevent the loss of log files during such reboots. While I don't expect the Standalone Aspire Dashboard to retain logs indefinitely, a grace period of 30-90 days would suffice for most production scenarios. This feature would also prevent data loss when updating the Docker image, since that requires the container to be restarted.

I wrote to @davidfowl on LinkedIn and he mentioned that there are currently no short-term plans for long-term storage ;-)

Drsela avatar May 23 '24 14:05 Drsela

I think reboots are the most reasonable reason to support optionally persisting telemetry. That said, we do not want to build a data storage model optimized for querying logs, metrics and traces. My thought is that we would support a best effort flushing telemetry to disk on graceful shutdown (or on some interval) to handle the survival of reboots/upgrades. This isn't a scalable persistent store optimized for queries (this is where real APM systems excel).

To be clear on the intent here:

  • This dashboard should be considered a "live view" of telemetry with a fixed buffer size (configurable).
  • There's no plan to build a scalable backend storage or pluggable storage engine for the dashboard.
  • If you have a large-scale system sending lots of telemetry, the dashboard likely won't work very well (for example, a large cluster with thousands of services sending telemetry). This is why most large-scale APM systems separate collection and analytics.
  • This would not be pluggable storage; we write to disk. I'd like to avoid anything complex and extensible. You can mount a volume or copy that data blob if you need it to go elsewhere.
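
(Editorial sketch, not the dashboard's actual implementation.) The "fixed buffer plus best-effort flush to disk" idea described above can be illustrated roughly as follows. The class name `TelemetryBuffer`, the JSON snapshot format, and the file path are all hypothetical and chosen only for illustration; the thread does not specify any of them.

```python
import json
import os
import tempfile
from collections import deque


class TelemetryBuffer:
    """Hypothetical fixed-size buffer: the oldest records are dropped when full."""

    def __init__(self, capacity, path):
        self.path = path                      # snapshot file (assumed location)
        self.items = deque(maxlen=capacity)   # bounded, in-memory "live view"

    def add(self, record):
        self.items.append(record)

    def flush(self):
        """Best-effort snapshot: write a temp file, then atomically swap it in."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(list(self.items), f)
            os.replace(tmp, self.path)
        except OSError:
            # Best effort: a failed snapshot must not take the process down.
            if os.path.exists(tmp):
                os.remove(tmp)

    def load(self):
        """Restore the last snapshot if one exists; the capacity still applies."""
        try:
            with open(self.path, encoding="utf-8") as f:
                self.items.extend(json.load(f))
        except (OSError, ValueError):
            pass  # no snapshot, or a corrupt one: start with an empty buffer
```

In a sketch like this, `flush()` would be called from a shutdown hook (for example via `atexit.register`) or on a timer, and `load()` once at startup. Anything received while the process is down is still lost, which matches the "best effort" framing above.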

davidfowl avatar May 24 '24 03:05 davidfowl

@Drsela sums it up quite nicely. Aspire Dashboard is sufficient for my OTLP use case. I'd like to see the last couple of months of data and don't want to cry when my host has to restart. I do not need separated otlp-collector and db to ensure every bit of data. I use it more like to show trends.

sharpSteff avatar May 24 '24 03:05 sharpSteff

I do not need separated otlp-collector and db to ensure every bit of data. I use it more like to show trends.

How are you going to get trends with the dashboard? We're not adding those features.

davidfowl avatar May 24 '24 04:05 davidfowl

I do not need separated otlp-collector and db to ensure every bit of data. I use it more like to show trends.

How are you going to get trends with the dashboard? We're not adding those features.

I added a custom sub-page myself that presents the data.

sharpSteff avatar May 24 '24 07:05 sharpSteff

Are you running a fork of the dashboard?

mitchdenny avatar May 24 '24 09:05 mitchdenny

My thought is that we would support a best effort flushing telemetry to disk on graceful shutdown (or on some interval) to handle the survival of reboots/upgrades

This would be sufficient for our use case :-) Right now we're using a self-developed dashboard that reads NLog files (rolling window + 7 days) and visualises them. We also use Application Insights for 'true' APM.

Replacing our own dashboard with standalone Aspire would be awesome. And flushing the telemetry to disk on graceful shutdown would be a great way for us to migrate to standalone Aspire.

Drsela avatar May 26 '24 10:05 Drsela

Moving to backlog as this is not going to be in 9.0

samsp-msft avatar Aug 26 '24 19:08 samsp-msft

Since this was already triaged wouldn't it make sense to move it to 10.0?

alrz avatar Aug 26 '24 20:08 alrz

@davidfowl

we do not want to build a data storage model optimized for querying logs, metrics and traces.

It's not unusual for APM solutions to use a third-party service for storage, e.g. SigNoz uses ClickHouse. So they mainly only provide UI and RBAC. Do you imagine the dashboard growing into a full observability solution intended for production? I like the fact that it could provide sensible defaults for ASP.NET Core/Aspire apps.

alrz avatar Aug 30 '24 08:08 alrz

Since this was already triaged wouldn't it make sense to move it to 10.0?

We'll reprioritize with other things once we get to scheduling .NET Aspire 10.0 investments. It's possible that it could be 9.x too.

mitchdenny avatar Sep 01 '24 23:09 mitchdenny

I'd encourage you to reconsider your goals.

The state of observability solutions is awful. Developers need to combine at least 4 different products (Grafana, Prometheus, Jaeger, Loki) just to achieve the same basic tools that Aspire Dashboard provides. So in this context, there's a big opportunity for it to become a widely used tool if you support production scenarios.

One of the causes of this problem is precisely the industry's obsession with "scalable" solutions. The vast majority of developers don't need a dedicated collector or a separate, specialized database. I have not seen any evidence that a standard PostgreSQL database, or maybe something like DuckDB, wouldn't be more than enough for persistence. In fact, DuckDB seems ideal for this use case: it's embeddable, simple to use, reliable and efficient.

The point is, persisting data should not be hard, and any solution here is better than no solution.

piju3 avatar Oct 18 '24 16:10 piju3

In your opinion, where's the line between this and a full blown APM? There's a level of scale we will never optimize for which I think is understood, but there's definitely an interesting small scale draw for having a simple to deploy all in one tool.

davidfowl avatar Oct 18 '24 20:10 davidfowl

I'll admit I've only started using telemetry tools recently, so I don't know how far they go. But from what I've seen, I think Aspire Dashboard is already a decent APM tool, minus the persistence. Hence why I think it has potential.

You'd probably need to test what kind of scale it can handle and give some indication in the documentation so people can know if it's going to be enough for their use case.

piju3 avatar Oct 19 '24 15:10 piju3

@piju3

I'd encourage you to reconsider your goals.

The state of observability solutions is awful. Developers need to combine at least 4 different products (Grafana, Prometheus, Jaeger, Loki) just to achieve the same basic tools that Aspire Dashboard provides. So in this context, there's a big opportunity for it to become a widely used tool if you support production scenarios.

I use Honeycomb.io for some of my projects in production (and Aspire Dashboard for local development). Maybe this could also work for you. You could do a quick test, because all it takes is a free account and a change in your OpenTelemetry configuration.

VolkmarR avatar Nov 14 '24 11:11 VolkmarR

In your opinion, where's the line between this and a full blown APM? There's a level of scale we will never optimize for which I think is understood, but there's definitely an interesting small scale draw for having a simple to deploy all in one tool.

I can understand how any step in the direction of APM poses technical and commercial (AppInsights) challenges, as well as challenges to the project scope and focus. So what if the goal was simply to support deployment of the collector and dashboard as a sidecar or standalone container? Like a modern version of ELMAH. It would help with troubleshooting the new publishing providers and discrepancies between the system in dev and in production. It's a natural extension of Aspire's new deployment features without committing to anything beyond the current project scope.

I for one would love to integrate a little admin-only drawer on my site that proxied the dashboard so I can watch user activity in realtime and see how the resources are handling transition to their deployed homes.

📎 This may already be possible; I just had little luck deploying the "standalone" dashboard on AWS (which, commercially, would be a nice place to establish a little beachhead, and only fair as they cheekily took the Application Insights name for their own overwrought APM system).

📎 Apologies in advance if ELMAH is already a modern version of ELMAH, just a good example of how a small scale troubleshooting extension provided value in the past.

jakenuts avatar May 20 '25 15:05 jakenuts

We will ship some minimal persistence model that will not scale to large deployments within the next few releases.

davidfowl avatar May 20 '25 15:05 davidfowl

We will ship some minimal persistence model that will not scale to large deployments within the next few releases.

I'm very glad to hear that 😄

Would you mind sharing where you are (roughly) drawing the line between large-scale and non-large-scale deployments?

wertzui avatar May 20 '25 15:05 wertzui

The dashboard itself being the collector (something we don't plan to change in the short term) means it's the bottleneck for ingesting telemetry. There will be inherent limits to this model, which is why you pay $$$ for APM systems (they are really databases in disguise 😄). This will work fine for smaller or low-traffic deployments that don't need a massively scalable telemetry system but want to onboard to OpenTelemetry.

We also have plans to support recording state for test runs so we can replay them in the dashboard. Not exactly what this issue is asking for, but another reason to support saving/importing dashboard data.

davidfowl avatar May 20 '25 15:05 davidfowl