
Visualising Orleans

richorama opened this issue 7 years ago · 24 comments

This issue is to capture ideas/requirements for visualising Orleans, with the hope that we can form a strategy to build some cohesive tooling (probably not in the main Orleans repo to begin with).

There are a number of initiatives to make Orleans easier to visualise and monitor runtime performance:

Outside of Orleans, Service Fabric includes a dashboard out of the box.

Netflix also has some impressive visualisations

What do people want to see? A dashboard, an API, a control panel?

richorama avatar Aug 15 '16 14:08 richorama

I'd like to be able to monitor custom metrics defined and reported by Orleans systems. While activations per silo or grain type are a good high-level gut check that the system is running as expected and the partitioning strategy seems to be working, a much more granular and domain-specific approach to metrics is needed for full transparency and deep insight into live systems running at production loads.

This could be as simple as tracking business-valued events with Key Performance Indicators (KPIs) in production, performance monitoring metrics that only run in debug builds in QA, or through silo interception to track uncaught exceptions and grain method calls for full traceability to enable traffic-flow visualizations (like those done in Netflix's Vizceral).

The metrics tracking telemetry consumer is UI agnostic, and provides no UI of its own. Instead, it can be queried for current metrics values at any time, or it can push metrics snapshots out to a virtual stream at any rate (defaulting to once per second). Visualization tools could either subscribe to this metrics stream directly; or an intermediary could pull from the metrics extension and push to the visualization surface.
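Consuming that stream might look roughly like this: a hypothetical client-side subscriber using the Orleans streams API. The snapshot type and stream/namespace names here are placeholders invented for illustration, not the extension's actual identifiers.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans.Streams;

// Placeholder shape for a metrics snapshot; the real extension's type will differ.
public class MetricsSnapshot
{
    public DateTime Timestamp { get; set; }
    public Dictionary<string, double> Metrics { get; set; }
}

public class MetricsStreamSubscriber
{
    public async Task SubscribeAsync(IStreamProvider provider, Guid streamId)
    {
        // "ClusterMetricSnapshots" is an assumed namespace for this sketch.
        var stream = provider.GetStream<MetricsSnapshot>(streamId, "ClusterMetricSnapshots");
        await stream.SubscribeAsync((snapshot, token) =>
        {
            // Forward each once-per-second snapshot to the visualization surface.
            Console.WriteLine($"{snapshot.Timestamp}: {snapshot.Metrics.Count} metrics");
            return Task.CompletedTask;
        });
    }
}
```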

I like the idea of separating back-end metrics functionality from front-end UI visualization in general, even if the metrics tracker isn't ultimately the best fit in any given scenario. It lets good UI developers focus on making the UI fantastic, and a shared backend means it can be evolved toward a mature set of great features which many tool UIs can take advantage of, rather than having to reinvent the backend wheel for each new monitoring, dashboard, or traffic visualization tool.

danvanderboom avatar Aug 15 '16 14:08 danvanderboom

Awesome initiatives!

As part of #368, I'm working right now on making the counters push data to IMetricTelemetryConsumers, so the internal built-in counters will be exposed to anyone through the telemetry APIs, and projects like @danvanderboom's would be able to easily catch the incoming Orleans metrics.

So both internal KPIs (from the counters/statistics framework) and custom domain-specific ones would be pushed to the telemetry API, making those integrations even easier.
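Once the counters flow through the telemetry APIs, catching them is a matter of implementing a consumer. A minimal sketch follows, using the Orleans 1.x-era `IMetricsTelemetryConsumer` signatures; these may differ in other versions, so treat the method shapes as an assumption to verify.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using Orleans.Runtime;

// In-memory consumer: keeps the latest value of every metric it receives.
public class InMemoryMetricsConsumer : IMetricsTelemetryConsumer
{
    private readonly ConcurrentDictionary<string, double> _values =
        new ConcurrentDictionary<string, double>();

    public void TrackMetric(string name, double value, IDictionary<string, string> properties = null)
        => _values[name] = value;

    public void TrackMetric(string name, TimeSpan value, IDictionary<string, string> properties = null)
        => _values[name] = value.TotalMilliseconds;

    public void IncrementMetric(string name) => IncrementMetric(name, 1);

    public void IncrementMetric(string name, double value)
        => _values.AddOrUpdate(name, value, (_, current) => current + value);

    public void DecrementMetric(string name) => IncrementMetric(name, -1);

    public void DecrementMetric(string name, double value) => IncrementMetric(name, -value);

    public void Flush() { }
    public void Close() { }
}
```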

I would like to have a control panel that visualises the cluster state but at the same time can make changes to the configuration, and maybe perform some operations on it, like simple grain calls. But interaction with it, I think, is a second step and should be discussed in another issue. Visualising it with a nice tool would be perfect.

galvesribeiro avatar Aug 15 '16 15:08 galvesribeiro

I concur with @danvanderboom. I would like to ingest events sourced from both the Orleans system and the runtime. Specific examples are technical data such as performance counter data (also without a third-party component) and Windows Events, business-specific KPIs, and events derived from these. I would also like to define the persistence sink, or several of them. I don't know how access control factors into this, but it should be doable.

veikkoeeva avatar Aug 15 '16 15:08 veikkoeeva

I should also include a reference to @ReubenBond's console.

The dashboard requires temporary storage of metrics (i.e. values aggregated over short periods (around a second) and stored for ~1 minute). It also needs to query metrics in a variety of ways, i.e. slicing by grain/method, silo etc., and aggregating results (i.e. sum, max, min, avg). It strikes me that there should be some kind of database (SQLite?) to fulfil this?

It would be nice to have some clean, simple interfaces, and then we could have a pluggable architecture where we can combine together different approaches for sampling and collecting data, storing data, and displaying data.
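One possible shape for those pluggable interfaces is sketched below. Every name here is hypothetical — it only illustrates the split between sampling/collecting, storing, and querying that the suggestion describes.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// A single observation: what, where, when, how much.
public class MetricSample
{
    public DateTime Timestamp { get; set; }
    public string Name { get; set; }   // e.g. a grain/method latency counter
    public string Silo { get; set; }
    public double Value { get; set; }
}

// Sampling/collecting side: anything that can produce samples on demand.
public interface IMetricsSource
{
    Task<IReadOnlyList<MetricSample>> SampleAsync();
}

// Storage side: short-lived store (could be SQLite-backed) that also
// answers sliced/aggregated queries for the display layer.
public interface IMetricsStore
{
    Task WriteAsync(IEnumerable<MetricSample> samples);

    // Slice by grain/method, silo, etc. and aggregate (sum, max, min, avg)
    // over a recent window.
    Task<double> QueryAsync(string metric, string groupBy, string aggregate, TimeSpan window);
}
```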

richorama avatar Aug 15 '16 15:08 richorama

@richorama do you think we should add this interface on telemetry APIs? I'm about to make changes in it soon, so shoot anything so I can put all together.

galvesribeiro avatar Aug 15 '16 15:08 galvesribeiro

@galvesribeiro I don't know enough about the telemetry API to make the call.

richorama avatar Aug 15 '16 15:08 richorama

I mean, some people once asked me for a way to "cache" the values and batch-push them in a timed loop. So what you are suggesting with SQLite is to make it resilient to silo failures, so that metrics don't get lost if a silo crashes, right?

galvesribeiro avatar Aug 15 '16 15:08 galvesribeiro

@richorama About "some kind of database", I suppose that means an interface for the sink and another one for it as a source. I see an in-memory cache and durable storage as two sinks and/or sources which I'd like to define at my leisure.

Edit: And for these sinks and sources, I think we should take cues from other new developments in the .NET world and Orleans, specifically the streaming initiatives.

veikkoeeva avatar Aug 15 '16 15:08 veikkoeeva

Ok, now I understand :)

The IXXXXTelemetryConsumer interfaces are the ingest entry point for telemetry. The implementation of those consumers can write to its own sink, or just push directly to the underlying APM tool. Do we really need to introduce a new set of interfaces to behave as a sink?

galvesribeiro avatar Aug 15 '16 15:08 galvesribeiro

I consider some reliability factors of metrics caching in https://github.com/danvanderboom/Orleans.TelemetryConsumers.MetricsTracker/issues/4, including an idea for recovering silo metric data on restart, as long as the cluster itself is up and the ClusterMetricsGrain hasn't disappeared with the failed silo.

In that issue, I talk about how some metrics don't have any meaning across silo restarts (like current CPU or memory utilization), while some metrics and counters (like total number of exceptions across the cluster, or dollars of sales accumulated for the day) do have meaning and should be persisted across restarts. So it would be nice to specify for each metric whether it has meaning that spans individual silo restarts, to somehow differentiate it from metrics which can be safely discarded, starting over each time.
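That distinction could be expressed as a per-metric durability flag. These are hypothetical types, sketched only to restate the idea in code:

```csharp
// Whether a metric's value is meaningful across silo restarts.
public enum MetricDurability
{
    Transient,   // e.g. current CPU or memory utilization; safe to discard on restart
    Persistent   // e.g. total exceptions, accumulated sales; should survive restarts
}

// Each metric declares up front how it should be treated on restart, so the
// tracking layer knows which values to persist and which to reset to zero.
public class MetricDefinition
{
    public string Name { get; set; }
    public MetricDurability Durability { get; set; }
}
```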

danvanderboom avatar Aug 15 '16 15:08 danvanderboom

Hmm, about "dollars of sales accumulated for the day": to be more exact than in my previous comment, I'd consider that an application-specific KPI I would likely accumulate and show by other means. The KPIs I was thinking of were more about technical SLAs. Though I would assume that if someone wanted the system to show such metrics, it should be doable.

That being written, it looks like we have something concrete to grasp on. :)

veikkoeeva avatar Aug 15 '16 16:08 veikkoeeva

Metrics APIs:

TrackMetric, IncrementMetric, and DecrementMetric make a good pattern. These are good for tracking double values.

I'd love to see these added:

  • TrackCounter, IncrementCounter, DecrementCounter - for long integers
  • TrackTimeSpanMetric, IncrementTimeSpanMetric, DecrementTimeSpanMetric

Though I don't have a strong opinion on whether these are added to IMetricsTelemetryConsumer, or whether they get split into different interfaces, etc.
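Restating the proposal above as a sketch — this interface does not exist in Orleans; it only captures the suggested method set as code:

```csharp
using System;

// Hypothetical companion to IMetricsTelemetryConsumer for long integers
// and TimeSpan values, mirroring the Track/Increment/Decrement pattern.
public interface ICounterTelemetryConsumer
{
    void TrackCounter(string name, long value);
    void IncrementCounter(string name, long delta = 1);
    void DecrementCounter(string name, long delta = 1);

    void TrackTimeSpanMetric(string name, TimeSpan value);
    void IncrementTimeSpanMetric(string name, TimeSpan delta);
    void DecrementTimeSpanMetric(string name, TimeSpan delta);
}
```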

danvanderboom avatar Aug 15 '16 16:08 danvanderboom

@veikkoeeva I don't actually have any business-value KPI needs that I'm planning to address with my recent work on metrics tracking. Those scenarios might be better tracked within logic in grains, where grain persistence (or other persistence mechanisms) are in place, rather than in something built as a cross-cutting concern to the Orleans system's domain and primary focus. (I'm brainstorming a little as I write here.:)

danvanderboom avatar Aug 15 '16 16:08 danvanderboom

  • TrackCounter, IncrementCounter, DecrementCounter - for long integers
  • TrackTimeSpanMetric, IncrementTimeSpanMetric, DecrementTimeSpanMetric

Let me get to the telemetry API again and I'll hit you again with it.

galvesribeiro avatar Aug 15 '16 16:08 galvesribeiro

We'll need to be careful about persisting any metrics to disk. If we're persisting every metric update, every counter increment, it will be difficult not to slow the system down. At least in some high-frequency metrics capturing scenarios, if not everywhere, it might be preferable to "compress" a bunch of counter increment/decrement calls into net deltas before flushing them to disk, so long as they occur really close together.
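The "compression" idea might look like this — a hypothetical helper (not part of any existing extension) that coalesces a burst of increment/decrement calls into net deltas, so each flush interval produces at most one durable write per counter:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public class DeltaCompressor
{
    // Net pending delta per counter since the last flush.
    private readonly ConcurrentDictionary<string, long> _pending =
        new ConcurrentDictionary<string, long>();

    public void Increment(string name, long delta = 1)
        => _pending.AddOrUpdate(name, delta, (_, current) => current + delta);

    public void Decrement(string name, long delta = 1) => Increment(name, -delta);

    // Called on a timer (e.g. several times per second): drains the net
    // deltas, which the caller then writes to durable storage in one batch.
    public Dictionary<string, long> Flush()
    {
        var batch = new Dictionary<string, long>();
        foreach (var key in _pending.Keys)
        {
            if (_pending.TryRemove(key, out var net) && net != 0)
                batch[key] = net;   // one write per counter, not per call
        }
        return batch;
    }
}
```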

The question is: are we interested in preserving individual deltas and updates, responding to each with durable storage activity? Or are we interested more in occasional snapshots, even if by "occasional" we mean several/many times per second?

What level of "durability" is needed for tracking silo and cluster level metrics? So long as the cluster survives, reliability mechanisms like RAFT have been proposed in Orleans Gitter chats to quickly recover metrics from silo failures and restarts.

Some metrics will need to be more durably and reliably tracked than others.

danvanderboom avatar Aug 15 '16 16:08 danvanderboom

@danvanderboom that is what I'm trying to say... It's a judgement call that the author of a custom telemetry consumer must make when implementing it. I don't think your extension should care about it. If someone cares about persisting it, they can create another consumer that just stores it.

Remember that telemetry ingestion is multicast (you can have multiple consumers for the same kind of telemetry).

galvesribeiro avatar Aug 15 '16 16:08 galvesribeiro

@galvesribeiro Agreed. I'm doing that myself, having implemented a second telemetry consumer to route logging calls to a logging service and database.

danvanderboom avatar Aug 15 '16 17:08 danvanderboom

With talk of building an OrleansHostManager which could host multiple Orleans silos on a single machine (and manage rollouts/rollbacks, health monitoring), we're probably talking another level of aggregation for statistics/metrics. We'll want to see how a machine is performing, not just a silo, if multiple silos can be hosted on one machine. It wouldn't make sense to report CPU and memory utilization per silo, for example, if there are several silos running on each node.

This was only one suggested approach for how Orleans might adopt some SF-like features, and may not be the best alternative. But the question of how this could affect the collection and aggregation of metrics is valid regardless of the specific approach, and it'd be nice to see that coming instead of being bitten by it down the road.

danvanderboom avatar Aug 15 '16 21:08 danvanderboom

One concern I have: it looks like there can only be one silo interceptor set at one time, is that right? If multiple extensions or visualizers potentially try to intercept all grain method calls via silo interceptors, I wonder about them stepping on each other, with the last interceptor to be registered taking control.

Can silo interceptors be changed to enable multiple to be registered, similar to the way multiple telemetry consumers can be registered?

What are the "rules of the road" for setting these interceptors, especially from Orleans extensions?

danvanderboom avatar Aug 15 '16 21:08 danvanderboom

The goals for Orniscient are primarily along the lines of visualising specifically virtual actors. We see it as both a training tool, and something that could be used in production to navigate the grain "metaverse".

We view virtual actors as self-contained nano services which in theory should be relatively safe to invoke methods on ad-hoc, and so the ability to find a grain, and invoke a method on it with parameters is something our ops team can benefit from.

creyke avatar Aug 23 '16 14:08 creyke

Also, we'd like to provide an example project running in the cloud with a test cluster, so users unfamiliar with Orleans, or even actors in general, can understand the programming model visually on a website.

creyke avatar Aug 23 '16 14:08 creyke

@danvanderboom when adding a silo interceptor you should call any previously set interceptor.
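That chaining convention could look roughly like this. It assumes the Orleans 1.x-era SetInvokeInterceptor/GetInvokeInterceptor pair on IProviderRuntime and the InvokeInterceptor delegate shape from that era — verify the names and signatures against your Orleans version before relying on them.

```csharp
using System.Reflection;
using Orleans;
using Orleans.CodeGeneration;
using Orleans.Providers;

public static class InterceptorChaining
{
    public static void Install(IProviderRuntime runtime)
    {
        // Capture whatever interceptor was registered before us.
        var previous = runtime.GetInvokeInterceptor();

        runtime.SetInvokeInterceptor(async (MethodInfo method, InvokeMethodRequest request,
                                            IGrain grain, IGrainMethodInvoker invoker) =>
        {
            // ...record the call for monitoring/visualization here...

            // Hand off to the earlier interceptor if one exists; otherwise
            // invoke the grain method directly.
            return previous != null
                ? await previous(method, request, grain, invoker)
                : await invoker.Invoke(grain, request);
        });
    }
}
```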

richorama avatar Aug 23 '16 15:08 richorama

Adding a link to the meetup where visualization solutions were shown - https://github.com/OrleansContrib/meetups#meetup-11-a-monitoring-and-visualisation-show-with-richard-astbury-dan-vanderboom-and-roger-Creyke.

sergeybykov avatar Nov 05 '16 01:11 sergeybykov

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.

ghost avatar Jul 28 '22 20:07 ghost