
Observability of Node-RED instances

Open sammachin opened this issue 3 years ago • 10 comments

Description

Display and record/chart the resources used by each project, e.g.:

  • CPU
  • Memory
  • Storage (do flows have access to any persistent storage?)
  • Bandwidth (volume and throughput)

Users should be able to view this for any project where they are a member of the relevant team; admins should be able to view all projects.

Greg Stoutenburg edit:

I've created this mockup to get us moving on how we can surface resource usage factors: https://www.figma.com/board/VfWsq55uv2ltUE4fVZW0Hy/Observability-Improvements?node-id=0-1&p=f&t=Qm8e1zTXCtLWHXsr-0

Tasks for improving observability:

  • Create Application Health area. This will live within each Application and surface usage facts for CPU, Memory, Storage, and Bandwidth. (Anything missing?) Charts will show overall application health on each of these, as well as recent historical performance. Particular items of concern will be listed and linked to, along with a notice that performance issues can be resolved by upgrading instance size, rearranging a flow across multiple instances, or editing the flow to use fewer resources.

  • Create Team Health area. This will be a dashboard that provides overall resource usage facts across all applications within a Team, and links to Applications that need attention. Add number of MQTT messages sent/received. Should anything else be added?

  • For both Application Health and Team Health, present a flag that shows the number of items of concern.

  • Determine what should count as "Good", "Concern", and "Warning/Danger". We should present Application and Team health in ways that are clear, intuitive, and actionable. We should be opinionated about Instance health and not support users running instances that are performing poorly, as this is a bad user experience and can reflect poorly on the product. For these reasons, show a resource in the "Concern" band when it may be on the way to performing sub-optimally, even if it has not yet crossed the level that merits an alert or "Warning/Danger" status. (A rough sketch of how these bands could be encoded follows this list.)
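
As a very rough illustration of how these bands might be encoded: the band names come from the task above, but the threshold values and function shape below are assumptions for discussion, not agreed behaviour.

```typescript
// Hypothetical health-band classifier; the band names come from the task above,
// but the threshold values are illustrative assumptions, not agreed numbers.
type HealthStatus = 'good' | 'concern' | 'warning';

interface Thresholds {
  concern: number; // fraction of the stack limit at which we flag "Concern"
  warning: number; // fraction at which we flag "Warning/Danger" and alert
}

const DEFAULT_THRESHOLDS: Thresholds = { concern: 0.6, warning: 0.75 };

function classify(usage: number, limit: number, t: Thresholds = DEFAULT_THRESHOLDS): HealthStatus {
  const ratio = usage / limit;
  if (ratio >= t.warning) return 'warning';
  if (ratio >= t.concern) return 'concern';
  return 'good';
}

// Example: an instance using 1.3 GB of a 2 GB memory limit lands in the "Concern" band.
console.log(classify(1.3, 2)); // "concern"
```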

Tasks for improving notifications about observability:

  • All Tiers: whenever an instance crashes or hits a Warning/Danger threshold (currently >75% resource usage), send an email and present an alert in the Notification Hub. Have this alert direct back to the Application Health page for the application where the instance resides.

  • All Tiers: each week, send an email to all users to check their Team Health area. Frame it as being about maintaining good performance on their Node-RED instances.

  • On the Free and Starter tiers, send both an email and notification hub message whenever an instance has a status of Concern or Warning. Prevent disabling these emails.

  • On Team and Enterprise, send alerts for Warning status. Provide an option in settings to deliver weekly Team Health notifications via email or the Notification Hub, or to unsubscribe from them.

  • On Team and Enterprise, provide an option in settings to receive a weekly notification via email or the Notification Hub informing the user about peak usage for all resources across all applications. (A configuration sketch of these per-tier rules follows this list.)
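
A hedged configuration sketch of the per-tier rules above: the tier names mirror the epic text, but the shape of the policy object and every value in it are illustrative assumptions, not agreed behaviour.

```typescript
// Hypothetical per-tier notification policy. The tier names mirror the epic text;
// the shape of this object and all values in it are illustrative, not agreed behaviour.
type Tier = 'free' | 'starter' | 'team' | 'enterprise';
type Channel = 'email' | 'notificationHub';
type AlertEvent = 'crash' | 'concern' | 'warning';

interface NotificationPolicy {
  alertOn: AlertEvent[];          // which events trigger an alert
  channels: Channel[];            // where the alert is delivered
  weeklyTeamHealthEmail: boolean; // weekly "check your Team Health" digest
  canUnsubscribe: boolean;        // whether the user may disable these alerts
}

const POLICIES: Record<Tier, NotificationPolicy> = {
  free:       { alertOn: ['crash', 'concern', 'warning'], channels: ['email', 'notificationHub'], weeklyTeamHealthEmail: true, canUnsubscribe: false },
  starter:    { alertOn: ['crash', 'concern', 'warning'], channels: ['email', 'notificationHub'], weeklyTeamHealthEmail: true, canUnsubscribe: false },
  team:       { alertOn: ['crash', 'warning'],            channels: ['email', 'notificationHub'], weeklyTeamHealthEmail: true, canUnsubscribe: true },
  enterprise: { alertOn: ['crash', 'warning'],            channels: ['email', 'notificationHub'], weeklyTeamHealthEmail: true, canUnsubscribe: true }
};

// Example: should a Team-tier user be alerted when an instance enters "Concern"? No.
function shouldNotify(tier: Tier, event: AlertEvent): boolean {
  return POLICIES[tier].alertOn.includes(event);
}
console.log(shouldNotify('team', 'concern')); // false
```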

Related:

  • https://github.com/FlowFuse/flowfuse/issues/2755

sammachin avatar Jan 24 '22 08:01 sammachin

Limits for this (on platforms that support it) will be controlled by Project Stacks and covered by #285

Do we still need this epic now we have #285?

hardillb avatar Feb 15 '22 09:02 hardillb

Yes, this is about visualising what is being used rather than setting limits (which is what a stack does). You need to see what you are using to know how close a project is to needing to move to a larger stack.

sammachin avatar Feb 15 '22 10:02 sammachin

As admins of ffcloud we also need this view to manage the platform.

sammachin avatar Feb 15 '22 10:02 sammachin

@sammachin @hardillb Just to clarify here... a given project will have limits on CPU & Memory that we want to be able to monitor. Defined by the Project Stack. Will we only know present CPU usage, or will we store historical data?

What is our current state with Storage? Do we have limits in place for this, and could you provide some examples please?

I don't recall us having any bandwidth-based metrics as things stand?

joepavitt avatar Jul 07 '22 15:07 joepavitt

@joepavitt CPU & Memory is still to be worked out; it may vary based on environment, but assume we have some history.

There is no Storage (and I will be very opposed to adding any, as it will be a real pain to implement on Docker/K8s); the file handling nodes should all have been disabled.

I don't think we can measure bandwidth on a per project basis, so again no data.

hardillb avatar Jul 07 '22 15:07 hardillb

Thanks @hardillb - that does beg the question though: what's the value of an FF Cloud deployment to Devices if we can't use/test any of the nodes we'd want to use on devices?

joepavitt avatar Jul 07 '22 15:07 joepavitt

That is a separate question

hardillb avatar Jul 07 '22 15:07 hardillb

Noting we've had a few enquiries on this topic recently - I would label it 'observability of Node-RED instances'

knolleary avatar Mar 31 '23 13:03 knolleary

@joepavitt I added more details to this epic and would like to move this one along.

gstout52 avatar Apr 21 '25 15:04 gstout52

Thanks @gstout52 - I'll try and get some design work on this in place next week.

joepavitt avatar Apr 25 '25 09:04 joepavitt

Also related: https://github.com/FlowFuse/flowfuse/issues/4193

gstout52 avatar May 05 '25 17:05 gstout52

@hardillb can you share some insight into a high-level set of technical options we could utilise here please? Quick wins, options we could pursue (but that require more work), etc.

joepavitt avatar May 06 '25 10:05 joepavitt

We currently have the nr-launcher gather memory and CPU information in order to generate the "75% for more than 5 minutes" alerts. We do not currently have a way to expose this to the UI or to store more than the 5-minute buffer used by the nr-launcher. We poll this data at 10-second intervals.
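
A minimal sketch of the kind of rolling buffer described here (10-second polling, 5-minute window, alert when usage stays above 75%); the sampler and alert functions are placeholders, not the real nr-launcher internals.

```typescript
// Minimal sketch of a rolling sample buffer: 10-second polling, 5-minute window,
// alert when usage stays above 75% for the whole window.
// The sampler and alert functions are placeholders, not the real nr-launcher internals.
interface Sample { ts: number; cpuPercent: number; memoryPercent: number }

const INTERVAL_MS = 10_000;   // poll every 10 seconds
const WINDOW_MS = 5 * 60_000; // keep 5 minutes of samples
const THRESHOLD = 75;         // alert threshold in percent

const buffer: Sample[] = [];

// Placeholder samplers / alert hook.
const readCpuPercent = () => Math.random() * 100;
const readMemoryPercent = () => Math.random() * 100;
const sendAlert = () => console.log('resource usage above 75% for more than 5 minutes');

function record(sample: Sample): void {
  buffer.push(sample);
  // Drop anything that has fallen outside the 5-minute window.
  const cutoff = sample.ts - WINDOW_MS;
  while (buffer.length && buffer[0].ts < cutoff) buffer.shift();
}

function sustainedOverThreshold(): boolean {
  // Fire only once the window is full and every sample in it exceeds the threshold.
  return buffer.length >= WINDOW_MS / INTERVAL_MS &&
    buffer.every(s => s.cpuPercent > THRESHOLD || s.memoryPercent > THRESHOLD);
}

setInterval(() => {
  record({ ts: Date.now(), cpuPercent: readCpuPercent(), memoryPercent: readMemoryPercent() });
  if (sustainedOverThreshold()) sendAlert();
}, INTERVAL_MS);
```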

The same data (and more, including things like network and file IO information) is also gathered by us using Prometheus, by monitoring the raw pods in the Kubernetes cluster. That data is currently stored for about 40 days, so we could potentially read it to provide near-realtime (need to double check what the update rate is) and historical data.

The only problem with using the Prometheus data is that it is only available on production; self-hosted users would also need to deploy a similar data collection environment, and it may be Kubernetes-only (in theory similar data could be gathered on Docker, I think).

So the quickest win, for FFC only, would be to add a suitable Prometheus client to the UI and some charting to display the data.
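
A rough sketch of what that quick win could look like, using Prometheus's standard query_range HTTP API; the server URL, metric name, pod label, and instance name below are assumptions that depend on how the cluster scrape is actually configured.

```typescript
// Sketch: read per-pod CPU history from the existing Prometheus server via its
// query_range HTTP API and hand the points to a chart.
// The server URL, metric name, pod label, and instance name are all assumptions.
const PROMETHEUS_URL = 'http://prometheus.monitoring.svc:9090'; // placeholder address

async function fetchCpuHistory(pod: string, minutes = 60): Promise<[number, string][]> {
  const end = Math.floor(Date.now() / 1000);
  const start = end - minutes * 60;
  // Assumed metric: cAdvisor-style cumulative CPU seconds, turned into a rate.
  const query = `rate(container_cpu_usage_seconds_total{pod="${pod}"}[5m])`;
  const params = new URLSearchParams({
    query,
    start: String(start),
    end: String(end),
    step: '30' // one data point every 30 seconds
  });
  const res = await fetch(`${PROMETHEUS_URL}/api/v1/query_range?${params}`);
  const body = await res.json();
  // Prometheus returns each series as an array of [unixTimestamp, value] pairs.
  return body.data.result[0]?.values ?? [];
}

// Hypothetical usage: chart the last hour for one instance's pod.
fetchCpuHistory('instance-abc123').then(points => console.log(`${points.length} samples`));
```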

The slightly larger option would probably be to implement some storage of the 5-minute buffer from the nr-launcher; this could then also be charted and shown to the user, and it has the benefit of working for all self-hosted options as well.

I can come back and flesh these out a little more later.

hardillb avatar May 06 '25 16:05 hardillb

As an extra, we have a Prometheus data URL for each hosted NR instance, but polling this from the forge app to gather data directly for 100s (or 1000s) of instances would be a lot of overhead. The nr-launcher polls its own single NR instance, so I'd like to avoid having to poll them all again.

hardillb avatar May 06 '25 16:05 hardillb

@ZJvandeWeg I think this feature should be available for all tiers. The epic description breaks down how it can function differently by tier. The goal of this differentiation is to provide insights that improve the experience, drive upgrades for lower tiers, and provide higher-quality visibility as teams move toward Enterprise.

gstout52 avatar May 07 '25 15:05 gstout52

We currently have the nr-launcher gather memory and CPU information in order to generate the "75% for more than 5 minutes" alerts. We do not currently have a way to expose this to the UI or to store more than the 5-minute buffer used by the nr-launcher. We poll this data at 10-second intervals.

So, to clarify, right now we have CPU data for each Node-RED instance at 10 second intervals, for the last 5 minutes?

So the quickest win, for FFC only, would be to add a suitable Prometheus client to the UI and some charting to display the data.

Just so I have a better indication of Prometheus's capabilities here: would connecting a client from our UI only get live data from that point, or are we able to load historical data too?

network and file IO information

Any thoughts on the value to the developers for having this information? Also, do we have a single instance per pod, or are we going to get contamination of data across multiple instances?

joepavitt avatar May 07 '25 15:05 joepavitt

So, to clarify, right now we have CPU data for each Node-RED instance at 10 second intervals, for the last 5 minutes?

Yes, but it is not currently available outside the nr-launcher (but could be)

Just so I have a better indication of Prometheus's capabilities here: would connecting a client from our UI only get live data from that point, or are we able to load historical data too?

There is no historical data available; it is a current snapshot only (and it requires at least 2 samples, and knowing the time between them, to get useful CPU usage information).
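
For illustration, CPU usage is derived from two cumulative CPU-time readings and the interval between them; a minimal sketch follows (the sample shape here is assumed, not the actual plugin format).

```typescript
// Deriving a CPU percentage from two cumulative CPU-time readings; the sample
// shape here is assumed for illustration, not the actual plugin format.
interface CpuSample { atMs: number; cpuSeconds: number }

function cpuPercent(prev: CpuSample, curr: CpuSample): number {
  const elapsedSeconds = (curr.atMs - prev.atMs) / 1000;
  const busySeconds = curr.cpuSeconds - prev.cpuSeconds;
  return (busySeconds / elapsedSeconds) * 100;
}

// Example: roughly 1.2s of CPU consumed over a 10-second interval is ~12% utilisation.
console.log(cpuPercent({ atMs: 0, cpuSeconds: 100.0 }, { atMs: 10_000, cpuSeconds: 101.2 }));
```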

Any thoughts on the value to the developers for having this information?

Network and file IO is more useful for us running the platform than for an end user.

Also, do we have a single instance per pod, or are we going to get contamination of data across multiple instances?

Not 100% sure what you mean by this, but the data is on a per-pod basis; it should not include any other pod's information.

hardillb avatar May 07 '25 16:05 hardillb

There is no historical data available; it is a current snapshot only (and it requires at least 2 samples, and knowing the time between them, to get useful CPU usage information).

Still needing some clarification on this please. If I want to understand my CPU utilisation over the past 10 mins, how does Prometheus help with this? Or if we provide a "live" CPU view, is that something Prometheus can help with?

joepavitt avatar May 07 '25 16:05 joepavitt

Sorry, I mixed a couple of things up earlier (the confusion was between the Prometheus system as a whole and a Prometheus datasource).

2 separate things here

  • The 5 mins of historical data with 10 seconds between samples held by the nr-launcher (this gets its data from a Prometheus-format endpoint hosted via a NR plugin, but it is only polled by the nr-launcher; we should NOT try to poll it from anywhere else)
  • The Prometheus system, which polls a K8s data source (not sure what frequency off the top of my head) for all the pods to build a dataset with history. On FFC we currently have ~30 days of history. We (FF admin) can view this in Grafana at this time. This is ONLY useful on K8s, and where Prometheus is already deployed and running and we can access it.

hardillb avatar May 07 '25 16:05 hardillb

Network and file IO is more useful for us running the platform than for an end user.

As a developer of flows in a Node-RED instance, it's just going to be the CPU utilization that's important right now, right? Potentially something around storage limits for persistent storage?

joepavitt avatar May 07 '25 17:05 joepavitt

Some quick iterations on a "Performance" page which would show top-level performance of all instances for quick identification of issues.

Image

Then when diving into a specific instance, we get the historical view too:

Image

joepavitt avatar May 07 '25 18:05 joepavitt

@gstout52 @knolleary Just noticing that the work done so far has no feature flags. This is something we should be putting in, albeit I believe we've agreed that this is going to be available to all tiers because of the upsell opportunities.

joepavitt avatar May 28 '25 17:05 joepavitt

@joepavitt we need to revalidate that assumption with all stakeholders. This should be an EE feature at a minimum - not part of the CE version.

knolleary avatar May 28 '25 17:05 knolleary

This should be an EE feature at a minimum - not part of the CE version.

Yes, good point.

joepavitt avatar May 28 '25 17:05 joepavitt

@joepavitt Should this get updated to 2.19 for the memory component?

hardillb avatar Jun 09 '25 08:06 hardillb

Thanks for flagging @hardillb

joepavitt avatar Jun 09 '25 08:06 joepavitt

All sub-issues are closed here, so closing it out. Also removing "headline", as the child #5590 is already a headline feature.

joepavitt avatar Jul 02 '25 13:07 joepavitt