
Observability of Node-RED instances

Open sammachin opened this issue 3 years ago • 10 comments

Description

Display and record/chart the resources used by each project, e.g.:

  • CPU
  • Memory
  • Storage (do flows have access to any persistent storage?)
  • Bandwidth (volume and throughput)

Users should be able to view this for any project where they are a member of the relevant team; admins should be able to view all projects.

Greg Stoutenburg edit:

I've created this mockup to get us moving on how we can surface resource usage factors: https://www.figma.com/board/VfWsq55uv2ltUE4fVZW0Hy/Observability-Improvements?node-id=0-1&p=f&t=Qm8e1zTXCtLWHXsr-0

Tasks for improving observability:

  • Create Application Health area. This will live within each Application and surface usage facts for CPU, Memory, Storage, and Bandwidth. (Anything missing?) Charts will show overall application health on each of these, as well as recent historical performance. Particular items of concern will be listed and linked to, along with a notice that performance issues can be resolved by upgrading instance size, rearranging a flow across multiple instances, or editing the flow to use fewer resources.

  • Create Team Health area. This will be a dashboard that provides overall resource usage facts across all applications within a Team, and links to Applications that need attention. Add number of MQTT messages sent/received. Should anything else be added?

  • For both Application Health and Team Health, present a flag that shows the number of items of concern.

  • Determine what should count as "Good", "Concern", and "Warning/Danger". We should present Application and Team health in ways that are clear, intuitive, and actionable. We should be opinionated about Instance health and not support users running instances that are performing poorly, as this is a bad user experience and can reflect poorly on the product. For these reasons, show a resource in the "Concern" band when it may be on the way to performing sub-optimally, even if it has not yet crossed the level that merits an alert or "Warning/Danger" status. (A rough sketch of how these bands could be encoded follows this list.)
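
As a very rough illustration of how these bands might be encoded: the band names come from the task above, but the threshold values and function shape below are assumptions for discussion, not agreed behaviour.

```typescript
// Hypothetical health-band classifier; the band names come from the task above,
// but the threshold values are illustrative assumptions, not agreed numbers.
type HealthStatus = 'good' | 'concern' | 'warning';

interface Thresholds {
  concern: number; // fraction of the stack limit at which we flag "Concern"
  warning: number; // fraction at which we flag "Warning/Danger" and alert
}

const DEFAULT_THRESHOLDS: Thresholds = { concern: 0.6, warning: 0.75 };

function classify(usage: number, limit: number, t: Thresholds = DEFAULT_THRESHOLDS): HealthStatus {
  const ratio = usage / limit;
  if (ratio >= t.warning) return 'warning';
  if (ratio >= t.concern) return 'concern';
  return 'good';
}

// Example: an instance using 1.3 GB of a 2 GB memory limit lands in the "Concern" band.
console.log(classify(1.3, 2)); // "concern"
```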

Tasks for improving notifications about observability:

  • All Tiers: whenever an instance crashes or hits a Warning/Danger threshold (currently >75% resource usage), send an email and present an alert in the Notification Hub. Have this alert direct back to the Application Health page for the application where the instance resides.

  • All Tiers: each week, send an email to all users to check their Team Health area. Frame it as being about maintaining good performance on their Node-RED instances.

  • On the Free and Starter tiers, send both an email and notification hub message whenever an instance has a status of Concern or Warning. Prevent disabling these emails.

  • On Team and Enterprise, send alerts for Warning status. Provide an option in settings to deliver weekly Team Health notifications via email or the Notification Hub, or to unsubscribe from them.

  • On Team and Enterprise, provide an option in settings to receive a weekly notification via email or the Notification Hub informing the user about peak usage for all resources across all applications. (A configuration sketch of these per-tier rules follows this list.)
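
A hedged configuration sketch of the per-tier rules above: the tier names mirror the epic text, but the shape of the policy object and every value in it are illustrative assumptions, not agreed behaviour.

```typescript
// Hypothetical per-tier notification policy. The tier names mirror the epic text;
// the shape of this object and all values in it are illustrative, not agreed behaviour.
type Tier = 'free' | 'starter' | 'team' | 'enterprise';
type Channel = 'email' | 'notificationHub';
type AlertEvent = 'crash' | 'concern' | 'warning';

interface NotificationPolicy {
  alertOn: AlertEvent[];          // which events trigger an alert
  channels: Channel[];            // where the alert is delivered
  weeklyTeamHealthEmail: boolean; // weekly "check your Team Health" digest
  canUnsubscribe: boolean;        // whether the user may disable these alerts
}

const POLICIES: Record<Tier, NotificationPolicy> = {
  free:       { alertOn: ['crash', 'concern', 'warning'], channels: ['email', 'notificationHub'], weeklyTeamHealthEmail: true, canUnsubscribe: false },
  starter:    { alertOn: ['crash', 'concern', 'warning'], channels: ['email', 'notificationHub'], weeklyTeamHealthEmail: true, canUnsubscribe: false },
  team:       { alertOn: ['crash', 'warning'],            channels: ['email', 'notificationHub'], weeklyTeamHealthEmail: true, canUnsubscribe: true },
  enterprise: { alertOn: ['crash', 'warning'],            channels: ['email', 'notificationHub'], weeklyTeamHealthEmail: true, canUnsubscribe: true }
};

// Example: should a Team-tier user be alerted when an instance enters "Concern"? No.
function shouldNotify(tier: Tier, event: AlertEvent): boolean {
  return POLICIES[tier].alertOn.includes(event);
}
console.log(shouldNotify('team', 'concern')); // false
```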

Related:

  • https://github.com/FlowFuse/flowfuse/issues/2755

sammachin avatar Jan 24 '22 08:01 sammachin

Limits for this (on platforms that support it) will be controlled by Project Stacks and covered by #285

Do we still need this epic now we have #285?

hardillb avatar Feb 15 '22 09:02 hardillb

Yes, this is about visualising what is being used rather than setting limits (which is what a stack does). You need to see what you are using to know how close a project is to needing to move to a larger stack.

sammachin avatar Feb 15 '22 10:02 sammachin

As admins of ffcloud we also need this view to manage the platform.

sammachin avatar Feb 15 '22 10:02 sammachin

@sammachin @hardillb Just to clarify here... a given project will have limits on CPU & Memory that we want to be able to monitor. Defined by the Project Stack. Will we only know present CPU usage, or will we store historical data?

What is our current state with Storage? Do we have limits in place for this, and could you provide some examples please?

I don't recall us having any bandwidth-based metrics as things stand?

joepavitt avatar Jul 07 '22 15:07 joepavitt

@joepavitt CPU & Memory is still to be worked out; it may vary based on environment, but assume we have some history.

There is no Storage (and I will be very opposed to adding any, as it will be a real pain to implement on Docker/K8s); the file handling nodes should all have been disabled.

I don't think we can measure bandwidth on a per project basis, so again no data.

hardillb avatar Jul 07 '22 15:07 hardillb

Thanks @hardillb - that does beg the question though: what's the value of an FF Cloud deployment to Devices if we can't use/test any of the nodes we'd want to use on devices?

joepavitt avatar Jul 07 '22 15:07 joepavitt

That is a separate question

hardillb avatar Jul 07 '22 15:07 hardillb

Noting we've had a few enquiries on this topic recently - I would label it 'observability of Node-RED instances'

knolleary avatar Mar 31 '23 13:03 knolleary

@joepavitt I added more details to this epic and would like to move this one along.

gstout52 avatar Apr 21 '25 15:04 gstout52

Thanks @gstout52 - I'll try and get some design work on this in place next week.

joepavitt avatar Apr 25 '25 09:04 joepavitt

Also related: https://github.com/FlowFuse/flowfuse/issues/4193

gstout52 avatar May 05 '25 17:05 gstout52

@hardillb can you share some insight into a high-level set of technical options we could utilise here please? Quick wins, options we could pursue (but that require more work), etc.

joepavitt avatar May 06 '25 10:05 joepavitt

We currently have the nr-launcher gather memory and CPU information in order to generate the "75% for more than 5 minutes" alerts. We do not currently have a way to expose this to the UI or to store more than the 5-minute buffer used by the nr-launcher. We poll this data at 10-second intervals.
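
A minimal sketch of the kind of rolling buffer described here (10-second polling, 5-minute window, alert when usage stays above 75%); the sampler and alert functions are placeholders, not the real nr-launcher internals.

```typescript
// Minimal sketch of a rolling sample buffer: 10-second polling, 5-minute window,
// alert when usage stays above 75% for the whole window.
// The sampler and alert functions are placeholders, not the real nr-launcher internals.
interface Sample { ts: number; cpuPercent: number; memoryPercent: number }

const INTERVAL_MS = 10_000;   // poll every 10 seconds
const WINDOW_MS = 5 * 60_000; // keep 5 minutes of samples
const THRESHOLD = 75;         // alert threshold in percent

const buffer: Sample[] = [];

// Placeholder samplers / alert hook.
const readCpuPercent = () => Math.random() * 100;
const readMemoryPercent = () => Math.random() * 100;
const sendAlert = () => console.log('resource usage above 75% for more than 5 minutes');

function record(sample: Sample): void {
  buffer.push(sample);
  // Drop anything that has fallen outside the 5-minute window.
  const cutoff = sample.ts - WINDOW_MS;
  while (buffer.length && buffer[0].ts < cutoff) buffer.shift();
}

function sustainedOverThreshold(): boolean {
  // Fire only once the window is full and every sample in it exceeds the threshold.
  return buffer.length >= WINDOW_MS / INTERVAL_MS &&
    buffer.every(s => s.cpuPercent > THRESHOLD || s.memoryPercent > THRESHOLD);
}

setInterval(() => {
  record({ ts: Date.now(), cpuPercent: readCpuPercent(), memoryPercent: readMemoryPercent() });
  if (sustainedOverThreshold()) sendAlert();
}, INTERVAL_MS);
```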

The same data (and more, including things like network and file IO information) is also gathered by us using Prometheus, by monitoring the raw pods in the Kubernetes cluster. That data is currently stored for about 40 days, so we could potentially read it to provide near-realtime (need to double check what the update rate is) and historical data.

The only problem with using the Prometheus data is that it is only available on production; self-hosted users would also need to deploy a similar data collection environment, and it may be Kubernetes-only (in theory similar data could be gathered on Docker, I think).

So the quickest win, for FFC only, would be to add a suitable Prometheus client to the UI and some charting to display the data.
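
A rough sketch of what that quick win could look like, using Prometheus's standard query_range HTTP API; the server URL, metric name, pod label, and instance name below are assumptions that depend on how the cluster scrape is actually configured.

```typescript
// Sketch: read per-pod CPU history from the existing Prometheus server via its
// query_range HTTP API and hand the points to a chart.
// The server URL, metric name, pod label, and instance name are all assumptions.
const PROMETHEUS_URL = 'http://prometheus.monitoring.svc:9090'; // placeholder address

async function fetchCpuHistory(pod: string, minutes = 60): Promise<[number, string][]> {
  const end = Math.floor(Date.now() / 1000);
  const start = end - minutes * 60;
  // Assumed metric: cAdvisor-style cumulative CPU seconds, turned into a rate.
  const query = `rate(container_cpu_usage_seconds_total{pod="${pod}"}[5m])`;
  const params = new URLSearchParams({
    query,
    start: String(start),
    end: String(end),
    step: '30' // one data point every 30 seconds
  });
  const res = await fetch(`${PROMETHEUS_URL}/api/v1/query_range?${params}`);
  const body = await res.json();
  // Prometheus returns each series as an array of [unixTimestamp, value] pairs.
  return body.data.result[0]?.values ?? [];
}

// Hypothetical usage: chart the last hour for one instance's pod.
fetchCpuHistory('instance-abc123').then(points => console.log(`${points.length} samples`));
```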

The slightly larger option would probably be to implement some storage of the 5-minute buffer from the nr-launcher; this could then also be charted and shown to the user, and it has the benefit of working for all self-hosted options as well.

I can come back and flesh these out a little more later.

hardillb avatar May 06 '25 16:05 hardillb

As an extra, we have a Prometheus data URL for each hosted NR instance, but polling this from the forge app to gather data directly for 100s (or 1000s) of instances would be a lot of overhead. The nr-launcher polls its own single NR instance, so I'd like to avoid having to poll them all again.

hardillb avatar May 06 '25 16:05 hardillb

@ZJvandeWeg I think this feature should be available for all tiers. The epic description breaks down how it can function differently by tier. The goal of this differentiation is to provide insights that improve the experience, drive upgrades for lower tiers, and provide higher-quality visibility as teams move toward Enterprise.

gstout52 avatar May 07 '25 15:05 gstout52

We currently have the nr-launcher gather memory and CPU information in order to generate the "75% for more than 5 minutes" alerts. We do not currently have a way to expose this to the UI or to store more than the 5-minute buffer used by the nr-launcher. We poll this data at 10-second intervals.

So, to clarify, right now we have CPU data for each Node-RED instance at 10 second intervals, for the last 5 minutes?

So the quickest win, for FFC only, would be to add a suitable Prometheus client to the UI and some charting to display the data.

Just so I have a better indication of Prometheus's capabilities here: would connecting a client from our UI only get live data from that point, or are we able to load historical data too?

network and file IO information

Any thoughts on the value to the developers for having this information? Also, do we have a single instance per pod, or are we going to get contamination of data across multiple instances?

joepavitt avatar May 07 '25 15:05 joepavitt

So, to clarify, right now we have CPU data for each Node-RED instance at 10 second intervals, for the last 5 minutes?

Yes, but it is not currently available outside the nr-launcher (but could be)

Just so I have a better indication of Prometheus's capabilities here: would connecting a client from our UI only get live data from that point, or are we able to load historical data too?

There is no historical data available; it is a current snapshot only (and it requires at least 2 samples, and knowing the time between them, to get useful CPU usage information).
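
For illustration, CPU usage is derived from two cumulative CPU-time readings and the interval between them; a minimal sketch follows (the sample shape here is assumed, not the actual plugin format).

```typescript
// Deriving a CPU percentage from two cumulative CPU-time readings; the sample
// shape here is assumed for illustration, not the actual plugin format.
interface CpuSample { atMs: number; cpuSeconds: number }

function cpuPercent(prev: CpuSample, curr: CpuSample): number {
  const elapsedSeconds = (curr.atMs - prev.atMs) / 1000;
  const busySeconds = curr.cpuSeconds - prev.cpuSeconds;
  return (busySeconds / elapsedSeconds) * 100;
}

// Example: roughly 1.2s of CPU consumed over a 10-second interval is ~12% utilisation.
console.log(cpuPercent({ atMs: 0, cpuSeconds: 100.0 }, { atMs: 10_000, cpuSeconds: 101.2 }));
```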

Any thoughts on the value to the developers for having this information?

Network and file IO is more useful for us running the platform than for an end user.

Also, do we have a single instance per pod, or are we going to get contamination of data across multiple instances?

Not 100% sure what you mean by this, but the data is on a per-pod basis; it should not include any other pod's information.

hardillb avatar May 07 '25 16:05 hardillb

There is no historical data available; it is a current snapshot only (and it requires at least 2 samples, and knowing the time between them, to get useful CPU usage information).

Still needing some clarification on this please. If I want to understand my CPU utilisation over the past 10 mins, how does Prometheus help with this? Or if we provide a "live" CPU view, is that something Prometheus can help with?

joepavitt avatar May 07 '25 16:05 joepavitt

Sorry, I mixed a couple of things up earlier (the confusion was between the Prometheus system as a whole and a Prometheus datasource).

2 separate things here

  • The 5 mins of historical data with 10 seconds between samples held by the nr-launcher (this gets its data from a Prometheus-format endpoint hosted via a NR plugin, but it is only polled by the nr-launcher; we should NOT try to poll it from anywhere else)
  • The Prometheus system, which polls a K8s data source (not sure what frequency off the top of my head) for all the pods to build a dataset with history. On FFC we currently have ~30 days of history. We (FF admin) can view this in Grafana at this time. This is ONLY useful on K8s, and where Prometheus is already deployed and running and we can access it.

hardillb avatar May 07 '25 16:05 hardillb

Network and file IO is more useful for us running the platform than for an end user.

As a developer of flows in a Node-RED instance, it's just going to be the CPU utilization that's important right now, right? Potentially something around storage limits for persistent storage?

joepavitt avatar May 07 '25 17:05 joepavitt

Some quick iterations on a "Performance" page which would show top-level performance of all instances for quick identification of issues.

Image

Then when diving into a specific instance, we get the historical view too:

Image

joepavitt avatar May 07 '25 18:05 joepavitt

@gstout52 @knolleary Just noticing that the work done so far has no feature flags. This is something we should be putting in, albeit I believe we've agreed that this is going to be available to all tiers because of the upsell opportunities.

joepavitt avatar May 28 '25 17:05 joepavitt

@joepavitt we need to revalidate that assumption with all stakeholders. This should be an EE feature at a minimum - not part of the CE version.

knolleary avatar May 28 '25 17:05 knolleary

This should be an EE feature at a minimum - not part of the CE version.

Yes, good point.

joepavitt avatar May 28 '25 17:05 joepavitt

@joepavitt Should this get updated to 2.19 for the memory component?

hardillb avatar Jun 09 '25 08:06 hardillb

Thanks for flagging @hardillb

joepavitt avatar Jun 09 '25 08:06 joepavitt

All sub-issues are closed here, so closing it out. Also removing "headline", as the child #5590 is already a headline feature.

joepavitt avatar Jul 02 '25 13:07 joepavitt