
Telemetry Collection

Open NicolasMrad opened this issue 1 year ago • 10 comments

Summary

We need to implement well-documented telemetry collection in the open source (OS) stack. The data to be collected is usage-related, not personal.

Current Situation

No telemetry exists

Why do we need this? Who uses it, and when?

To gain some insight on the usage of the stack by OS users.

Proposed Implementation

Components need to be able to provide their own telemetry separately (so GS, NS, AS, IS and JS separately), but it should be combined into one call to an API endpoint. Again, only usage data will be collected; no personal (PII) data will be stored or collected. The opt-out option should be well documented, and once implemented, this should be highlighted in the changelog. This needs to be implemented in a minor release.

Contributing

  • [ ] I can help by doing more research.
  • [ ] I can help by implementing the feature after the proposal above is approved.
  • [ ] I can help by testing the feature before it's released.


NicolasMrad avatar Jul 19 '22 08:07 NicolasMrad

@adriansmares @johanstokking @KrishnaIyer @htdvisser

I'm thinking that this would involve either using something like OpenTelemetry or using the existing metrics package, which generates Prometheus data. I wanted to get everyone's opinion on what data should be handled.

I imagine the main metrics would be something related to the following (a rough sketch follows the list):

  • Request time
  • Memory used
  • Error counter
  • Panic counter
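A rough sketch of how such metrics could be declared with the Prometheus Go client (prometheus/client_golang); the metric and label names here are hypothetical, and the stack's own metrics package may wrap this differently:

```go
package telemetry

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical operational metrics, labelled by route and method so they can
// later be broken down per route and per method. Memory usage is already
// exposed by the Go runtime collector that client_golang registers by default.
var (
	requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "tts_request_duration_seconds",
		Help: "Time spent handling a request.",
	}, []string{"route", "method"})

	errorCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "tts_request_errors_total",
		Help: "Requests that returned an error.",
	}, []string{"route", "method"})

	panicCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "tts_panics_total",
		Help: "Recovered panics.",
	})
)

func init() {
	prometheus.MustRegister(requestDuration, errorCounter, panicCounter)
}
```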

I'm also assuming that the idea is to have the data accessible in a way that makes it possible to discern the data per route and per method: describing the data of a route overall, but with the possibility of specifying/requesting data for a specific method on that route, like:

  • /route -> overall metrics
    • methodGet -> metrics regarding this method.
    • methodPut -> metrics regarding this method.
      • RegistryInteraction -> data regarding this section.

As an observation, I'm not familiar with OpenTelemetry and therefore don't know for certain how feasible it is to have the data modeled in this manner. I think it should be somewhat attainable, since the docs describe the data collection as span-based.
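Purely as an illustration of the span-based model, a minimal sketch with the OpenTelemetry Go API; the span names follow the hypothetical route/method/section hierarchy above:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handlePut sketches how nested spans could model the
// /route -> methodPut -> RegistryInteraction hierarchy above.
func handlePut(ctx context.Context) error {
	tracer := otel.Tracer("tts/telemetry-example")

	ctx, routeSpan := tracer.Start(ctx, "/route")
	defer routeSpan.End()

	ctx, methodSpan := tracer.Start(ctx, "methodPut")
	defer methodSpan.End()

	ctx, registrySpan := tracer.Start(ctx, "RegistryInteraction")
	registrySpan.SetAttributes(attribute.String("registry", "device"))
	defer registrySpan.End()

	_ = ctx // the actual registry call would happen here
	return nil
}
```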

Besides the data to be stored in the telemetry, there is the topic of its usage. Should it be enabled by default? I personally think it should be disabled in the default config but enabled in the community version; that way, people running TTS on their own server don't have to worry about this.

nicholaspcr avatar Jul 19 '22 12:07 nicholaspcr

Current Situation

No telemetry exists

This is not true at all. There is already a lot of telemetry in The Things Stack. See https://www.thethingsindustries.com/docs/reference/telemetry/ for details.

From what I understand, what's asked here is to push some of that telemetry to some global service that aggregates this information across all deployments of The Things Stack (depending on opt-in / opt-out of course).

Before we start thinking about the implementation, I think the most important question is: what exactly needs to be shared, and why? Because in my opinion, most of the telemetry that The Things Stack currently collects (in deployments other than our own) is of no interest to us (and most of it is also none of our business).

Why do we care about request time and memory usage in deployments other than our own? We have nothing to do with the SLAs of those deployments, nor the operations and scaling of their servers. Errors and panics from other deployments can already be shared with us by configuring our Sentry DSN, but what errors do we expect to catch that we won't already catch in our own deployments?

So let's first come up with a short list of maybe 10 metrics, where we clearly explain what is measured and why.

htdvisser avatar Jul 19 '22 13:07 htdvisser

I might have misunderstood the initial issue to be discussed and indeed the question that you pointed out makes more sense.

nicholaspcr avatar Jul 19 '22 13:07 nicholaspcr

Okay I wasn't entirely clear when I asked @NicolasMrad to file an issue.

What I meant is telemetry about how The Things Stack is used as a self-managed deployment. That gives us, the maintainers, insight into how the product is used. We might want to make this opt-out on two levels: collecting telemetry (for us) and being part of public aggregated telemetry (for everyone).

Examples say more than a thousand words:

  1. Number of gateways connected vs registered
  2. Number of end devices active vs registered, grouped by LoRaWAN version, band and activation mode
  3. Maximum number of end devices in a single application
  4. Number of user accounts
  5. Number of active integrations (webhooks, active MQTT clients)
  6. TTS version

johanstokking avatar Jul 19 '22 20:07 johanstokking

For some more inspiration, Syncthing has some really nice public aggregated telemetry: https://data.syncthing.net/

Some things that I would be interested in:

  • How many deployments are on each version?
  • How many deployments use a certain OS / architecture? This will for example help us decide if we need to keep supporting architectures such as ARMv6, or if we should provide GOAMD64=v2/v3/v4 builds for better performance on modern CPUs.
  • How many deployments use a certain distribution (package manager)? This will for example help us decide if we need to keep spending/wasting time on constantly fixing the Snap build.
  • What languages do users of the deployments prefer (based on browser headers)? This will help us prioritize Console translation work.
  • Feature usage is something we currently don't measure, but it would be really interesting to see how often features are used, so that we can decide whether we need to invest more time in making them better, or whether we should consider removing features that aren't used (anymore). Right now we may be able to see if/how features are used in our own hosted deployments, but we simply don't know whether usage on those deployments is representative of all The Things Stack usage, or whether self-hosted deployments are a real blind spot.

htdvisser avatar Jul 20 '22 09:07 htdvisser

Right, these suggestions also clearly demonstrate the value that this sort of telemetry brings to the maintainers.

Just to be clear and for future reference: we are not going to collect any personally identifiable information, or any user data in general, and the purpose will never be to reach out with commercial offers. If we were to do the latter, it would be opt-in when "registering" the TTS deployment and signing up for promotions.


Implementation-wise, I think we should first consider whether we can use Prometheus interfaces for this. This way, we would only need to implement a Prometheus exporter that uploads certain metrics to an API endpoint. The exporter would filter existing metrics against a list of telemetry metrics. This way, we leverage existing metrics and can easily add new metrics without introducing a new type system for that.

@htdvisser @nicholaspcr what do you think?

johanstokking avatar Jul 20 '22 13:07 johanstokking

Implementation-wise, I think we should first consider whether we can use Prometheus interfaces for this. This way, we would only need to implement a Prometheus exporter that uploads certain metrics to an API endpoint. The exporter would filter existing metrics against a list of telemetry metrics. This way, we leverage existing metrics and can easily add new metrics without introducing a new type system for that.

Implementation-wise, I think that is indeed what makes the most sense. I'll look a bit more into creating the Prometheus exporter this week and try to think of anything that could cause a problem, but nothing comes to mind at the moment.

I also quite like the idea @htdvisser gave of monitoring feature usage. Regarding other metrics that might generate this sort of insight, I'm still researching how other OS projects do it; if I find other ideas, I'll write them here to get everyone's opinion.
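A minimal sketch of what such an exporter could look like, assuming the existing metrics live in the default Prometheus registry; the metric names in the allowlist and the collection endpoint are hypothetical:

```go
package telemetry

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/common/expfmt"
)

// allowlist holds the (hypothetical) names of metrics allowed to leave the cluster.
var allowlist = map[string]bool{
	"ttn_lw_gateways_connected":     true,
	"ttn_lw_end_devices_registered": true,
}

// Export gathers all metrics from the default registry, keeps only the
// allowlisted families, and POSTs them in Prometheus text format to the
// collection endpoint.
func Export(endpoint string) error {
	families, err := prometheus.DefaultGatherer.Gather()
	if err != nil {
		return err
	}
	var buf bytes.Buffer
	for _, mf := range families {
		if !allowlist[mf.GetName()] {
			continue
		}
		if _, err := expfmt.MetricFamilyToText(&buf, mf); err != nil {
			return err
		}
	}
	resp, err := http.Post(endpoint, "text/plain", &buf)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("telemetry upload failed: %s", resp.Status)
	}
	return nil
}
```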

nicholaspcr avatar Jul 20 '22 14:07 nicholaspcr

I don't know if piggybacking on Prometheus is the best approach. Prometheus counters are raw data, and I think the type of telemetry we want to have should already be aggregated to some extent. I also don't think we want to end up in a situation where we can't change our operational metrics (from Prometheus to OpenTelemetry) without breaking the aggregated telemetry.

So let's indeed take a look at what's out there and what solutions other open source projects use.

htdvisser avatar Jul 21 '22 08:07 htdvisser

@NicolasMrad can you set up a meeting to discuss this further?

johanstokking avatar Sep 06 '22 12:09 johanstokking

Updating the issue with what was discussed in the meeting on 08/09/2022.

Regarding the data to be collected in the open source version, the objective is to be somewhat concise about what is being collected (a rough payload sketch follows the TTS list below).

TTS

  • Unique ID derived from a hash of the config URLs

  • binary version / stack version
  • os architecture
  • Number of gateways registered:
    • grouped by frequency plan ID
  • Number of end devices in total
  • Number of activated end devices
  • Number of active end devices in the last:
    • 24 hours
    • 7 days
    • 30 days
  • Number of applications
  • Number of user accounts
    • number of admins
    • number of standard users
  • Number of organization accounts
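To make the shape of the report concrete, a hedged Go sketch of what such a payload could look like; the field names and JSON tags are purely illustrative, not the final wire format:

```go
package telemetry

// EntityCounts is an illustrative report covering the metrics listed above.
type EntityCounts struct {
	UID              string         `json:"uid"`     // hash of the config URLs
	Version          string         `json:"version"` // binary / stack version
	OSArch           string         `json:"os_arch"` // e.g. linux/amd64
	GatewaysByFP     map[string]int `json:"gateways_by_frequency_plan"`
	EndDevices       int            `json:"end_devices"`
	ActivatedDevices int            `json:"activated_end_devices"`
	ActiveDevices    struct {
		Last24h int `json:"last_24h"`
		Last7d  int `json:"last_7d"`
		Last30d int `json:"last_30d"`
	} `json:"active_end_devices"`
	Applications  int `json:"applications"`
	Admins        int `json:"admin_users"`
	StandardUsers int `json:"standard_users"`
	Organizations int `json:"organizations"`
}
```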

Future work

  • Number of end devices active vs registered, grouped by LoRaWAN version, band and activation mode
  • Maximum number of end devices in a single application
  • Number of gateways broken down by frequency plan

CLI

  • binary version / stack version
  • os architecture

Observations

Explanation of terms used in the list above:

  • An entity registered simply means that it exists in the database.
  • An entity activated means that the DB has its active field set to true.
  • An entity active means that the value of its last_seen/last_updated is relatively recent.

The data described should be somewhat simple to fetch, meaning it should either already be on the IS or be easy to fetch from existing methods provided by the stack or the standard library.

Regarding the implementation of the information collector, it was suggested to make it more maintainable by using AWS Lambda functions and other managed functionality, instead of managing the container of a new application. More details about the collector will be added to the issue later, after I read more on the subject.

nicholaspcr avatar Sep 08 '22 15:09 nicholaspcr

The data described should be somewhat simple to fetch, meaning it should either already be on the IS or be easy to fetch from existing methods provided by the stack or the standard library.

I don't think we should limit telemetry collection to IS. I think we should have component registerers (like we have for services) that can produce arbitrary key/value telemetry in their component namespace. For example, a key metric is number of gateways connected which isn't observable by IS.
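A rough sketch of what such a component registerer could look like (all names are hypothetical, not an existing TTS API); each component would register a provider and the results would be merged per component namespace before upload:

```go
package telemetry

import "context"

// Provider is implemented by every component (GS, NS, AS, IS, JS, ...)
// that wants to contribute telemetry in its own namespace.
type Provider interface {
	// Namespace returns the component namespace, e.g. "gs".
	Namespace() string
	// Collect returns arbitrary key/value telemetry for this component,
	// e.g. {"gateways_connected": 42} for the Gateway Server.
	Collect(ctx context.Context) (map[string]any, error)
}

var providers []Provider

// Register adds a component telemetry provider.
func Register(p Provider) { providers = append(providers, p) }

// CollectAll merges the telemetry of all registered providers, keyed by namespace.
func CollectAll(ctx context.Context) (map[string]map[string]any, error) {
	out := make(map[string]map[string]any, len(providers))
	for _, p := range providers {
		values, err := p.Collect(ctx)
		if err != nil {
			return nil, err
		}
		out[p.Namespace()] = values
	}
	return out, nil
}
```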

Regarding the implementation of the information collector, it was suggested to make it more maintainable by using AWS Lambda functions and other managed functionality, instead of managing the container of a new application. More details about the collector will be added to the issue later, after I read more on the subject.

This shouldn't be platform specific. I don't understand why it would be more maintainable if we have deployment specific runners if we already have task infrastructure in TTS. TTSOS should be able to produce telemetry and upload it to us.


TTS can run in multiple instances. Even though we don't document that for OS, it is certainly possible to have separate containers for TTS components. With TTSE this is more common. These would all produce their own telemetry, and we must be able to correlate this to one cluster. How do we do that?

Today we don't have a unique key to correlate instances to one cluster. For TTSE we could hash the license key. For TTSOS we may not bother with this too much (as we don't document it) and maybe correlate to origin IP.

johanstokking avatar Oct 14 '22 08:10 johanstokking

I don't think we should limit telemetry collection to IS. I think we should have component registerers (like we have for services) that can produce arbitrary key/value telemetry in their component namespace. For example, a key metric is number of gateways connected which isn't observable by IS.

Agreed. This is probably me not conveying properly what was discussed in the meeting: I remember that the idea of these topics is to be a base, a starting point of sorts, for the implementation of telemetry in the OS version. These metrics should be somewhat easy to collect, which is why the initial focus is on the IS-related metrics.

Regarding the implementation of the information collector, it was suggested to make it more maintainable by using AWS Lambda functions and other managed functionality, instead of managing the container of a new application. More details about the collector will be added to the issue later, after I read more on the subject.

This shouldn't be platform specific. I don't understand why it would be more maintainable if we have deployment specific runners if we already have task infrastructure in TTS. TTSOS should be able to produce telemetry and upload it to us.

The implementation in this case would be of the data receiver (poorly described as the collector in my previous comment). The metrics would be generated by TTS and sent to the receiver, which would be a Lambda function (I still have to read up on this).

TTS can run in multiple instances. Even though we don't document that for OS, it is certainly possible to have separate containers for TTS components. With TTSE this is more common. These would all produce their own telemetry, and we must be able to correlate this to one cluster. How do we do that?

Today we don't have a unique key to correlate instances to one cluster. For TTSE we could hash the license key. For TTSOS we may not bother with this too much (as we don't document it) and maybe correlate to origin IP.

Making the unique ID a hash of the config URLs would indeed make the metrics not attributable to a single cluster in OS, since we don't have a unique value that is shared between each component and distinct from other deployments. As I imagine the number of people who run their own deployment of each TTS component separately is marginal, I think it's appropriate to keep the config URL hash idea for OS and use the license key hash for Enterprise.
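For reference, a minimal sketch of how such a deployment identifier could be derived; the exact inputs (config URLs for OS, license key for Enterprise) are assumptions based on the discussion above:

```go
package telemetry

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// DeploymentID derives a stable, non-reversible identifier from the
// configured URLs; for Enterprise, the license key could be hashed instead.
// Sorting makes the result independent of the order of the inputs.
func DeploymentID(configURLs ...string) string {
	sort.Strings(configURLs)
	sum := sha256.Sum256([]byte(strings.Join(configURLs, "\n")))
	return hex.EncodeToString(sum[:])
}
```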

nicholaspcr avatar Oct 18 '22 22:10 nicholaspcr

Writing in here to update the status of the issue.

With PR #6021 approved, the defined fields are collected in the Stack and the CLI. The missing steps for the full telemetry collection flow are currently as follows:

  • [x] Telemetry collector - a Lambda function that writes directly into the corresponding table in DynamoDB. For example, the IS's entity count telemetry task will have its data saved into entity_count, which should have the date as a secondary index (a rough sketch follows this list).
  • [x] Daily Sweeper - a script that runs daily and fetches yesterday's entries for each piece of telemetry information present in DynamoDB (which is why the date is used as an index).
  • [ ] Graph generator - responsible for generating images for each piece of telemetry data collected; should build the images on top of the daily sums produced by the sweeper.
  • [ ] Web page on thethingsnetwork.org which shows all the generated graphs
    • Should be as simple as possible: just the title of the graph, and the image itself should include the labels and legend for its own data.
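Purely for illustration, a hedged sketch of what such a collector Lambda could look like with the AWS Go SDK; the table name, payload shape and handler are all hypothetical, and as noted in the closing comment below, parts of this design were later replaced by Timestream and Grafana:

```go
package main

import (
	"context"
	"encoding/json"
	"time"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// report is an illustrative incoming telemetry payload.
type report struct {
	UID      string         `json:"uid"`
	Entities map[string]int `json:"entities"`
}

// handle writes one item per report; the date attribute doubles as the
// secondary index that the daily sweeper queries.
func handle(ctx context.Context, r report) error {
	svc := dynamodb.New(session.Must(session.NewSession()))
	entities, err := json.Marshal(r.Entities)
	if err != nil {
		return err
	}
	_, err = svc.PutItemWithContext(ctx, &dynamodb.PutItemInput{
		TableName: aws.String("entity_count"), // hypothetical table name
		Item: map[string]*dynamodb.AttributeValue{
			"uid":      {S: aws.String(r.UID)},
			"date":     {S: aws.String(time.Now().UTC().Format("2006-01-02"))},
			"entities": {S: aws.String(string(entities))},
		},
	})
	return err
}

func main() {
	lambda.Start(handle)
}
```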

nicholaspcr avatar Mar 27 '23 19:03 nicholaspcr

Closing this issue in favour of the one present in the product management repository.

The last update, made in March, does not reflect the current implementation, as the Daily Sweeper and Graph generator were discarded in favour of TimestreamDB and Grafana. The issue linked in the management repository is more up to date and will therefore serve as a better umbrella issue.

nicholaspcr avatar Jul 04 '23 11:07 nicholaspcr

Ref: https://github.com/TheThingsIndustries/product-management/issues/11

@nicholaspcr: Please remember to link the issue that an issue replaces for tracking.

KrishnaIyer avatar Jul 12 '23 12:07 KrishnaIyer

I didn't reference the comment because there was a link pointing to it right above. Nevertheless, next time, I will reference the issue.

nicholaspcr avatar Jul 12 '23 12:07 nicholaspcr