skywalking Support the telegraf receiver plugin module

<Support the telegraf receiver plugin module>

[ ] If this is non-trivial feature, paste the links/URLs to the design doc.
[ ] Update the documentation to include this new feature.
[ ] Tests(including UT, IT, E2E) are added to verify the new feature.
[X] If it's UI related, attach the screenshots below.
[ ] If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
[X] Update the CHANGES log.

Description

It's a new way to monitor different servers' metrics. This module is telegraf receiver plugin. It mainly used to receive, process and send telegraf's collected metrics, such as cpu, memory, disk etc.

Design

Modularized the telegraf receiver plugin by Java SPI
Defined the telegraf receiver plugin module name, provider, handler
Made the configuration file to set different MAL expressions

Step

Collected servers’ metrics such as system load, cpu usage, memory usage by InfluxDB telegraf
Received collected JSON format metrics in HTTP messages by RESTful API
Converted JSON format metrics to Skywalking's metrics format by Jackson API
Sent converted metrics to meter system, did the aggregation and stored it to database
Queried stored metrics by GraphQL and displayed on Skywalking's UI

Attention

The module uses HTTP to receive telegraf's metrics, so users should set [[outputs.http]] in telegraf.conf file
The module only processes telegraf's JSON format metrics, so users must configure data_format = "json" in telegraf.conf file
The default json_timestamp_units is second, the module only process second timestamp unit. If you configure json_timestamp_units, json_timestamp_units = "1s" is feasible.

Example

The following screenshot shows the detail query information.

Sep 16 '22 09:09 soander

Any update as has been reviewed for a week.

Sep 21 '22 13:09 wu-sheng

@soander Could you put a summary about which are changed from last comment?

Sep 26 '22 06:09 wu-sheng

@soander Could you put a summary about which are changed from last comment?

@wu-sheng Compared with the last comment:

Code file

Changed code according to suggested changes like createConfigBeanIfAbsent method in the TelegrafReceiverProvider class, ModuleStartException in the TelegrafConfigs class
Refactor code about converting JSON format telegraf data to Java POJO, created new class TelegrafDatum and rewrite class 'TelegrafData'
Removed redundant @JsonProperty annotations in the class TelegrafDatum

Configuration file

Add the telegraf module configuration in the application.yml to activate it

Document file

Update the changes.md
Added the Design, Step, Attention in PR description

Sep 26 '22 06:09 soander

One important thing, please make sure your codes could pass all code checks, UT, and e2e tests before request reviewing. Otherwise, we may even read the wrong codes, which doesn't make sure to make many reviewers put time into a PR.

Sep 26 '22 06:09 wu-sheng

One important thing, please make sure your codes could pass all code checks, UT, and e2e tests before request reviewing. Otherwise, we may even read the wrong codes, which doesn't make sure to make many reviewers put time into a PR.

I'm sorry for the inconvenience. How can I do the code checks, UT, and e2e tests locally? Thank you for your guidance！

Sep 27 '22 02:09 soander

It is better you could ask for your mentor, as this is a Summer 2022 Program.

Generally, there are

All tests are controlled through GitHub Action, https://github.com/apache/skywalking/tree/master/.github/workflows
About e2e tests, https://skywalking.apache.org/docs/main/next/en/guides/readme/#end-to-end-tests-e2e

For merging this in upstream, testing is a very important, also time costing task. It may take more time than you wrote this feature. But this is how the open-source community works, in order to collaborate with people remotely. Be patient.

Sep 27 '22 02:09 wu-sheng

I have contacted my mentor, and I will check documents. Maybe I made some inconvenience because I am a newbie. Anyway, thank you again and sorry for some mistakes.

Sep 27 '22 02:09 soander

I recommend you asking for your mentor, because I think he may have more time than me to help you out with the env and setup issues. This is not an issue from community side, we just have to wait for your tests ready.

Sep 27 '22 02:09 wu-sheng

Marked this as no update, as we can't have any feedback for two weeks. I am not sure whether this is still on going or still prepares for upstream review.

Oct 14 '22 08:10 wu-sheng

Marked this as no update, as we can't have any feedback for two weeks. I am not sure whether this is still on going or still prepares for upstream review.

This PR are still being revised. I will help @soander to add the e2e tests ..

Oct 15 '22 15:10 liuhaoyang

:warning: 6 God Classes were detected by Lift in this project. Visit the Lift web console for more details.

Oct 17 '22 07:10 sonatype-lift[bot]

4 files left for this PR? What happens?

Oct 20 '22 07:10 wu-sheng

4 files left for this PR? What happens?

I made a wrong push...

Oct 20 '22 07:10 soander

What is the current status of this PR?

Oct 20 '22 07:10 wu-sheng

Besides e2e, you should add UTs(Unit Tests) for your format transfer, it could cover more cases easily and help the reviewers to understand the codes.

Oct 20 '22 07:10 wu-sheng

@wu-sheng I see all the e2e tests return failure, what could be causing this ?

Oct 21 '22 08:10 liuhaoyang

https://github.com/apache/skywalking/actions/runs/3288941909

You could find logs from this page, at the bottom of the page. Most likely, the package of this branch can't boot up.

Oct 21 '22 08:10 wu-sheng

Has your mentor reviewed, checked and tested this?

Oct 30 '22 02:10 wu-sheng

Has your mentor reviewed, checked and tested this?

Yes. I have reviewed and performed the tests.

Oct 30 '22 02:10 liuhaoyang

I still noticed there are several mentioned things not been fixed.

There is no unit test to cover telegraf format converting. TelegrafServiceHandler is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases.
You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.
In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.

I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially, telegraf-rules/vm.yaml is included as a part of the official release.

All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.

Oct 30 '22 15:10 wu-sheng

I still noticed there are several mentioned things not been fixed.

There is no unit test to cover telegraf format converting. TelegrafServiceHandler is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases.

You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.

In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.

I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially, telegraf-rules/vm.yaml is included as a part of the official release.

All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.

I already finish the HTTP text converting to SampleFamily unit test, I will push it later.
Telegraf and OpenTelemetry are two different metrics collection methods, both of them can collect linux system metrics and display it on the UI.
The metrics are collected by telegraf can also be displayed on the VM monitoring dashboard. The default telegraf MAL will convert the telegraf metrics into VM metrics.

Oct 31 '22 00:10 soander

I still noticed there are several mentioned things not been fixed.

There is no unit test to cover telegraf format converting. TelegrafServiceHandler is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases.

You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.

In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.

I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially, telegraf-rules/vm.yaml is included as a part of the official release.

All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.

This PR contains two features

Support telegraf-receiver to receive data in telegraf json format and write it to storage
Provide the default MAL config(vm.yaml), and parse the CPU, memory and system metrics reported by telegraf into vm meter, so that these metrics can be displayed on the linux monitoring page. (No disk and network metrics are provided for the time being

For 1, users can use their own MAL to parse more telegraf metrics.

Oct 31 '22 01:10 liuhaoyang

The document seems should be polished after the code's part is ready.

Oct 31 '22 07:10 wu-sheng

@wankai123 @kezhenxu94 @mrproliu I received @liuhaoyang 's messages, this PR should be functional and working. Please take a deeper review.

This is his local testing. (Those two red rectangles are difference metrics from two metric sources.)

@soander @liuhaoyang It is good to see this is getting improved, we need more time to get this ready for the upstream user. Considering this is a new protocol integration, we must be a little more cautious。

I know this may relate to Summer 2022 program somehow, if there is anything that needs my collaboration with the committee, please let me know.

Oct 31 '22 07:10 wu-sheng

There seems to be a problem with the e2e test case of vm-telegraf, @soander can you fix it? Alse, I used the latest commit to test, and the dashboards can be displayed normally (I did not find the Filefd Allocated indicator collection plugin in telegraf, so this chart has no data display for the time being

Waiting for your review @wu-sheng @mrproliu @kezhenxu94 , thanks again

Nov 03 '22 13:11 liuhaoyang

The one thing that needs to be fixed is this,

For the screenshot, the biggest issue is, SampleFamily should have a concept of the group for the bulk. We should not face one SampleFamily have metrics belonging to the same metric and labels, but with multiple timestamps.

Otherwise, the aggregation result would be strange and unpredictable.

Nov 04 '22 03:11 wu-sheng

The others LGTM

Nov 04 '22 11:11 wankai123

One more thing, telegraf should be added as a label for our repo, https://github.com/apache/skywalking/blob/master/.asf.yaml

Nov 04 '22 15:11 wu-sheng

@liuhaoyang Please make sure squash merge log is short and clear. You can see there are a lot of noices in the last commit.

Nov 05 '22 03:11 wu-sheng

skywalking skywalking copied to clipboard

Support the telegraf receiver plugin module

<Support the telegraf receiver plugin module>

Description

Design

Step

Attention

Example

Code file

Configuration file

Document file

skywalking
skywalking copied to clipboard