skywalking icon indicating copy to clipboard operation
skywalking copied to clipboard

Support the telegraf receiver plugin module

Open soander opened this issue 2 years ago • 10 comments

<Support the telegraf receiver plugin module>

  • [ ] If this is non-trivial feature, paste the links/URLs to the design doc.

  • [ ] Update the documentation to include this new feature.

  • [ ] Tests(including UT, IT, E2E) are added to verify the new feature.

  • [X] If it's UI related, attach the screenshots below.

  • [ ] If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.

  • [X] Update the CHANGES log.

Description

It's a new way to monitor different servers' metrics. This module is telegraf receiver plugin. It mainly used to receive, process and send telegraf's collected metrics, such as cpu, memory, disk etc.

Design

  • Modularized the telegraf receiver plugin by Java SPI
  • Defined the telegraf receiver plugin module name, provider, handler
  • Made the configuration file to set different MAL expressions

Step

  • Collected servers’ metrics such as system load, cpu usage, memory usage by InfluxDB telegraf
  • Received collected JSON format metrics in HTTP messages by RESTful API
  • Converted JSON format metrics to Skywalking's metrics format by Jackson API
  • Sent converted metrics to meter system, did the aggregation and stored it to database
  • Queried stored metrics by GraphQL and displayed on Skywalking's UI

Attention

  • The module uses HTTP to receive telegraf's metrics, so users should set [[outputs.http]] in telegraf.conf file
  • The module only processes telegraf's JSON format metrics, so users must configure data_format = "json" in telegraf.conf file
  • The default json_timestamp_units is second, the module only process second timestamp unit. If you configure json_timestamp_units, json_timestamp_units = "1s" is feasible.

Example

The following screenshot shows the detail query information. image

soander avatar Sep 16 '22 09:09 soander

Any update as has been reviewed for a week.

wu-sheng avatar Sep 21 '22 13:09 wu-sheng

@soander Could you put a summary about which are changed from last comment?

wu-sheng avatar Sep 26 '22 06:09 wu-sheng

@soander Could you put a summary about which are changed from last comment?

@wu-sheng Compared with the last comment:

Code file

  • Changed code according to suggested changes like createConfigBeanIfAbsent method in the TelegrafReceiverProvider class, ModuleStartException in the TelegrafConfigs class
  • Refactor code about converting JSON format telegraf data to Java POJO, created new class TelegrafDatum and rewrite class 'TelegrafData'
  • Removed redundant @JsonProperty annotations in the class TelegrafDatum

Configuration file

  • Add the telegraf module configuration in the application.yml to activate it

Document file

  • Update the changes.md
  • Added the Design, Step, Attention in PR description

soander avatar Sep 26 '22 06:09 soander

One important thing, please make sure your codes could pass all code checks, UT, and e2e tests before request reviewing. Otherwise, we may even read the wrong codes, which doesn't make sure to make many reviewers put time into a PR.

wu-sheng avatar Sep 26 '22 06:09 wu-sheng

One important thing, please make sure your codes could pass all code checks, UT, and e2e tests before request reviewing. Otherwise, we may even read the wrong codes, which doesn't make sure to make many reviewers put time into a PR.

I'm sorry for the inconvenience. How can I do the code checks, UT, and e2e tests locally? Thank you for your guidance!

soander avatar Sep 27 '22 02:09 soander

It is better you could ask for your mentor, as this is a Summer 2022 Program.

Generally, there are

  • All tests are controlled through GitHub Action, https://github.com/apache/skywalking/tree/master/.github/workflows
  • About e2e tests, https://skywalking.apache.org/docs/main/next/en/guides/readme/#end-to-end-tests-e2e

For merging this in upstream, testing is a very important, also time costing task. It may take more time than you wrote this feature. But this is how the open-source community works, in order to collaborate with people remotely. Be patient.

wu-sheng avatar Sep 27 '22 02:09 wu-sheng

I have contacted my mentor, and I will check documents. Maybe I made some inconvenience because I am a newbie. Anyway, thank you again and sorry for some mistakes.

soander avatar Sep 27 '22 02:09 soander

I recommend you asking for your mentor, because I think he may have more time than me to help you out with the env and setup issues. This is not an issue from community side, we just have to wait for your tests ready.

wu-sheng avatar Sep 27 '22 02:09 wu-sheng

Marked this as no update, as we can't have any feedback for two weeks. I am not sure whether this is still on going or still prepares for upstream review.

wu-sheng avatar Oct 14 '22 08:10 wu-sheng

Marked this as no update, as we can't have any feedback for two weeks. I am not sure whether this is still on going or still prepares for upstream review.

This PR are still being revised. I will help @soander to add the e2e tests ..

liuhaoyang avatar Oct 15 '22 15:10 liuhaoyang

:warning: 6 God Classes were detected by Lift in this project. Visit the Lift web console for more details.

sonatype-lift[bot] avatar Oct 17 '22 07:10 sonatype-lift[bot]

4 files left for this PR? What happens?

wu-sheng avatar Oct 20 '22 07:10 wu-sheng

4 files left for this PR? What happens?

I made a wrong push...

soander avatar Oct 20 '22 07:10 soander

What is the current status of this PR?

wu-sheng avatar Oct 20 '22 07:10 wu-sheng

Besides e2e, you should add UTs(Unit Tests) for your format transfer, it could cover more cases easily and help the reviewers to understand the codes.

wu-sheng avatar Oct 20 '22 07:10 wu-sheng

@wu-sheng I see all the e2e tests return failure, what could be causing this ?

liuhaoyang avatar Oct 21 '22 08:10 liuhaoyang

https://github.com/apache/skywalking/actions/runs/3288941909

You could find logs from this page, at the bottom of the page. Most likely, the package of this branch can't boot up.

wu-sheng avatar Oct 21 '22 08:10 wu-sheng

Has your mentor reviewed, checked and tested this?

wu-sheng avatar Oct 30 '22 02:10 wu-sheng

Has your mentor reviewed, checked and tested this?

Yes. I have reviewed and performed the tests.

liuhaoyang avatar Oct 30 '22 02:10 liuhaoyang

I still noticed there are several mentioned things not been fixed.

  1. There is no unit test to cover telegraf format converting. TelegrafServiceHandler is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases.
  2. You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.
  3. In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.

I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially, telegraf-rules/vm.yaml is included as a part of the official release.

All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.

wu-sheng avatar Oct 30 '22 15:10 wu-sheng

I still noticed there are several mentioned things not been fixed.

  1. There is no unit test to cover telegraf format converting. TelegrafServiceHandler is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases.
  2. You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.
  3. In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.

I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially, telegraf-rules/vm.yaml is included as a part of the official release.

All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.

  1. I already finish the HTTP text converting to SampleFamily unit test, I will push it later.
  2. Telegraf and OpenTelemetry are two different metrics collection methods, both of them can collect linux system metrics and display it on the UI.
  3. The metrics are collected by telegraf can also be displayed on the VM monitoring dashboard. The default telegraf MAL will convert the telegraf metrics into VM metrics.

soander avatar Oct 31 '22 00:10 soander

I still noticed there are several mentioned things not been fixed.

  1. There is no unit test to cover telegraf format converting. TelegrafServiceHandler is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases.
  2. You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.
  3. In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.

I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially, telegraf-rules/vm.yaml is included as a part of the official release.

All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.

This PR contains two features

  1. Support telegraf-receiver to receive data in telegraf json format and write it to storage
  2. Provide the default MAL config(vm.yaml), and parse the CPU, memory and system metrics reported by telegraf into vm meter, so that these metrics can be displayed on the linux monitoring page. (No disk and network metrics are provided for the time being

For 1, users can use their own MAL to parse more telegraf metrics.

liuhaoyang avatar Oct 31 '22 01:10 liuhaoyang

The document seems should be polished after the code's part is ready.

wu-sheng avatar Oct 31 '22 07:10 wu-sheng

@wankai123 @kezhenxu94 @mrproliu I received @liuhaoyang 's messages, this PR should be functional and working. Please take a deeper review.

This is his local testing. (Those two red rectangles are difference metrics from two metric sources.) image

@soander @liuhaoyang It is good to see this is getting improved, we need more time to get this ready for the upstream user. Considering this is a new protocol integration, we must be a little more cautious。

I know this may relate to Summer 2022 program somehow, if there is anything that needs my collaboration with the committee, please let me know.

wu-sheng avatar Oct 31 '22 07:10 wu-sheng

There seems to be a problem with the e2e test case of vm-telegraf, @soander can you fix it? Alse, I used the latest commit to test, and the dashboards can be displayed normally (I did not find the Filefd Allocated indicator collection plugin in telegraf, so this chart has no data display for the time being image

Waiting for your review @wu-sheng @mrproliu @kezhenxu94 , thanks again

liuhaoyang avatar Nov 03 '22 13:11 liuhaoyang

The one thing that needs to be fixed is this,

For the screenshot, the biggest issue is, SampleFamily should have a concept of the group for the bulk. We should not face one SampleFamily have metrics belonging to the same metric and labels, but with multiple timestamps.

Otherwise, the aggregation result would be strange and unpredictable.

wu-sheng avatar Nov 04 '22 03:11 wu-sheng

The others LGTM

wankai123 avatar Nov 04 '22 11:11 wankai123

One more thing, telegraf should be added as a label for our repo, https://github.com/apache/skywalking/blob/master/.asf.yaml

wu-sheng avatar Nov 04 '22 15:11 wu-sheng

@liuhaoyang Please make sure squash merge log is short and clear. You can see there are a lot of noices in the last commit.

wu-sheng avatar Nov 05 '22 03:11 wu-sheng