skywalking
skywalking copied to clipboard
Support the telegraf receiver plugin module
<Support the telegraf receiver plugin module>
-
[ ] If this is non-trivial feature, paste the links/URLs to the design doc.
-
[ ] Update the documentation to include this new feature.
-
[ ] Tests(including UT, IT, E2E) are added to verify the new feature.
-
[X] If it's UI related, attach the screenshots below.
-
[ ] If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #
. -
[X] Update the
CHANGES
log.
Description
It's a new way to monitor different servers' metrics. This module is telegraf receiver plugin. It mainly used to receive, process and send telegraf's collected metrics, such as cpu, memory, disk etc.
Design
- Modularized the telegraf receiver plugin by Java SPI
- Defined the telegraf receiver plugin module name, provider, handler
- Made the configuration file to set different MAL expressions
Step
- Collected servers’ metrics such as system load, cpu usage, memory usage by InfluxDB telegraf
- Received collected
JSON
format metrics in HTTP messages by RESTful API - Converted JSON format metrics to Skywalking's metrics format by Jackson API
- Sent converted metrics to meter system, did the aggregation and stored it to database
- Queried stored metrics by GraphQL and displayed on Skywalking's UI
Attention
- The module uses
HTTP
to receive telegraf's metrics, so users should set[[outputs.http]]
in telegraf.conf file - The module only processes telegraf's
JSON
format metrics, so users must configuredata_format = "json"
in telegraf.conf file - The default
json_timestamp_units
issecond
, the module only processsecond
timestamp unit. If you configurejson_timestamp_units
,json_timestamp_units = "1s"
is feasible.
Example
The following screenshot shows the detail query information.
Any update as has been reviewed for a week.
@soander Could you put a summary about which are changed from last comment?
@soander Could you put a summary about which are changed from last comment?
@wu-sheng Compared with the last comment:
Code file
- Changed code according to suggested changes like
createConfigBeanIfAbsent
method in theTelegrafReceiverProvider
class,ModuleStartException
in theTelegrafConfigs
class - Refactor code about converting JSON format telegraf data to Java POJO, created new class
TelegrafDatum
and rewrite class 'TelegrafData' - Removed redundant
@JsonProperty
annotations in the classTelegrafDatum
Configuration file
- Add the telegraf module configuration in the
application.yml
to activate it
Document file
- Update the
changes.md
- Added the
Design
,Step
,Attention
in PR description
One important thing, please make sure your codes could pass all code checks, UT, and e2e tests before request reviewing. Otherwise, we may even read the wrong codes, which doesn't make sure to make many reviewers put time into a PR.
One important thing, please make sure your codes could pass all code checks, UT, and e2e tests before request reviewing. Otherwise, we may even read the wrong codes, which doesn't make sure to make many reviewers put time into a PR.
I'm sorry for the inconvenience. How can I do the code checks, UT, and e2e tests locally? Thank you for your guidance!
It is better you could ask for your mentor, as this is a Summer 2022 Program.
Generally, there are
- All tests are controlled through GitHub Action, https://github.com/apache/skywalking/tree/master/.github/workflows
- About e2e tests, https://skywalking.apache.org/docs/main/next/en/guides/readme/#end-to-end-tests-e2e
For merging this in upstream, testing is a very important, also time costing task. It may take more time than you wrote this feature. But this is how the open-source community works, in order to collaborate with people remotely. Be patient.
I have contacted my mentor, and I will check documents. Maybe I made some inconvenience because I am a newbie. Anyway, thank you again and sorry for some mistakes.
I recommend you asking for your mentor
, because I think he may have more time than me to help you out with the env and setup issues.
This is not an issue from community side, we just have to wait for your tests ready.
Marked this as no update
, as we can't have any feedback for two weeks. I am not sure whether this is still on going or still prepares for upstream review.
Marked this as
no update
, as we can't have any feedback for two weeks. I am not sure whether this is still on going or still prepares for upstream review.
This PR are still being revised. I will help @soander to add the e2e tests ..
:warning: 6 God Classes were detected by Lift in this project. Visit the Lift web console for more details.
4 files left for this PR? What happens?
4 files left for this PR? What happens?
I made a wrong push...
What is the current status of this PR?
Besides e2e, you should add UTs(Unit Tests) for your format transfer, it could cover more cases easily and help the reviewers to understand the codes.
@wu-sheng I see all the e2e tests return failure, what could be causing this ?
https://github.com/apache/skywalking/actions/runs/3288941909
You could find logs from this page, at the bottom of the page. Most likely, the package of this branch can't boot up.
Has your mentor reviewed, checked and tested this?
Has your mentor reviewed, checked and tested this?
Yes. I have reviewed and performed the tests.
I still noticed there are several mentioned things not been fixed.
- There is no unit test to cover telegraf format converting.
TelegrafServiceHandler
is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases. - You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.
- In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.
I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially, telegraf-rules/vm.yaml
is included as a part of the official release.
All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.
I still noticed there are several mentioned things not been fixed.
- There is no unit test to cover telegraf format converting.
TelegrafServiceHandler
is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases.- You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.
- In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.
I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially,
telegraf-rules/vm.yaml
is included as a part of the official release.All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.
- I already finish the HTTP text converting to SampleFamily unit test, I will push it later.
- Telegraf and OpenTelemetry are two different metrics collection methods, both of them can collect linux system metrics and display it on the UI.
- The metrics are collected by telegraf can also be displayed on the VM monitoring dashboard. The default telegraf MAL will convert the telegraf metrics into VM metrics.
I still noticed there are several mentioned things not been fixed.
- There is no unit test to cover telegraf format converting.
TelegrafServiceHandler
is a HTTP handler. What it received is an HTTP text, what it looks like, and how it could be converted to SampleFamily should be well tested, including legal text or illegal text cases.- You don't explain what is the relationship between this VM monitoring and OpenTelemetry's Linux monitoring.
- In backend-vm-monitoring.md, you mentioned, there are two ways to support VM monitoring. But they seem not the same, such as not using the same metric name. So, how the dashboard would work? In the document, there is not any document mentioning that.
I still can't follow the whole picture of this feature implementation. Could any of you explain it? We can't deliver this to the community with this confusion. Especially,
telegraf-rules/vm.yaml
is included as a part of the official release.All default activated MAL scripts match with the dashboard settings, as well as the document description about the data flow.
This PR contains two features
- Support telegraf-receiver to receive data in telegraf json format and write it to storage
- Provide the default MAL config(
vm.yaml
), and parse the CPU, memory and system metrics reported by telegraf into vm meter, so that these metrics can be displayed on the linux monitoring page. (No disk and network metrics are provided for the time being
For 1, users can use their own MAL to parse more telegraf metrics.
The document seems should be polished after the code's part is ready.
@wankai123 @kezhenxu94 @mrproliu I received @liuhaoyang 's messages, this PR should be functional and working. Please take a deeper review.
This is his local testing. (Those two red rectangles are difference metrics from two metric sources.)
@soander @liuhaoyang It is good to see this is getting improved, we need more time to get this ready for the upstream user. Considering this is a new protocol integration, we must be a little more cautious。
I know this may relate to Summer 2022 program somehow, if there is anything that needs my collaboration with the committee, please let me know.
There seems to be a problem with the e2e test case of vm-telegraf, @soander can you fix it?
Alse, I used the latest commit to test, and the dashboards can be displayed normally (I did not find the Filefd Allocated indicator collection plugin in telegraf, so this chart has no data display for the time being
Waiting for your review @wu-sheng @mrproliu @kezhenxu94 , thanks again
The one thing that needs to be fixed is this,
For the screenshot, the biggest issue is, SampleFamily should have a concept of the group for the bulk. We should not face one SampleFamily have metrics belonging to the same metric and labels, but with multiple timestamps.
Otherwise, the aggregation result would be strange and unpredictable.
The others LGTM
One more thing, telegraf
should be added as a label for our repo, https://github.com/apache/skywalking/blob/master/.asf.yaml
@liuhaoyang Please make sure squash merge log is short and clear. You can see there are a lot of noices in the last commit.