dolphinscheduler
[DSIP-8][Metrics] Improve DolphinScheduler Monitoring
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
- Monitoring plays an essential role in software stability. However, DolphinScheduler currently offers only statistics, not metrics, which means users cannot export metrics to an external observability system to monitor their workflows, tasks, and DS performance.
- To match our slogan, "Choose good tools, Back home early. Use Right Scheduler, Sleep Tight.", we need richer metrics to improve monitoring and give our users a better experience with DolphinScheduler, especially in production environments.
- Here are the Email Thread and Proposal.
Use case
- To make the expected improvement described in the Description section happen, we could take three steps:
- List all the metrics we need classified by different parts of Dolphinscheduler, such as master, worker, api server, etc. Here's the doc link for metrics list.
- Apply the code in the right places and collect these metrics with our metrics-collection framework.
- Find a method to expose these metrics to external systems. Related: #5255
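As a rough illustration of the three steps above (list metrics per component, record them in code, expose them to an external system), here is a minimal sketch in Python. The `MetricsRegistry` class, the metric names, and the `component` tag are hypothetical and are not DolphinScheduler's actual API; they only show the shape of the idea. The rendered output follows the Prometheus text exposition format.

```python
from collections import defaultdict


class MetricsRegistry:
    """Toy in-memory counter registry (hypothetical, for illustration only)."""

    def __init__(self):
        # (metric name, sorted tag tuple) -> running count
        self._counters = defaultdict(float)

    def increment(self, name, amount=1.0, **tags):
        """Step 2: called from the code path where the event happens."""
        key = (name, tuple(sorted(tags.items())))
        self._counters[key] += amount

    def render_prometheus(self):
        """Step 3: render all counters in the Prometheus text format."""
        lines = []
        seen = set()
        for (name, tags), value in sorted(self._counters.items()):
            if name not in seen:
                lines.append(f"# TYPE {name} counter")
                seen.add(name)
            label_str = ",".join(f'{k}="{v}"' for k, v in tags)
            labels = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{name}{labels} {value}")
        return "\n".join(lines) + "\n"


# Step 1: metrics classified by component (names are made up for this sketch).
registry = MetricsRegistry()
registry.increment("ds_task_finish_total", component="worker")
registry.increment("ds_task_finish_total", component="worker")
registry.increment("ds_workflow_submit_total", component="master")

print(registry.render_prometheus())
```

A scraper such as Prometheus would read this text from an HTTP endpoint; serving it is left out to keep the sketch short.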
Action Items
Stage I
- [x] List the basic metrics for workflow / task / system and embed them in the code: #10326 #10867
- [x] Enable developers to test and debug metrics conveniently in standalone mode: #10395
- [x] Establish the naming convention for DS metrics: #10432 #10552
- [x] Add resource download related metrics for workers: #10749
- [x] Add metrics for alert server: #11131
- [ ] Add metrics for api server: #11472
- [x] Check the correctness of metrics when DS deployed with multiple masters and workers.
Stage II
- [ ] Make external monitoring system configurable and extensible.
- [ ] Add popular exporters supported by Micrometer besides Prometheus, such as CloudWatch, Datadog, StatsD, Influx, JMX, Elastic, etc. For a full list, visit the Micrometer Setup section. In addition, to provide users with a smooth experience, we should add Docker YAML files for each exporter for demo purposes.
Stage III
- [ ] Add user-configurable metrics filter: #10527
- [ ] Increase the granularity and richness of DS metrics to achieve the same or better observability than Apache Airflow: #10525
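To make the idea of a user-configurable metrics filter more concrete, here is a small sketch in Python. It assumes a deny-list of glob patterns over metric names; the `MetricsFilter` class, the patterns, and the metric names are all hypothetical, and the design actually tracked in #10527 may differ. Micrometer has a `MeterFilter` abstraction that could serve a similar purpose on the Java side.

```python
from fnmatch import fnmatchcase


class MetricsFilter:
    """Deny-list filter over metric names using glob patterns (hypothetical)."""

    def __init__(self, deny_patterns=()):
        self._deny = tuple(deny_patterns)

    def accepts(self, metric_name):
        """Return True if the metric should be kept, False if it is filtered out."""
        return not any(fnmatchcase(metric_name, p) for p in self._deny)


# Hypothetical user configuration: drop JVM metrics and per-task timers.
metrics_filter = MetricsFilter(deny_patterns=["jvm.*", "task.duration.*"])

for name in ["jvm.memory.used", "task.duration.12.34", "workflow.submit.count"]:
    print(name, "->", "kept" if metrics_filter.accepts(name) else "dropped")
```

A deny-list keeps the default behavior unchanged for users who configure nothing; an allow-list would be the stricter alternative.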
Related issues
related: #5255
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hi:
- Thank you for your feedback. We have received your issue; please wait patiently for a reply.
- To help us understand your request as quickly as possible, please provide detailed information, the version, or pictures.
- If you haven't received a reply for a long time, you can subscribe to the developer mailing list (subscription steps: https://dolphinscheduler.apache.org/en-us/community/development/subscribe.html), then write the issue URL in the email body and send your question to [email protected].
I think it would be better to include the number of threads related to worker and master execution in the monitoring.
I just updated the google doc in the Use Case section, taking some metrics into consideration.
Another thing I propose we think about is the granularity of metrics. The current metrics are general statistics, and statistics for tasks and workflows are kept separate. We may need a metric like task.duration.<workflow_id>.<task_id> to monitor vital workflows/tasks more precisely. Of course, a side effect is that the number of metrics will explode, which could lead to performance issues. To avoid this, two approaches could work:
- Provide a config option for users to switch metric generation on/off.
- Have DolphinScheduler send those metrics over UDP.
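The two mitigations above can be sketched together. The following Python example is only an illustration under stated assumptions: the `TaskMetricsSender` class, its `enabled` flag, and the default host and port are hypothetical, and the payload uses the StatsD timer line format (`<name>:<value>|ms`) as one possible UDP wire format.

```python
import socket


class TaskMetricsSender:
    """Sends per-task timing metrics as StatsD datagrams (hypothetical sketch)."""

    def __init__(self, host="127.0.0.1", port=8125, enabled=True):
        self.enabled = enabled  # user-facing on/off switch
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def format_timing(workflow_id, task_id, millis):
        # StatsD line protocol for a timer: <name>:<value>|ms
        return f"task.duration.{workflow_id}.{task_id}:{millis}|ms"

    def record_duration(self, workflow_id, task_id, millis):
        """Return True if a datagram was handed to the OS, False otherwise."""
        if not self.enabled:
            return False  # switched off: no metric is built or sent
        payload = self.format_timing(workflow_id, task_id, millis).encode("utf-8")
        try:
            # Fire-and-forget: UDP does not wait for the receiver.
            self._sock.sendto(payload, self._addr)
        except OSError:
            return False  # never let a metrics failure break task execution
        return True


disabled_sender = TaskMetricsSender(enabled=False)
print(disabled_sender.record_duration(12, 34, 1500))
print(TaskMetricsSender.format_timing(12, 34, 1500))
```

Because UDP is connectionless, the sender does not block on a slow or missing collector, at the cost of possibly losing datagrams.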
Besides, we need descriptions of the existing metrics in the official docs. #9441
@EricGao888 Hi, I closed #5255, since the existing dolphinscheduler-meter module can already expose the metrics. I will take part in this work and provide some common methods.
I think this issue is worth DSIP label. WDYT? @zhongjiajie
@devosend Hello, may I ask whether it is possible to include the three PRs of stage I in beta-2? In this way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT
> I think this issue is worth DSIP label. WDYT? @zhongjiajie
Agree with that, we should add the DSIP label for this.
@EricGao888 Could you follow the https://dolphinscheduler.apache.org/en-us/community/DSIP.html guide to make it like DSIP?
Oh, I remember you already discussed the monitoring by e-mail in https://lists.apache.org/thread/6sogjh6k7f2hv954mhn24c94l2mzwgsz, so maybe you should append a few words there and tell users we want to convert it to a DSIP now.
> @devosend Hello, may I ask whether it is possible to include the three PRs of stage I in beta-2? In this way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT
It's a good idea. But beta-2 is mainly to fix bugs and email has been sent. So I think we can release it in beta-3.
@zhongjiajie Sure, I will walk through the guide and add some follow-ups in the previous email thread : )
@devosend Makes sense to me. In that case, I'd better finish Stage II before the beta-3 release. Thx for the information~
@SbloodyS Sorry, I mistakenly clicked the unassign button. Could u plz reassign it to me? Thx! 🤣
Done.
I think we can publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. This would reduce the usage and learning cost for users, and they could also adapt the template to their needs.
I will update the docs so that users could find metrics-related docs easily.
> I think we can publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. This would reduce the usage and learning cost for users, and they could also adapt the template to their needs.
@SbloodyS I just opened an issue for the comment above. https://github.com/apache/dolphinscheduler/issues/10582
I will submit a PR to add some more metrics related to task resource and alert server sometime this week.
Great Job.
FYI, Prometheus Pushgateway is also supported by Micrometer:
https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#actuator.metrics.export.prometheus
BTW, the StatsD registry eagerly pushes metrics over UDP to a StatsD agent:
https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#actuator.metrics.export.statsd
For some metrics generated (built) during runtime, these two approaches may work.
Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. How about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~
I think it's better to put them into the next version. We are about to release 3.0.0-release, and during this time we only want to cherry-pick bug-fix PRs.
Sure, makes sense to me. Thx~