[DSIP-8][Metrics] Improve DolphinScheduler Monitoring

Open EricGao888 opened this issue 3 years ago • 24 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

  • Monitoring plays an essential role in software stability. However, DolphinScheduler currently offers only statistics, not metrics, which means users cannot export metrics to an external observability system to monitor their workflows, tasks, and DS performance.
  • To live up to our slogan, "Choose good tools, Back home early. Use Right Scheduler, Sleep Tight.", we need richer metrics to strengthen monitoring and give our users a better experience with DolphinScheduler, especially in production environments.
  • Here are the Email Thread and Proposal.

Use case

  • To achieve the improvement described in the Description section, we could take three steps:
  1. List all the metrics we need, classified by DolphinScheduler component, such as master, worker, API server, etc. Here's the doc link for the metrics list.
  2. Add instrumentation code in the right places and collect these metrics with our metrics-collection framework.
  3. Find a way to expose these metrics to external systems. related: #5255

Action Items

Stage I

  • [x] List the basic metrics for workflow / task / system and embed them in the code: #10326 #10867
  • [x] Enable developers to test and debug metrics conveniently in standalone mode: #10395
  • [x] Establish the naming convention for DS metrics: #10432 #10552
  • [x] Add resource download related metrics for workers: #10749
  • [x] Add metrics for alert server: #11131
  • [ ] Add metrics for api server: #11472
  • [x] Check the correctness of metrics when DS is deployed with multiple masters and workers.

Stage II

  • [ ] Make external monitoring system configurable and extensible.
  • [ ] Add popular exporters supported by Micrometer besides Prometheus, such as CloudWatch, Datadog, StatsD, Influx, JMX, Elastic, etc. For a full list, visit the Micrometer Setup section. In addition, to give users a smooth experience, we should add Docker YAML files for each exporter for demo purposes.

Stage III

  • [ ] Add user-configurable metrics filter: #10527
  • [ ] Increase the granularity and richness of DS metrics to achieve the same or better observability than Apache Airflow: #10525

Related issues

related: #5255

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

Code of Conduct

EricGao888 avatar Apr 02 '22 03:04 EricGao888

Hi:

  • Thank you for your feedback. We have received your issue; please wait patiently for a reply.
  • To help us understand your request as soon as possible, please provide detailed information, the version, or pictures.
  • If you haven't received a reply for a long time, you can subscribe to the developer mailing list (subscription steps: https://dolphinscheduler.apache.org/en-us/community/development/subscribe.html), then write the issue URL in the email body and send your question to [email protected].

github-actions[bot] avatar Apr 02 '22 03:04 github-actions[bot]

I think it's better to include the number of threads used for execution on the worker and master in the monitoring.

SbloodyS avatar Apr 02 '22 03:04 SbloodyS

I just updated the Google doc in the Use case section, taking some metrics into consideration.

Another thing I propose we think about is the granularity of metrics. The current metrics are general statistics, and the statistics for tasks and workflows are kept separate. We may need metrics like task.duration.<workflow_id>.<task_id> to monitor vital workflows/tasks more precisely. Of course, a side effect is that we would generate an explosive number of metrics, which could lead to performance issues. To avoid this, two approaches could work:

  1. Provide a config option that lets users switch metric generation on or off.
  2. Have DolphinScheduler send those metrics over UDP.
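
The first option above can be illustrated with a minimal, JDK-only sketch. The class name FineGrainedMetrics, the boolean switch, and the in-memory map are assumptions made for illustration; they are not DolphinScheduler's actual metrics code, which is built on Micrometer. The sketch always records the general statistic and only creates the per-workflow/per-task series when the switch is on.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: a switchable recorder for fine-grained task durations. */
class FineGrainedMetrics {
    // Assumed config flag; when false, no per-task series are ever created.
    private final boolean fineGrainedEnabled;
    // Metric name -> accumulated duration in milliseconds.
    private final Map<String, LongAdder> totalMillis = new ConcurrentHashMap<>();

    FineGrainedMetrics(boolean fineGrainedEnabled) {
        this.fineGrainedEnabled = fineGrainedEnabled;
    }

    /** Builds a name of the form task.duration.<workflow_id>.<task_id>. */
    static String metricName(long workflowId, long taskId) {
        return "task.duration." + workflowId + "." + taskId;
    }

    void recordTaskDuration(long workflowId, long taskId, long millis) {
        // The general statistic is always recorded.
        add("task.duration", millis);
        // The per-workflow/per-task series is recorded only if switched on,
        // which bounds the number of distinct metrics when it is off.
        if (fineGrainedEnabled) {
            add(metricName(workflowId, taskId), millis);
        }
    }

    private void add(String name, long millis) {
        totalMillis.computeIfAbsent(name, k -> new LongAdder()).add(millis);
    }

    /** Returns the accumulated milliseconds for a metric, or 0 if it was never recorded. */
    long total(String name) {
        LongAdder adder = totalMillis.get(name);
        return adder == null ? 0L : adder.sum();
    }

    /** Number of distinct metric names created so far. */
    int metricCount() {
        return totalMillis.size();
    }
}
```

With the switch off, recording any number of tasks creates a single metric name; with it on, each distinct workflow/task pair adds one more, which is the cardinality growth described above.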

EricGao888 avatar Apr 17 '22 11:04 EricGao888

Besides, we need descriptions of the existing metrics in the official docs. #9441

EricGao888 avatar Apr 17 '22 11:04 EricGao888

@EricGao888 Hi, I closed #5255, since there is already a module, dolphinscheduler-meter, that can expose the metrics. I will take part in this work to provide some common methods.

ruanwenjun avatar May 31 '22 10:05 ruanwenjun

I think this issue deserves the DSIP label. WDYT? @zhongjiajie

SbloodyS avatar Jun 21 '22 09:06 SbloodyS

@devosend Hello, may I ask whether it is possible to include the three PRs of stage I in beta-2? In this way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT

EricGao888 avatar Jun 21 '22 12:06 EricGao888

I think this issue deserves the DSIP label. WDYT? @zhongjiajie

Agreed, we should add the DSIP label for this.

zhongjiajie avatar Jun 22 '22 01:06 zhongjiajie

@EricGao888 Could you follow the guide at https://dolphinscheduler.apache.org/en-us/community/DSIP.html to turn it into a DSIP?

zhongjiajie avatar Jun 22 '22 01:06 zhongjiajie

@EricGao888 Could you follow the guide at https://dolphinscheduler.apache.org/en-us/community/DSIP.html to turn it into a DSIP?

Oh, I remember you already discussed monitoring by e-mail in https://lists.apache.org/thread/6sogjh6k7f2hv954mhn24c94l2mzwgsz, so maybe you should add a few words there to tell users we now want to convert it to a DSIP.

zhongjiajie avatar Jun 22 '22 01:06 zhongjiajie

@devosend Hello, may I ask whether it is possible to include the three PRs of stage I in beta-2? In this way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT

It's a good idea, but beta-2 is mainly for bug fixes and the email has already been sent, so I think we can release it in beta-3.

devosend avatar Jun 22 '22 02:06 devosend

@EricGao888 Could you follow the guide at https://dolphinscheduler.apache.org/en-us/community/DSIP.html to turn it into a DSIP?

Oh, I remember you already discussed monitoring by e-mail in https://lists.apache.org/thread/6sogjh6k7f2hv954mhn24c94l2mzwgsz, so maybe you should add a few words there to tell users we now want to convert it to a DSIP.

@zhongjiajie Sure, I will walk through the guide and add some follow-ups in the previous email thread : )

EricGao888 avatar Jun 22 '22 02:06 EricGao888

@devosend Hello, may I ask whether it is possible to include the three PRs of stage I in beta-2? In this way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT

It's a good idea, but beta-2 is mainly for bug fixes and the email has already been sent, so I think we can release it in beta-3.

@devosend Makes sense to me. In that case, I'd better finish Stage II before the beta-3 release. Thx for the information~

EricGao888 avatar Jun 22 '22 02:06 EricGao888

@SbloodyS Sorry, I mistakenly clicked the unassign button. Could u plz reassign it to me? Thx! 🤣

EricGao888 avatar Jun 22 '22 03:06 EricGao888

@SbloodyS Sorry, I mistakenly clicked the unassign button. Could u plz reassign it to me? Thx! 🤣

Done.

SbloodyS avatar Jun 22 '22 03:06 SbloodyS

I think we can publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would reduce users' adoption and learning costs, and users could also customize it based on the template.

SbloodyS avatar Jun 23 '22 10:06 SbloodyS

I think we can publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would reduce users' adoption and learning costs, and users could also customize it based on the template.

I will update the docs so that users can find metrics-related docs easily.

EricGao888 avatar Jun 23 '22 10:06 EricGao888

I think we can publish a Grafana dashboard template at https://grafana.com/grafana/dashboards/ for users to use directly. That would reduce users' adoption and learning costs, and users could also customize it based on the template.

@SbloodyS I just opened an issue for the comment above. https://github.com/apache/dolphinscheduler/issues/10582

EricGao888 avatar Jun 23 '22 11:06 EricGao888

I will submit a PR to add some more metrics related to task resources and the alert server sometime this week.

EricGao888 avatar Jun 28 '22 01:06 EricGao888

I will submit a PR to add some more metrics related to task resources and the alert server sometime this week.

Great Job.

lgcareer avatar Jun 28 '22 08:06 lgcareer

FYI, Prometheus Pushgateway is also supported by Micrometer: https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#actuator.metrics.export.prometheus

BTW, the StatsD registry eagerly pushes metrics over UDP to a StatsD agent: https://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#actuator.metrics.export.statsd

For metrics that are created dynamically at runtime, these two approaches may work.
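
The UDP push that StatsD relies on can be sketched with the JDK alone. This is only an illustration of the wire format (`<name>:<value>|ms` for a timing) and of fire-and-forget delivery; the class and method names are hypothetical, and in practice Micrometer's StatsD registry would handle this rather than hand-written code.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of pushing one StatsD timing over UDP. */
class StatsdUdpSketch {

    /** Formats a StatsD timing line: <name>:<value>|ms */
    static String timingLine(String name, long millis) {
        return name + ":" + millis + "|ms";
    }

    /**
     * Sends one line as a single UDP datagram. There is no acknowledgement:
     * if no agent is listening, the metric is silently lost, which is the
     * trade-off that keeps the sender cheap.
     */
    static void send(String line, InetAddress host, int port) throws IOException {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length, host, port));
        }
    }
}
```

A caller would invoke something like `StatsdUdpSketch.send(StatsdUdpSketch.timingLine("task.duration.7.42", 120), InetAddress.getLoopbackAddress(), 8125)`, where the metric name and port 8125 (a common StatsD default) are assumptions.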

EricGao888 avatar Jul 04 '22 07:07 EricGao888

Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. How about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

EricGao888 avatar Jul 25 '22 00:07 EricGao888

Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. How about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

I think it's better to put them into the next version. We are about to release 3.0.0-release, and during this time we only want to cherry-pick bug-fix PRs.

caishunfeng avatar Jul 25 '22 02:07 caishunfeng

Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. How about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

I think it's better to put them into the next version. We are about to release 3.0.0-release, and during this time we only want to cherry-pick bug-fix PRs.

Sure, makes sense to me. Thx~

EricGao888 avatar Jul 25 '22 02:07 EricGao888