flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-33681] Reuse input/output metrics of SourceOperator/SinkWriterOperator for task

Open X-czh opened this issue 7 months ago • 3 comments

What is the purpose of the change

Currently, the numRecordsIn & numBytesIn metrics for sources and the numRecordsOut & numBytesOut metrics for sinks are always 0 on the Flink web dashboard.

FLINK-11576 brings us these metrics on the opeartor level, but it does not integrate them on the task level. On the other hand, the summay metrics on the job overview page is based on the task level I/O metrics. As a result, even though new connectors supporting FLIP-33 metrics will report operator-level I/O metrics, we still cannot see the metrics on dashboard.

This MR attempts to reuse the operator-level source/sink I/O metrics for task so that they can be viewed on Flink web dashboard.

Brief change log

Reuse input/output metrics of SourceOperator/SinkWriterOperator for task.

Since Flink only accounts for internal traffic for input/output bytes metrics before, the reuse won't cause duplication in the I/O bytes metrics. Also, as the output records metric is intentionally dropped for SinkWriterOperator in OperatorChain#getOperatorRecordsOutCounter and no input records metric is collected for SourceOperatorStreamTask, no duplication in the I/O records metrics will take place.

Verifying this change

Manually run Kafka2Kafka job on a testing cluster on K8s, verified that the source/sink input/output metrics can be seen on the web dashboard.

image

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

X-czh avatar Dec 27 '23 06:12 X-czh

CI report:

  • 662dea4e946cb0f2e287dcd7048897b93d9a46aa Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Dec 27 '23 06:12 flinkbot

@affo Thanks! @huwh Could you help take a look at the PR?

X-czh avatar Apr 22 '24 02:04 X-czh

Hi @gyfora, do you think it a good way to go? Many Flink beginners have been complaining about this here and there for years. Given that we've already have the right stats in operator-level metrics, it'll be good to expose it on UI directly.

X-czh avatar May 23 '24 13:05 X-czh

@becketqin Thanks for reviewing. I'll add UTs to cover the change and I've already raised FLIP-460 for it, but no discussion is received yet. Could you help review the FLIP? cc @affo @roncohen

X-czh avatar Jul 12 '24 09:07 X-czh

@becketqin Hi, I've added test cases and rebased it onto the latest master. Could you take a look again?

X-czh avatar Jul 28 '24 22:07 X-czh