[FLINK-37329][table-planner] Skip Source Stats Collection When table.optimizer.source.report-statistics-enabled is False

Open shameersss1 opened this issue 10 months ago • 12 comments

What is the purpose of the change

Currently, when "table.optimizer.source.report-statistics-enabled" is set to false, statistics collection is not disabled in all cases. When running a batch workload that reads the TPC-DS data set from Hive tables, both table and column statistics were still being collected even though "table.optimizer.source.report-statistics-enabled" was set to false.

Brief change log

Skipping stats computation in FlinkRecomputeStatisticsProgram.java when "table.optimizer.source.report-statistics-enabled" is false
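
The guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: the class `StatsGuardSketch`, the `StatsSource` interface, and the plain `Map` used in place of Flink's table configuration are all hypothetical stand-ins, and the default of `true` for the option is an assumption.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types are Flink classes.
final class StatsGuardSketch {

    // The option key named in this pull request.
    static final String REPORT_STATS_KEY =
            "table.optimizer.source.report-statistics-enabled";

    /** Hypothetical stand-in for a source that can report statistics. */
    interface StatsSource {
        long reportRowCount();
    }

    /** Reads the flag; a missing key is treated as enabled (assumed default). */
    static boolean reportStatisticsEnabled(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(REPORT_STATS_KEY, "true"));
    }

    /**
     * Returns the recomputed row count, or empty when reporting is disabled.
     * The early return means the source is never asked for statistics.
     */
    static Optional<Long> recomputeRowCount(Map<String, String> conf, StatsSource source) {
        if (!reportStatisticsEnabled(conf)) {
            return Optional.empty();
        }
        return Optional.of(source.reportRowCount());
    }
}
```

With the key set to "false", `recomputeRowCount` returns an empty `Optional` without calling the source, which is the behaviour this change aims for in the recompute step.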

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Additionally ran the following test:

[INFO]
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.flink.connector.file.table.FileSystemStatisticsReportTest
[INFO] Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.578 s -- in org.apache.flink.connector.file.table.FileSystemStatisticsReportTest
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 17, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:09 min
[INFO] Finished at: 2025-02-15T11:51:58+05:30
[INFO] ------------------------------------------------------------------------

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

shameersss1 avatar Feb 15 '25 08:02 shameersss1

CI report:

  • 3b79dd61d83e7894d806901120a66067486db35b Azure: SUCCESS
  • c79f478cf664484e608c11cc5b07c646abd8b829 UNKNOWN

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot avatar Feb 15 '25 08:02 flinkbot

@reswqa @JunRuiLee @dawidwys Could you please review the changes?

shameersss1 avatar Feb 17 '25 06:02 shameersss1

Please could you add a unit test.

Sure, I will add a unit test for this.

shameersss1 avatar Feb 20 '25 13:02 shameersss1

@davidradl - I have addressed your comments. Could you please review again?

shameersss1 avatar Feb 21 '25 04:02 shameersss1

@twalthr @JunRuiLee Could you please review the changes?

shameersss1 avatar Feb 24 '25 04:02 shameersss1

@twalthr @JunRuiLee Could you please review the changes?

Sorry @shameersss1, I am not very familiar with this part of the logic. Maybe @xuyangzhong can provide some suggestions.

JunRuiLee avatar Feb 24 '25 06:02 JunRuiLee

Thanks @JunRuiLee for the pointers. @davidradl @xuyangzhong Could you please review the changes?

shameersss1 avatar Feb 24 '25 06:02 shameersss1

@dawidwys @twalthr @xuyangzhong - Gentle reminder for review

shameersss1 avatar Feb 27 '25 12:02 shameersss1

Thanks @davidradl for the review.

@JunRuiLee - Could you please point to anyone else who knows this flow and can do the review?

shameersss1 avatar Mar 10 '25 06:03 shameersss1

getPartitionsTableStats

Thanks a lot @xuyangzhong for the review.

  1. Yes, you are correct: the config never stated that it skips stats collection from the catalog.
  2. Fetching stats from the catalog may be a good option in general, but in some cases it is better to just turn it off.
  3. In order to do that, I propose that we either reuse the same config and skip stats altogether for both source and catalog, or introduce a different config to do the same.

@xuyangzhong Any thoughts on the above?

shameersss1 avatar Mar 10 '25 12:03 shameersss1

@shameersss1 Whether we modify the scope of the current configuration or introduce a new one, it's advisable to implement the change through a FLIP, as these configurations are part of the public API.

xuyangzhong avatar Mar 11 '25 01:03 xuyangzhong

This PR is being marked as stale since it has not had any activity in the last 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out to the community, contact details can be found here: https://flink.apache.org/what-is-flink/community/

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Jun 09 '25 06:06 github-actions[bot]