trino icon indicating copy to clipboard operation
trino copied to clipboard

Iceberg split generation rate limit

Open marton-bod opened this issue 1 year ago • 44 comments

Description

Hive, Hudi and DeltaLake connectors already use the same approach implemented in this PR, whereby they first asynchronously load (stream) splits into an AsyncQueue (which can be configured with throttling settings), and then when getNextBatch() is periodically called from the SplitSource, splits are borrowed from this queue.

The primary aim is to control the rate at which splits are fetched for scheduling, thereby protecting the storage layer from getting overwhelmed by requests (e.g. namenodes in HDFS). This is achieved by a newly-introduced config flag (already present in Hive, Hudi and Delta): iceberg.max-splits-per-second.

Additional context and related issues

Solves the issue: https://github.com/trinodb/trino/issues/17974

The logic for filtering out splits should be unchanged: the pruning logic related to dynamic filter is in getNextBatch() since DF needs to be recomputed periodically. The rest of the pruning logic we can be applied to a split even before enqueueing it, therefore we can perform those checks within acceptScanTask() when we create the split stream.

Note: FileAwareScanTask is a new class introduced to keep metadata for FileScanTasks regarding their parent files. This is needed to replicate the current logic which stops the iteration if we hit enough records already for a LIMIT clause.

Release notes

( ) This is not user-visible or docs only and no release notes are required. ( ) Release notes are required, please propose a release note for me. (X) Release notes are required, with the following suggested text:

# Iceberg connector
* Introduce new config property to rate limit split generation in Iceberg table scans: iceberg.max-splits-per-second.

marton-bod avatar Jul 10 '23 14:07 marton-bod

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Jul 10 '23 20:07 cla-bot[bot]

@alexjo2144 @findinpath Just a friendly bump, I would really appreciate it if you could review this PR when you get the chance. Thank you!

marton-bod avatar Jul 17 '23 17:07 marton-bod

Given that this is modelled of implementation in other connectors this seems like a good approach. Could you potentially also look @electrum or @findepi

mosabua avatar Jul 18 '23 00:07 mosabua

Thanks a lot @mosabua for taking a first look. @electrum / @findepi please let me know if you have any thoughts, I do appreciate any input. Thanks a lot!

marton-bod avatar Jul 20 '23 10:07 marton-bod

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Jul 26 '23 16:07 cla-bot[bot]

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Jul 27 '23 14:07 cla-bot[bot]

@findepi Thanks so much for your time to review this. I've tried answering your questions as best as I could. Please let me know if you have any other questions/improvement ideas. Thanks!

marton-bod avatar Jul 27 '23 14:07 marton-bod

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Aug 01 '23 11:08 cla-bot[bot]

thanks @findepi for the help with the review!

we will also be running tpcds benchmarking with this change and share results here. Internally we have not added a flag, so rate limit is the default behavior for iceberg, as we use in Hive.

osscm avatar Aug 01 '23 17:08 osscm

Thanks for the reviews @findepi. Other connectors also don't seem to have a toggle for rate limit vs no rate limit. We can share the results from our TPCDS runs as well. cc @electrum @Parth-Brahmbhatt Would appreciate reviews from you too and if you think this should be behind a feature flag. Thanks!

vgankidi avatar Aug 01 '23 19:08 vgankidi

We have run TPC-DS 10000 (iceberg-parquet files in S3) with both our internal fork of 421, as well as 421 + this patch, and the results were identical (<1% difference) cc @findepi @electrum

marton-bod avatar Aug 04 '23 10:08 marton-bod

@findepi @ebyhr @electrum just a friendly check-in to see if you could please take another look at this PR? Would greatly appreciate any comments, opinions or guidance from all you. Thanks a lot!

marton-bod avatar Aug 11 '23 10:08 marton-bod

@findepi @ebyhr @electrum just a friendly check-in to see if you could please take another look at this PR?

vgankidi avatar Sep 12 '23 22:09 vgankidi

Do we need commits squashed or something else done to progress this PR @findepi @electrum @martint ?

mosabua avatar Sep 12 '23 22:09 mosabua

@mosabua Thanks a lot for the ping! I have rebased the PR and squashed the commits so it's ready to go again. If the tagged maintainers are too busy, is there anyone else we can involve?

marton-bod avatar Sep 25 '23 18:09 marton-bod

@raunaqmorarka @findepi I have rebased to the latest master - can you please give this a review? It's been inactive for some time now so would be great to make some progress. Thank you!

marton-bod avatar Oct 25 '23 09:10 marton-bod

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Nov 07 '23 17:11 cla-bot[bot]

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Nov 07 '23 17:11 cla-bot[bot]

@findepi @ebyhr @electrum @raunaqmorarka just a friendly check-in to see if you could review the latest changes?

vgankidi avatar Nov 16 '23 07:11 vgankidi

I've had some conversations offline with @raunaqmorarka , who is currently on vacation but will check the PR once he's back next week.

marton-bod avatar Nov 16 '23 14:11 marton-bod

@raunaqmorarka I've addressed your latest comments - please take another look to see if it's good to go now. Thanks! The only test failure is unrelated (TestHiveTransactionalTable > testReadFullAcid)

marton-bod avatar Dec 05 '23 16:12 marton-bod

@raunaqmorarka Thanks again for reviewing this PR. I've addressed your latest comments. The tests are green, so it's ready for another pass. Couple of notes:

  1. I have not turned num-outstanding-splits into a config yet, just increased its value to 1000. Let me know if you prefer to keep it as a static variable or a config. We can always introduce a config in a follow up PR too, if needed.
  2. I have refactored the helper commits as per your suggestion into 3 separate commits (one for extracting a single method from IcebergSplitSource) - this should be much easier to follow now.

Please let me know if you have further comments. Thank you!

marton-bod avatar Dec 12 '23 15:12 marton-bod

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Jan 18 '24 17:01 github-actions[bot]

I assume you are still working on this @marton-bod and @raunaqmorarka ... let me know if you need any help from others.

mosabua avatar Jan 18 '24 17:01 mosabua

I believe we need a re-review here.

vgankidi avatar Jan 29 '24 21:01 vgankidi

@findepi @raunaqmorarka @electrum could you help out here? also cc @brandylove

mosabua avatar Jan 29 '24 22:01 mosabua

This needs a rebase to latest master first

raunaqmorarka avatar Jan 30 '24 05:01 raunaqmorarka

@raunaqmorarka I have rebased to the latest master. Please take a look again, it should be good to go now. Thanks a lot!

marton-bod avatar Jan 30 '24 17:01 marton-bod

@raunaqmorarka I've rebased this patch once again. Please take a look when you have a chance - thank you.

marton-bod avatar Feb 09 '24 13:02 marton-bod

Can anyone please review this? I've been having to rebase this patch again and again for a very long time now. Thank you.

marton-bod avatar Feb 14 '24 12:02 marton-bod