trino icon indicating copy to clipboard operation
trino copied to clipboard

Add all_files system table to the Iceberg connector

Open osscm opened this issue 2 years ago • 53 comments

Description

  • This is to add $all_files system table support.
  • Spark already supports the iceberg's $data_files and $all_data_files metadata tables. Trino is already supporting $files.
  • $all_files table will include data_files from all the available snapshots to a table where the $files metadata table includes only the current snapshot one.

Use cases

  • This can be used when the user is debugging data issue. As in case when Trino/Spark is used for metadata/data optimization, then it can modify the metadata and data.

  • Another case is when snapshot is rolled back/future from Trino/Spark and now trying to understand what all data-files are present, and is there any implication because of the optimization/rollback operations.

  • This and $all_manifests can also be used to add the optimization features in Trino like purging orphan files or identifying partitions modified since last time, to implement moving window data-compaction feature. detail Where we need to identify the

Design Adding a new class AllFilesTable and used it as a parent for FilesTable, as both will be implementing similar responsibility,.

Testing Added a test case in the existing TestIcebergSystemTable. Thought of using rollback to show case that $all_files can give all the data_files, but then history was getting updated and testHistoryTable depends on the order and operations happen in the testAllFilesTable test. So have not used it. If we think, I can add it.

Is this change a fix, improvement, new feature, refactoring, or other? Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific) NA

How would you describe this change to a non-technical end user or system administrator? allows user to access all the data-files referred by the table ( rolled back snapshots, old snapshots). This is helpful for debugging issues like why my query is showing certain data and not current one (in case snapshots are rolled back)

syntax: select * from "table$all_files"

Related issues, pull requests, and links

Documentation

( ) No documentation is needed. ( ) Sufficient documentation is included in this PR. ( ) Documentation PR is available with #prnumber. ( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required. ( ) Release notes entries required with the following suggested text:

# Section
* Adds support for `all_files` metadata table, supported by Spark as well. ({issue}`11172`)

osscm avatar Feb 26 '22 00:02 osscm

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

cla-bot[bot] avatar Feb 26 '22 00:02 cla-bot[bot]

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Mar 01 '22 06:03 cla-bot[bot]

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Mar 01 '22 06:03 cla-bot[bot]

@alexjo2144 thanks for the review! please check, the refactored code.

osscm avatar Mar 01 '22 06:03 osscm

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Mar 01 '22 09:03 cla-bot[bot]

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Mar 01 '22 09:03 cla-bot[bot]

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

cla-bot[bot] avatar Mar 01 '22 20:03 cla-bot[bot]

closing to fix the git email issue

osscm avatar Mar 01 '22 20:03 osscm

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

cla-bot[bot] avatar Mar 02 '22 08:03 cla-bot[bot]

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.

cla-bot[bot] avatar Mar 02 '22 08:03 cla-bot[bot]

@cla-bot check

martint avatar Mar 02 '22 21:03 martint

The cla-bot has been summoned, and re-checked this pull request!

cla-bot[bot] avatar Mar 02 '22 21:03 cla-bot[bot]

thanks a lot @martint!

osscm avatar Mar 03 '22 17:03 osscm

@alexjo2144 would you mind checking it, and seeing if changes are looking ok now? Also, fixed a doc formatting and TestIcebergMetastoreAccessOperations.java as well. thanks!

osscm avatar Mar 03 '22 17:03 osscm

@findinpath @mosabua thanks for the review! will try to add the above changes today/tomorrow.

osscm avatar Mar 10 '22 21:03 osscm

Spark already supports the iceberg's $data_files and $all_data_files metadata tables. Trino is already supporting $files.

Spark has actually $files and $all_data_files. I would argue that (for consistency sake and to ease up the life of the end users using both Spark SQL and Iceberg) to follow the same naming schema

findinpath avatar Mar 11 '22 06:03 findinpath

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Mar 11 '22 08:03 cla-bot[bot]

Spark already supports the iceberg's $data_files and $all_data_files metadata tables. Trino is already supporting $files.

Spark has actually $files and $all_data_files. I would argue that (for consistency sake and to ease up the life of the end users using both Spark SQL and Iceberg) to follow the same naming schema

thats a good point. I'm fine with using $all_data_files as well.

Though as we have $files so from that perspective $all_files looks ok.

osscm avatar Mar 11 '22 09:03 osscm

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Mar 12 '22 09:03 cla-bot[bot]

@electrum request to please provide your feedback, thanks!

meanwhile will review the maven-check failures.

osscm avatar Mar 16 '22 18:03 osscm

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Mar 23 '22 21:03 cla-bot[bot]

@electrum request to please provide your feedback, thanks!

meanwhile will review the maven-check failures.

fixed the maven failures as well, it was due to the doc formatting.

osscm avatar Mar 25 '22 05:03 osscm

@osscm did you sign the CLA?

findepi avatar Mar 25 '22 12:03 findepi

@osscm did you sign the CLA?

yes @findepi and @martint has updated it.

osscm avatar Mar 30 '22 00:03 osscm

@cla-bot check

phd3 avatar Apr 08 '22 18:04 phd3

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

cla-bot[bot] avatar Apr 08 '22 18:04 cla-bot[bot]

The cla-bot has been summoned, and re-checked this pull request!

cla-bot[bot] avatar Apr 08 '22 18:04 cla-bot[bot]

There's one commit that's different from CLA signer. Please update the correct author. If it was also worked on by someone else, please add them as "co-authors". doc

phd3 avatar Apr 08 '22 18:04 phd3

@phd3 thanks for helping in review! That will help to merge it soon!

osscm avatar Apr 09 '22 00:04 osscm

Build is red

https://github.com/trinodb/trino/runs/6319061876?check_suite_focus=true#step:7:3183

Error:  COMPILATION ERROR : 
[3184](https://github.com/trinodb/trino/runs/6319061876?check_suite_focus=true#step:7:3184)
[INFO] -------------------------------------------------------------
[3185](https://github.com/trinodb/trino/runs/6319061876?check_suite_focus=true#step:7:3185)
Error:  /home/runner/work/trino/trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/AbstractFilesTable.java:[51,37] [MissingOverride] getDistribution implements method in SystemTable; expected @Override

findinpath avatar May 07 '22 04:05 findinpath