trino
trino copied to clipboard
Add all_files system table to the Iceberg connector
Description
- This is to add
$all_files
system table support. - Spark already supports the iceberg's
$data_files
and$all_data_files
metadata tables. Trino is already supporting$files
. -
$all_files
table will include data_files from all the available snapshots to a table where the$files
metadata table includes only the current snapshot one.
Use cases
-
This can be used when the user is debugging data issue. As in case when Trino/Spark is used for metadata/data optimization, then it can modify the metadata and data.
-
Another case is when snapshot is rolled back/future from Trino/Spark and now trying to understand what all data-files are present, and is there any implication because of the optimization/rollback operations.
-
This and
$all_manifests
can also be used to add the optimization features in Trino like purging orphan files or identifying partitions modified since last time, to implement moving window data-compaction feature. detail Where we need to identify the
Design
Adding a new class AllFilesTable
and used it as a parent for FilesTable
, as both will be implementing similar responsibility,.
Testing
Added a test case in the existing TestIcebergSystemTable
.
Thought of using rollback
to show case that $all_files
can give all the data_files, but then history was getting updated and testHistoryTable
depends on the order and operations happen in the testAllFilesTable
test. So have not used it. If we think, I can add it.
Is this change a fix, improvement, new feature, refactoring, or other? Improvement
Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific) NA
How would you describe this change to a non-technical end user or system administrator? allows user to access all the data-files referred by the table ( rolled back snapshots, old snapshots). This is helpful for debugging issues like why my query is showing certain data and not current one (in case snapshots are rolled back)
syntax:
select * from "table$all_files"
Related issues, pull requests, and links
- Fixes 11172
Documentation
( ) No documentation is needed. ( ) Sufficient documentation is included in this PR. ( ) Documentation PR is available with #prnumber. ( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required. ( ) Release notes entries required with the following suggested text:
# Section
* Adds support for `all_files` metadata table, supported by Spark as well. ({issue}`11172`)
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email
- If not, set it up using
git config --global user.email [email protected]
- Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email
- If not, set it up using
git config --global user.email [email protected]
- Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
@alexjo2144 thanks for the review! please check, the refactored code.
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email
- If not, set it up using
git config --global user.email [email protected]
- Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email
- If not, set it up using
git config --global user.email [email protected]
- Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.
closing to fix the git email issue
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to [email protected]. For more information, see https://github.com/trinodb/cla.
@cla-bot check
The cla-bot has been summoned, and re-checked this pull request!
thanks a lot @martint!
@alexjo2144 would you mind checking it, and seeing if changes are looking ok now? Also, fixed a doc formatting and TestIcebergMetastoreAccessOperations.java as well. thanks!
@findinpath @mosabua thanks for the review! will try to add the above changes today/tomorrow.
Spark already supports the iceberg's $data_files and $all_data_files metadata tables. Trino is already supporting $files.
Spark has actually $files
and $all_data_files
. I would argue that (for consistency sake and to ease up the life of the end users using both Spark SQL and Iceberg) to follow the same naming schema
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email
- If not, set it up using
git config --global user.email [email protected]
- Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
Spark already supports the iceberg's $data_files and $all_data_files metadata tables. Trino is already supporting $files.
Spark has actually
$files
and$all_data_files
. I would argue that (for consistency sake and to ease up the life of the end users using both Spark SQL and Iceberg) to follow the same naming schema
thats a good point. I'm fine with using $all_data_files
as well.
Though as we have $files
so from that perspective $all_files
looks ok.
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email
- If not, set it up using
git config --global user.email [email protected]
- Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
@electrum request to please provide your feedback, thanks!
meanwhile will review the maven-check failures.
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email
- If not, set it up using
git config --global user.email [email protected]
- Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
@electrum request to please provide your feedback, thanks!
meanwhile will review the maven-check failures.
fixed the maven failures as well, it was due to the doc formatting.
@osscm did you sign the CLA?
@osscm did you sign the CLA?
yes @findepi and @martint has updated it.
@cla-bot check
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Manish Malhotra. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email
- If not, set it up using
git config --global user.email [email protected]
- Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
The cla-bot has been summoned, and re-checked this pull request!
There's one commit that's different from CLA signer. Please update the correct author. If it was also worked on by someone else, please add them as "co-authors". doc
@phd3 thanks for helping in review! That will help to merge it soon!
Build is red
https://github.com/trinodb/trino/runs/6319061876?check_suite_focus=true#step:7:3183
Error: COMPILATION ERROR :
[3184](https://github.com/trinodb/trino/runs/6319061876?check_suite_focus=true#step:7:3184)
[INFO] -------------------------------------------------------------
[3185](https://github.com/trinodb/trino/runs/6319061876?check_suite_focus=true#step:7:3185)
Error: /home/runner/work/trino/trino/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/AbstractFilesTable.java:[51,37] [MissingOverride] getDistribution implements method in SystemTable; expected @Override