trino
trino copied to clipboard
Added support for sorted_by while creating iceberg table
Description
This PR is to allow to provide sorted_by
as the properties like partitioning.
The sorted field definition will follow these rules:
- Sorted field needs to be present as the valid column.
- Allow quoted identifier similar to https://github.com/trinodb/trino/pull/12227 , borrowed pattern, and 2 tests methods, thanks @findepi and @mdesmet
- Allow spaces like
sorted_by = ARRAY['c1', ' year( a_date ) ')
Is this change a fix, improvement, new feature, refactoring, or other?
improvement, new feature
Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
the Iceberg connector
How would you describe this change to a non-technical end user or system administrator?
Add sorted_by
properties while creating an Iceberg table like the Hive table.
CREATE TABLE test_table (
c1 integer,
c2 varchar,
c3 double)
WITH (
format = 'PARQUET',
sorted_by = ARRAY['c1', 'c2'],
location = '/var/my_tables/test_table')
Related issues, pull requests, and links
Fixes https://github.com/trinodb/trino/issues/12447
Documentation
( ) No documentation is needed. ( ) Sufficient documentation is included in this PR. ( ) Documentation PR is available with #prnumber. ( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required. ( ) Release notes entries required with the following suggested text:
# Section
* Fix some things. ({issue}`https://github.com/trinodb/trino/issues/12447`)
Will add changes to the doc as well.
cc @findepi @alexjo2144 thanks!
Rather than copying the code from @mdesmet 's PR, can you cherry-pick the commit from that branch and then add work in commits on top of it?
Rather than copying the code from @mdesmet 's PR, can you cherry-pick the commit from that branch and then add work in commits on top of it?
Thanks, @alexjo2144
tried that first, but the problem was @mdesmet code was changing only the Partition-related classes only like PartitionFields, TestPartitionFields, and BaseIcebergConnectorTest (changing partition-related test-code)
So, cherry-picked those commits, will add the code changes for the Partitions related, and then have to change it, as my PR will not push anything partition related.
The borrowed code is in IcebergUtil
- common methods
toIdentifier
andfromIdentifier
- and common regex
In this PR, we are adding new test methods and classes (SortFields, TestSortFields
), so we have to create a parallel to the PartitionFields
as the SortFields
.
Not sure about the process, but thought of adding @mdesmet as the co-author if that make sense.
The borrowed code is in
IcebergUtil
- common methods
toIdentifier
andfromIdentifier
- and common regex
I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to IcebergUtil
in a separate commit. That would probably make this easier to cherrypick.
The borrowed code is in
IcebergUtil
- common methods
toIdentifier
andfromIdentifier
- and common regex
I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to
IcebergUtil
in a separate commit. That would probably make this easier to cherrypick.
sure @mdesmet, I can pull only that commit which has the change only to the IcebergUtil
, thanks!
The borrowed code is in
IcebergUtil
- common methods
toIdentifier
andfromIdentifier
- and common regex
I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to
IcebergUtil
in a separate commit. That would probably make this easier to cherrypick.sure @mdesmet, I can pull only that commit which has the change only to the
IcebergUtil
, thanks!
@mdesmet https://github.com/trinodb/trino/pull/12227/commits seems to the commit only toIdentifier
and fromIdentifier
methods in IcebergUtil
is removed, as I was trying to cherry-pick that. Please see, if you have time to add it, otherwise I can go ahead with the borrowed one, and mentioned as I did. thanks
@mdesmet https://github.com/trinodb/trino/pull/12227/commits seems to the commit only
toIdentifier
andfromIdentifier
methods inIcebergUtil
is removed, as I was trying to cherry-pick that. Please see, if you have time to add it, otherwise I can go ahead with the borrowed one, and mentioned as I did. thanks
@osscm: Those commits were squashed based on the feedback in the PR. I still suggest you to cherry pick the relevant commit and reuse the provided methods in IcebergUtil
. However it's not guaranteed that you won't have to reapply the commit again it if some new feedback is raised.
This PR has been inactive for 4 weeks. What's the status here - do we need it changed, or is it waiting on that other PR for direction?
cc @ebyhr
This PR has been inactive for 4 weeks. What's the status here - do we need it changed, or is it waiting on that other PR for direction?
cc @ebyhr
@colebow I have to resume on this PR if there are any pending commits.
https://github.com/trinodb/trino/pull/12227 PR was using quote identifier in partitions, I have tried to use the commit from that PR, but it was not mergeable. As that PR is approved, May be can wait for the merge.
i merged https://github.com/trinodb/trino/pull/12227 (thanks @mdesmet !) @osscm can you please rebase?
i merged #12227 (thanks @mdesmet !) @osscm can you please rebase?
sure @findepi
it seems the PR adds ability to declare sorted_by
, but doesn't implement actual sorting.
It's a good step, but not something we would want to release, so not something we would want to merge.
@osscm do you also want to implement sorting writer? or I can probably find someone on my team to work on this, if you prefer this way.
Sure @findpi I can also work on implementing the writer logic as well.
On Mon, Aug 22, 2022 at 11:52 AM Piotr Findeisen @.***> wrote:
it seems the PR adds ability to declare sorted_by, but doesn't implement actual sorting. It's a good step, but not something we would want to release, so not something we would want to merge. @osscm https://github.com/osscm do you also want to implement sorting writer? or I can probably find someone on my team to work on this, if you prefer this way.
— Reply to this email directly, view it on GitHub https://github.com/trinodb/trino/pull/12872#issuecomment-1222782053, or unsubscribe https://github.com/notifications/unsubscribe-auth/AXQ2PYSYXYJX3SSIHNJTA2DV2PD7LANCNFSM5Y427YJQ . You are receiving this because you were mentioned.Message ID: @.***>
Sure @homar, will change it, it should be sortedOrder, and not partitioning.
On Tue, Sep 6, 2022 at 2:53 PM Homar @.***> wrote:
@.**** commented on this pull request.
In plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTableProperties.java https://github.com/trinodb/trino/pull/12872#discussion_r964210266:
@@ -121,6 +131,13 @@ public static List<String> getPartitioning(Map<String, Object> tableProperties) return partitioning == null ? ImmutableList.of() : ImmutableList.copyOf(partitioning); }
- @SuppressWarnings("unchecked")
- public static List<String> getSortOrder(Map<String, Object> tableProperties)
- {
List<String> partitioning = (List<String>) tableProperties.get(SORTED_BY_PROPERTY);
I am little confused by this naming, it is column order or something like that not partitioning, or I don't understand something
— Reply to this email directly, view it on GitHub https://github.com/trinodb/trino/pull/12872#pullrequestreview-1098298685, or unsubscribe https://github.com/notifications/unsubscribe-auth/AXQ2PYUKAVL3GMDN67NYIMTV464O7ANCNFSM5Y427YJQ . You are receiving this because you were mentioned.Message ID: @.***>
@alexjo2144 is it possible to start the review for the DDL part, meanwhile I'll add the writer part as well. It will help to get going, and get early feedback, as overall scope is also a bit bigger.
it seems the PR adds ability to declare
sorted_by
, but doesn't implement actual sorting. It's a good step, but not something we would want to release, so not something we would want to merge. @osscm do you also want to implement sorting writer? or I can probably find someone on my team to work on this, if you prefer this way.
@findepi do you think, if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer. If PR is bit bigger it becomes difficult to review and merge. What do you think.
if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer.
Yes, it makes sense to keep changes separate, as in separate commits.
However, we don't like to expose user-facing functionality that does nothing. If we merge this PR, we expose Trino Iceberg connector table property for defining sort, which is ignored by Trino. That makes Trino users confused and sad. We want to expose this to users when it actually works.
@alexjo2144 could you please help here? would you be able to implement a sorting writer on top of this PR?
if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer.
Yes, it makes sense to keep changes separate, as in separate commits.
However, we don't like to expose user-facing functionality that does nothing. If we merge this PR, we expose Trino Iceberg connector table property for defining sort, which is ignored by Trino. That makes Trino users confused and sad. We want to expose this to users when it actually works.
@alexjo2144 could you please help here? would you be able to implement a sorting writer on top of this PR?
sure @findepi
Please let me know if any changes are required to support sorted_by
for DDL statements.
Superseded by https://github.com/trinodb/trino/pull/14891