trino icon indicating copy to clipboard operation
trino copied to clipboard

Added support for sorted_by while creating iceberg table

Open osscm opened this issue 2 years ago • 10 comments

Description

This PR is to allow to provide sorted_by as the properties like partitioning.

The sorted field definition will follow these rules:

  1. Sorted field needs to be present as the valid column.
  2. Allow quoted identifier similar to https://github.com/trinodb/trino/pull/12227 , borrowed pattern, and 2 tests methods, thanks @findepi and @mdesmet
  3. Allow spaces like
    sorted_by = ARRAY['c1', '  year(    a_date )   ')

Is this change a fix, improvement, new feature, refactoring, or other?

improvement, new feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

the Iceberg connector

How would you describe this change to a non-technical end user or system administrator?

Add sorted_by properties while creating an Iceberg table like the Hive table.

CREATE TABLE test_table (
    c1 integer,
    c2 varchar,
    c3 double)
WITH (
    format = 'PARQUET',
    sorted_by = ARRAY['c1', 'c2'],
    location = '/var/my_tables/test_table')

Related issues, pull requests, and links

Fixes https://github.com/trinodb/trino/issues/12447

Documentation

( ) No documentation is needed. ( ) Sufficient documentation is included in this PR. ( ) Documentation PR is available with #prnumber. ( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required. ( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`https://github.com/trinodb/trino/issues/12447`)

Will add changes to the doc as well.

osscm avatar Jun 15 '22 22:06 osscm

cc @findepi @alexjo2144 thanks!

osscm avatar Jun 15 '22 23:06 osscm

Rather than copying the code from @mdesmet 's PR, can you cherry-pick the commit from that branch and then add work in commits on top of it?

alexjo2144 avatar Jun 16 '22 19:06 alexjo2144

Rather than copying the code from @mdesmet 's PR, can you cherry-pick the commit from that branch and then add work in commits on top of it?

Thanks, @alexjo2144  tried that first, but the problem was @mdesmet code was changing only the Partition-related classes only like PartitionFields, TestPartitionFields, and BaseIcebergConnectorTest (changing partition-related test-code) So, cherry-picked those commits, will add the code changes for the Partitions related, and then have to change it, as my PR will not push anything partition related.

The borrowed code is in IcebergUtil 

  • common methods toIdentifier and fromIdentifier 
  • and common regex 

In this PR, we are adding new test methods and classes (SortFields, TestSortFields), so we have to create a parallel to the PartitionFields as the SortFields.

Not sure about the process, but thought of adding @mdesmet as the co-author if that make sense.

osscm avatar Jun 17 '22 17:06 osscm

The borrowed code is in IcebergUtil 

  • common methods toIdentifier and fromIdentifier
  • and common regex

I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to IcebergUtil in a separate commit. That would probably make this easier to cherrypick.

mdesmet avatar Jun 17 '22 18:06 mdesmet

The borrowed code is in IcebergUtil 

  • common methods toIdentifier and fromIdentifier
  • and common regex

I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to IcebergUtil in a separate commit. That would probably make this easier to cherrypick.

sure @mdesmet, I can pull only that commit which has the change only to the IcebergUtil, thanks!

osscm avatar Jun 17 '22 18:06 osscm

The borrowed code is in IcebergUtil 

  • common methods toIdentifier and fromIdentifier
  • and common regex

I had also raised this in the PR. This would also apply to orc_bloom_filter defintions etc, i'll move this code to IcebergUtil in a separate commit. That would probably make this easier to cherrypick.

sure @mdesmet, I can pull only that commit which has the change only to the IcebergUtil, thanks!

@mdesmet https://github.com/trinodb/trino/pull/12227/commits seems to the commit only toIdentifier and fromIdentifier methods in IcebergUtil is removed, as I was trying to cherry-pick that. Please see, if you have time to add it, otherwise I can go ahead with the borrowed one, and mentioned as I did. thanks

osscm avatar Jul 04 '22 08:07 osscm

@mdesmet https://github.com/trinodb/trino/pull/12227/commits seems to the commit only toIdentifier and fromIdentifier methods in IcebergUtil is removed, as I was trying to cherry-pick that. Please see, if you have time to add it, otherwise I can go ahead with the borrowed one, and mentioned as I did. thanks

@osscm: Those commits were squashed based on the feedback in the PR. I still suggest you to cherry pick the relevant commit and reuse the provided methods in IcebergUtil. However it's not guaranteed that you won't have to reapply the commit again it if some new feedback is raised.

mdesmet avatar Jul 04 '22 16:07 mdesmet

This PR has been inactive for 4 weeks. What's the status here - do we need it changed, or is it waiting on that other PR for direction?

cc @ebyhr

colebow avatar Aug 01 '22 16:08 colebow

This PR has been inactive for 4 weeks. What's the status here - do we need it changed, or is it waiting on that other PR for direction?

cc @ebyhr

@colebow I have to resume on this PR if there are any pending commits.

https://github.com/trinodb/trino/pull/12227 PR was using quote identifier in partitions, I have tried to use the commit from that PR, but it was not mergeable. As that PR is approved, May be can wait for the merge.

osscm avatar Aug 02 '22 15:08 osscm

i merged https://github.com/trinodb/trino/pull/12227 (thanks @mdesmet !) @osscm can you please rebase?

findepi avatar Aug 08 '22 12:08 findepi

i merged #12227 (thanks @mdesmet !) @osscm can you please rebase?

sure @findepi

osscm avatar Aug 22 '22 14:08 osscm

it seems the PR adds ability to declare sorted_by, but doesn't implement actual sorting. It's a good step, but not something we would want to release, so not something we would want to merge. @osscm do you also want to implement sorting writer? or I can probably find someone on my team to work on this, if you prefer this way.

findepi avatar Aug 22 '22 18:08 findepi

Sure @findpi I can also work on implementing the writer logic as well.

On Mon, Aug 22, 2022 at 11:52 AM Piotr Findeisen @.***> wrote:

it seems the PR adds ability to declare sorted_by, but doesn't implement actual sorting. It's a good step, but not something we would want to release, so not something we would want to merge. @osscm https://github.com/osscm do you also want to implement sorting writer? or I can probably find someone on my team to work on this, if you prefer this way.

— Reply to this email directly, view it on GitHub https://github.com/trinodb/trino/pull/12872#issuecomment-1222782053, or unsubscribe https://github.com/notifications/unsubscribe-auth/AXQ2PYSYXYJX3SSIHNJTA2DV2PD7LANCNFSM5Y427YJQ . You are receiving this because you were mentioned.Message ID: @.***>

osscm avatar Aug 24 '22 01:08 osscm

Sure @homar, will change it, it should be sortedOrder, and not partitioning.

On Tue, Sep 6, 2022 at 2:53 PM Homar @.***> wrote:

@.**** commented on this pull request.

In plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTableProperties.java https://github.com/trinodb/trino/pull/12872#discussion_r964210266:

@@ -121,6 +131,13 @@ public static List<String> getPartitioning(Map<String, Object> tableProperties) return partitioning == null ? ImmutableList.of() : ImmutableList.copyOf(partitioning); }

  • @SuppressWarnings("unchecked")
  • public static List<String> getSortOrder(Map<String, Object> tableProperties)
  • {
  •    List<String> partitioning = (List<String>) tableProperties.get(SORTED_BY_PROPERTY);
    

I am little confused by this naming, it is column order or something like that not partitioning, or I don't understand something

— Reply to this email directly, view it on GitHub https://github.com/trinodb/trino/pull/12872#pullrequestreview-1098298685, or unsubscribe https://github.com/notifications/unsubscribe-auth/AXQ2PYUKAVL3GMDN67NYIMTV464O7ANCNFSM5Y427YJQ . You are receiving this because you were mentioned.Message ID: @.***>

osscm avatar Sep 07 '22 06:09 osscm

@alexjo2144 is it possible to start the review for the DDL part, meanwhile I'll add the writer part as well. It will help to get going, and get early feedback, as overall scope is also a bit bigger.

osscm avatar Sep 22 '22 08:09 osscm

it seems the PR adds ability to declare sorted_by, but doesn't implement actual sorting. It's a good step, but not something we would want to release, so not something we would want to merge. @osscm do you also want to implement sorting writer? or I can probably find someone on my team to work on this, if you prefer this way.

@findepi do you think, if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer. If PR is bit bigger it becomes difficult to review and merge. What do you think.

osscm avatar Sep 27 '22 06:09 osscm

if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer.

Yes, it makes sense to keep changes separate, as in separate commits.

However, we don't like to expose user-facing functionality that does nothing. If we merge this PR, we expose Trino Iceberg connector table property for defining sort, which is ignored by Trino. That makes Trino users confused and sad. We want to expose this to users when it actually works.

@alexjo2144 could you please help here? would you be able to implement a sorting writer on top of this PR?

findepi avatar Oct 13 '22 11:10 findepi

if we can keep this PR small so that it's easy to review and merge. and then in the 2nd PR we can support Sorting Writer.

Yes, it makes sense to keep changes separate, as in separate commits.

However, we don't like to expose user-facing functionality that does nothing. If we merge this PR, we expose Trino Iceberg connector table property for defining sort, which is ignored by Trino. That makes Trino users confused and sad. We want to expose this to users when it actually works.

@alexjo2144 could you please help here? would you be able to implement a sorting writer on top of this PR?

sure @findepi Please let me know if any changes are required to support sorted_by for DDL statements.

osscm avatar Oct 15 '22 01:10 osscm

Superseded by https://github.com/trinodb/trino/pull/14891

findepi avatar Nov 14 '22 22:11 findepi