iceberg API: Add view interfaces

Co-authored-by: [email protected]

Jun 01 '22 04:06 jzhuge

Split from #4567

Jun 01 '22 04:06 jzhuge

@jzhuge I think you wanted to link to #4657 here :)

Jun 01 '22 10:06 nastra

Thanks @amogh-jahagirdar.

BTW, I am also ok if we decide to remove UpdateViewProperties and just use the existing UpdateProperties.

Jun 20 '22 06:06 jzhuge

@rdblue do you have any additional comment for adding these view APIs?

Jun 20 '22 15:06 jackye1995

If no more comment, is it possible to get it merged so that I can use it in the Core PR?

Jun 30 '22 14:06 jzhuge

Hi @danielcweeks, could you please take a look and merge if it looks ok?

Jul 19 '22 16:07 jzhuge

Hi @danielcweeks, have you got a chance to take a look?

Aug 15 '22 16:08 jzhuge

@jzhuge or @anjalinorwood I feel like I'm missing something in terms of where we actually access the view content / text? I would think that would live in ViewRepresentation or ViewVersion?

Aug 15 '22 17:08 danielcweeks

@danielcweeks I'll let @jzhuge @anjalinorwood confirm but the model at least in my mind is that there is an implementation of ViewRepresentation; for the SQL view representation, this would be SqlViewRepresentation which would expose a sql() method for surfacing the literal text. Down the line there would be other view representation types such as for substrait which would expose the view representation in their own way (serialized plan node for example).

Aug 21 '22 19:08 amogh-jahagirdar

Merged Amogh's PR, rebased, and applied spotless.

Aug 30 '22 21:08 jzhuge

@danielcweeks @stevenzwu @rdblue Any more comments on the API changes? Would be nice so that we could get to the core implementation.

Sep 11 '22 08:09 amogh-jahagirdar

Follow up from our discussion in yesterday's sync.

Create View Semantics

I think we'll probably need to better define what createView exactly entails. There are 2 cases I can think of that should be handled in createView.

CreateView can either:

1.) Create a completely new view with an initial set of representations for the view. i.e. the view is not even defined in the catalog

OR

2.) Add new representation(s) to an existing view if and only if the new representations or sql dialects don't already exist in the current view (if they exist this means that a user must replace the view). This case should still increment the view version.

This could be a separate API AddViewRepresentation but this puts more work on a caller to identify if a given view representation or dialect already exists in a view which does not seem like the right behavior.

Schema Validation of different representations

There's another issue here for if and how does CreateView validate that the representations all have the expected schema for case 2? Considering a view represents a result of a computation there must be a single schema. Following this, ideally CreateView should validate that the schema is the same for the representations.

This is challenging considering different representations and SQL dialects. In the long run, I think projects like Coral or Substrait would be used to help perform the schema validation across the different SQL dialects. As an extension, even there could be a Substrait View Representation down the line.

@wmoustafa @jacques-n Any recommendations on a good way for handling this schema validation within the view format itself? I'm taking a look in Coral and Substrait (apologies if this is a naive question)

For addressing schema validation in the short term, I think in createView API engines should pass in the schema for the dialects. Then createView performs a validation that the passed in schema aligns with the view schema. This could be bypassed by an engine just passing in the view.schema() even though it doesn't actually align with the view definition but I think that handling that case isn't required now.

Happy to get thoughts and suggestions on this! @jzhuge @danielcweek @jackye1995

Sep 18 '22 02:09 amogh-jahagirdar

@amogh-jahagirdar very good points. Following this line of thought, we should probably also address the question of whether the SQL text should be validated in Iceberg/the API implementation, or validation is left to the enclosing engine. If it is validated in Iceberg/the API implementation, then Iceberg/implementation needs to understand the dialects as well. If it is left to the enclosing engine, there is the risk of passing an invalid view text.

Similarly, whether it is responsibility of Iceberg (or the API implementation) to ensure the semantic equivalence of all the representations (and not only the schemas) when the intention is to add more dialects of the same logic.

Finally, if we assume there is a translation tool, we may consider if it makes sense to store one representation (dialect) and let the remaining representations (dialects) be handled by translation, as opposed to letting the caller generate all the representations, which as you indicated, could fall out of sync.

By the way, I believe adding new representations (e.g., dialects) of semantically equivalent views (case 2 above) should not increment the view version. Do you agree? It seems this might result in having to distinguish between view representations that only add a dialect and preserve the semantics and view representations that change the semantics.

To answer your question about Coral schema validation, Coral infers the schema of a view given the view text and base table schemas (see some test cases). It uses Avro schemas for representing the schemas (due to its rich semantics), but extension to use Iceberg schemas is straightforward.

Sep 21 '22 08:09 wmoustafa

Should the spec define rules for schema evolution of views across different versions? The implementation should probably enforce them.

Sep 21 '22 08:09 wmoustafa

Following this line of thought, we should probably also address the question of whether the SQL text should be validated in Iceberg/the API implementation, or validation is left to the enclosing engine. If it is validated in Iceberg/the API implementation, then Iceberg/implementation needs to understand the dialects as well. If it is left to the enclosing engine, there is the risk of passing an invalid view text.

I would vote for leaving to the enclosing engine or translation tool to validate the correctness and semantic equivalence in the initial API implementation.

Imagine in the most manual way, there is a data engineer manually adding 2 representations for Spark and Trino, and the engineer does the manual translation, and the translation might not be perfect but works for their use case. Then the engineer invokes this API to update the Iceberg view spec. We should make sure at least this use case can proceed without issue without Iceberg throwing unnecessary validation exceptions.

We can always introduce features like CoralValidatedSqlRepresentation for people who would like to use certain tool, and plugin points in the Iceberg library, such that the integrations could live as plugin packages that don't even need to live in the Iceberg repository.

I think we'll probably need to better define what createView exactly entails. There are 2 cases I can think of that should be handled in createView.

Based on the logic above, I would vote for (1), which is to create a brand new view, and have another API for updateViewRepresentations.

Sep 22 '22 17:09 jackye1995

Thanks @wmoustafa for your thoughts and sorry for the late reply. So here are my thoughts:

Following this line of thought, we should probably also address the question of whether the SQL text should be validated in Iceberg/the API implementation, or validation is left to the enclosing engine. If it is validated in Iceberg/the API implementation, then Iceberg/implementation needs to understand the dialects as well. If it is left to the enclosing engine, there is the risk of passing an invalid view text.

Yes, after some thought, my thinking is in the long run the Iceberg View format should enforce correctness and semantic equality between the different representations. However, I think for an initial implementation though it suffices to just put it on the engine or translation layer to validate. Later on, validation can be implemented within the View API.

By the way, I believe adding new representations (e.g., dialects) of semantically equivalent views (case 2 above) should not increment the view version. Do you agree? It seems this might result in having to distinguish between view representations that only add a dialect and preserve the semantics and view representations that change the semantics.

So view versions in my mind should represent every single change to the view, similar to there's a new table metadata version regardless of if there is a new snapshot or not. Another argument is that view version should only increment for semantic or schema changes in the view (which if i'm not mistaken is what you're suggesting), but I think the mental model would be hard to grasp here when people try and compare with Iceberg table specification.

Also for an initial implementation if we don't have any validation controls, then I think it becomes necessary to increment on any new version representations which are updated, so that people can rollback the view if it's in a state they do not like.

I'm on board with having a separated updateViewRepresentations which has a "merge" like semantic for updating view representations (either replacing existing ones of the same type and dialect OR if non-existing, create the new representation)

Thoughts @jzhuge @danielcweeks ?

Sep 27 '22 22:09 amogh-jahagirdar

@jzhuge I saw schema was optional in the spec, I see the rationale here https://github.com/apache/iceberg/pull/3188#discussion_r748303084

I think we'll want a separate createView which also accepts schema. I've raised a PR to your branch https://github.com/jzhuge/iceberg/pull/2/files for addressing this and a definition for updateViewRepresentations for adding new representations of new types and replacing existing representations of the same type.

The PR also has some other fixes for long versionId in the history entry and doc adjustments.

Let me know what you think!

Sep 30 '22 17:09 amogh-jahagirdar

@jzhuge, I just had a look at this and it's mostly good. I think a few places should be updated though. Can you make the changes? Thanks!

Oct 09 '22 23:10 rdblue

@amogh-jahagirdar Are you ok with using int instead of long for version id?

Oct 13 '22 01:10 jzhuge

3 ways to update view metadata:

buildView() + create/replace(): create or replace the entire view
loadView().updateRepresentations() + commit(): update representations
loadView().updateProperties() + commit(): update properties

Oct 13 '22 05:10 jzhuge

To create a view with SQL representation:

ViewRepresentation sqlViewRepresentation = SQLViewRepresentation.builder()
    .sql("SELECT foo FROM base")
    .dialect("spark")
    .defaultCatalog("prodhive")
    .defaultNamespace(Collections.singleList("default"))
    .build()

viewCatalog.buildView(ident)
    .withSchema(schema)
    .withRepresentation(sqlViewRepresentation)
    .createOrReplace()

Oct 13 '22 05:10 jzhuge

Added ViewBuilder.withLocation and rebased to master branch.

Oct 15 '22 19:10 jzhuge

@rdblue Added buildSqlView for convenience, so the above example can be shortened as:

viewCatalog.buildSqlView(ident)
    .withSchema(schema)
    .withSql("SELECT foo FROM base")
    .withDialect("spark")
    .withDefaultCatalog("prodhive")
    .withDefaultNamespace(Collections.singleList("default"))
    .withProperty("k1", "v1")
    .withLocation(s3Loc)
    .createOrReplace()

Should we add this convenience api for common use cases? For builder method names, should we use with prefix or not? I see examples of both in the existing code.

Oct 16 '22 06:10 jzhuge

@rdblue @danielcweeks @stevenzwu @amogh-jahagirdar @jackye1995 @nastra Please take another look.

Summary of discussions:

ViewBuilder API focuses on SQL view first. Do not expose vague APIs to manage a list of representations.
Interface name ViewVersion not ideal, but no good alternative.
Version ID type is integer, not long.

Usage example:

viewCatalog.buildView(ident)
    .withSchema(schema)
    .withQuery("SELECT foo FROM base")
    .withDialect("spark")
    .withDefaultCatalog("prodhive")
    .withDefaultNamespace(Collections.singleList("default"))
    .withProperty("k1", "v1")
    .withLocation(s3Loc)
    .createOrReplace();

Oct 19 '22 00:10 jzhuge

Interface name ViewVersion not ideal, but no good alternative.

How about ViewMetadataVersion, ViewMetadataSnapshot, or simply MetadataSnapshot?

I am still unclear on whether we could store different dialects for the same definition as one version or multiple dialects (even for the same definition) imply different versions.

Related, how can a ViewVersion return a list of representations?

Finally, what in the example you gave instructs building a SQLViewRepresentation?

Oct 19 '22 01:10 wmoustafa

How about ViewMetadataVersion, ViewMetadataSnapshot

Like ViewMetadataVersion.

I am still unclear on whether we could store different dialects for the same definition as one version or multiple dialects (even for the same definition) imply different versions.

Multiple dialects are available at the same time, so in one version.

Related, how can a ViewVersion return a list of representations?

Each Version contains a list of representations. See view spec, section Versions.

Finally, what in the example you gave instructs building a SQLViewRepresentation?

In the created View, current version contains a single SQL view representation.

Oct 19 '22 02:10 jzhuge

Multiple dialects are available at the same time, so in one version.

What is an API example to achieve that?

Each Version contains a list of representations. See view spec, section Versions. In the created View, current version contains a single SQL view representation.

Could you clarify if the above two statements are consistent? It seems the former implies a version contains a list of representations and the latter indicates a version contains a single representation.

Finally, what in the example you gave instructs building a SQLViewRepresentation?

This question was about how to implement the builder. Will this builder result in creating a SQLViewRepresentation? If yes, how would it know it should create a SQLViewRepresentation vs something else (e.g., not SQL representation)?

Oct 19 '22 03:10 wmoustafa

Multiple dialects are available at the same time, so in one version.

What is an API example to achieve that?

Not exposed currently. Do you have a use case? What do u need in the API to achieve it?

Each Version contains a list of representations. See view spec, section Versions. In the created View, current version contains a single SQL view representation.

Could you clarify if the above two statements are consistent? It seems the former implies a version contains a list of representations and the latter indicates a version contains a single representation.

What I meant was the current builder API only supports creating a view view with a single representation, but the view spec and on-disk format does support multiple representations.

Finally, what in the example you gave instructs building a SQLViewRepresentation?

This question was about how to implement the builder. Will this builder result in creating a SQLViewRepresentation? If yes, how would it know it should create a SQLViewRepresentation vs something else (e.g., not SQL representation)?

Yes. The current view builder only supports SQLViewRepresentation. For other representations, one idea is to add more withXXX methods. Any suggestion?

Oct 19 '22 18:10 jzhuge

Not exposed currently. Do you have a use case? What do u need in the API to achieve it?

I do not necessarily have a use case, but was trying to understand the API contracts. It sounds like something that should be either explicitly supported or not supported by the API (i.e., without having to introduce a backward incompatibility in the future to support it), and probably we should show a pathway for this API to be compatible with the spec.

As a follow up, in case both multiple representations could share the same version, and additional representations could end up with a new version, semantically what does a new version mean? Is it an addition to the previous version, or a replacement? For example, if I add the Spark representation in one version and the equivalent Trino representation in another (same query but with different syntax), is it allowed? Should Spark queries start to fail?

Yes. The current view builder only supports SQLViewRepresentation. For other representations, one idea is to add more withXXX methods. Any suggestion?

Just making sure that withQuery is the one that signifies the SQLViewRepresentation. It sounded there was another source to express the fact that it is a SQL view (e.g., the fact that the dialect is Spark). Would be good to find out how to normalize those things (so there are not two independent hints about the same representation).

Oct 19 '22 19:10 wmoustafa

@wmoustafa Thanks for the valuable questions and feedbacks. Echo your point that it is confusing to store representations in different view versions.

I'd suggest this API contract:

Add the main SQL view representation and other dialects in buildView API so all are stored in one view version
For a new representation down the road, add a new interface and a new withXXX in buildView
No need for something like View.updateRepresentions(), as buildView() + replace() should be enough

Oct 24 '22 18:10 jzhuge

iceberg iceberg copied to clipboard

API: Add view interfaces

Create View Semantics

Schema Validation of different representations

iceberg
iceberg copied to clipboard