
View Spec implementation

Open c-thiel opened this issue 10 months ago • 5 comments

Implementation of the Iceberg V1 ViewMetadata

c-thiel avatar Apr 11 '24 10:04 c-thiel

Thanks! Nice work! @c-thiel

ZENOTME avatar Apr 11 '24 14:04 ZENOTME

@ZENOTME one feature I have not implemented yet is respecting "version.history.num-entries" as mentioned in the View Spec. I noticed that the table implementation doesn't implement it either. I don't think it matters much unless one is implementing a catalog, but I wanted to mention it.
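
For anyone picking this up later, a minimal sketch of what respecting that property could look like. All names here (ViewVersionLogEntry, trim_version_log) are hypothetical and not part of the current crate, and the default of 10 entries is an assumption that should be checked against the spec:

// Hypothetical sketch only; not the current iceberg-rust API.
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct ViewVersionLogEntry {
    version_id: i32,
    timestamp_ms: i64,
}

const VERSION_HISTORY_NUM_ENTRIES: &str = "version.history.num-entries";
// Assumed default; verify against the View Spec.
const ASSUMED_DEFAULT_NUM_ENTRIES: usize = 10;

/// Keep only the most recent version-log entries, as configured by the
/// "version.history.num-entries" view property.
fn trim_version_log(log: &mut Vec<ViewVersionLogEntry>, properties: &HashMap<String, String>) {
    let keep = properties
        .get(VERSION_HISTORY_NUM_ENTRIES)
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(ASSUMED_DEFAULT_NUM_ENTRIES);
    if log.len() > keep {
        let overflow = log.len() - keep;
        // Entries are assumed to be ordered oldest-first, so drop from the front.
        log.drain(0..overflow);
    }
}

fn main() {
    let mut log = vec![
        ViewVersionLogEntry { version_id: 1, timestamp_ms: 1 },
        ViewVersionLogEntry { version_id: 2, timestamp_ms: 2 },
    ];
    let props = HashMap::from([(VERSION_HISTORY_NUM_ENTRIES.to_string(), "1".to_string())]);
    trim_version_log(&mut log, &props);
    assert_eq!(log.len(), 1);
    assert_eq!(log[0].version_id, 2);
}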

c-thiel avatar Apr 12 '24 14:04 c-thiel

Thanks for working on this @c-thiel. Out of curiosity, and this is probably a broader discussion, but currently the view spec is built around SQL representations. This might not be the most natural fit for dataframe-based systems. Do you have any thoughts on how to represent views in iceberg-rust?

Fokko avatar Apr 15 '24 09:04 Fokko

@Fokko that's a hard topic.

The idealist in me would eventually like to see something like Substrait being used. Adoption of the project across engines is very slow, though.

Maybe the more practical short-term solution would be to stick with SQL representations and even limit them to one dialect - that of the "writer", including its technology. Then each client can decide whether there is a transpiler good enough for that dialect. Funnily enough, exactly this ambiguity - that there could be different dialects in the representations (say Spark and Trino) - gave me a lot of headaches. Which should I use if I am DataFusion? From which should I transpile? What do I do if they are not identical?

Sticking with multiple representations is, in my opinion, only an option if we let the catalog handle this complexity. Unless Substrait gets a significant boost, I would assume that the following approach presents the most stable API (assuming there is only the REST Catalog left ;) ):

  • A client would create a view as it does today - however limited to a single representation
  • Upon read, a client would additionally present a set of "understood" dialects. The catalog can either serve one of them or it can't (see the sketch below).

This way a REST catalog could even switch to Substrait internally in the future, and could start now by storing the presented SQL "as-is" without being able to serve any other dialect.

With SQLMesh coming up and Substrait continuing to develop, there is at least some movement in this area.
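
To make the second bullet above concrete, here is a minimal sketch of that read-time negotiation. None of this is an existing REST endpoint or crate API; SqlRepresentation and resolve_representation are hypothetical names:

// Sketch of the proposed read-time dialect negotiation; all names hypothetical.
#[derive(Debug, Clone)]
struct SqlRepresentation {
    dialect: String,
    sql: String,
}

/// The client lists the dialects it understands, in order of preference;
/// the catalog returns a matching representation or nothing at all.
fn resolve_representation(
    available: &[SqlRepresentation],
    understood_dialects: &[&str],
) -> Option<SqlRepresentation> {
    understood_dialects.iter().find_map(|wanted| {
        available
            .iter()
            .find(|r| r.dialect.eq_ignore_ascii_case(wanted))
            .cloned()
    })
}

fn main() {
    let available = vec![SqlRepresentation {
        dialect: "spark".to_string(),
        sql: "select * from spark_demo.my_table".to_string(),
    }];
    // A DataFusion client that can also consume Spark SQL via a transpiler:
    let chosen = resolve_representation(&available, &["datafusion", "spark"]);
    assert!(chosen.is_some());
}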

c-thiel avatar Apr 18 '24 12:04 c-thiel

@Fokko from my side this is good to merge - types are complete and tests are passing.

c-thiel avatar May 04 '24 12:05 c-thiel

@Fokko @ZENOTME are there any open points from your side that prevent us from merging the View Spec? If so, please let me know :)

c-thiel avatar Jun 20 '24 06:06 c-thiel

The idealist in me would eventually like to see something like Substrait being used.

There is an open discussion on evolving the spec into something like that. I think transpiling SQL will never work in the long run because there are so many variations of and extensions to SQL dialects (custom UDFs and such); maybe you can cover 90% of the SQL, which may be good enough.

@nastra would you have time to go over this PR?

Fokko avatar Jun 20 '24 08:06 Fokko

@nastra, @Fokko during testing we found a problem with "default-namespace". I am hoping for some insights from your side: according to the Iceberg view spec, "default-namespace" is required: https://iceberg.apache.org/view-spec/#versions. As such, it is modeled as NamespaceIdent: https://github.com/c-thiel/iceberg-rust/blob/d8a7aabce73888e5b22712d60ca95b856db66fc2/crates/iceberg/src/spec/view_version.rs#L56

Valid namespace identifiers have at least one element, which is validated by the NamespaceIdent constructor.

When creating views, Spark creates metadata objects that specify the "default-namespace" field, but as an empty vec. This is quite an unlucky situation, and I am not sure what the desired behavior is:

  1. My feeling is that this is a bug in the Iceberg spec. [] as "default-namespace" provides the same information as a non-required field (null), since a namespace without any element is unusable. However, views in Spark work, which shows that default-namespace is not always needed. Thus it should be optional.
  2. It's a bug in the Spark implementation, and in fact "default-namespace" should always contain at least one element.
  3. They are both right, in which case we would accept [] as a valid namespace and must change the Rust implementation.
  4. [] is in general not a valid namespace unless it is used as a "default-namespace", in which case some special behavior is expected. This however brings me back to 1 - why not just use null in this case?
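
To illustrate the conflict, here is a simplified model of the current validation. These are not the exact crate types; the sketch only mirrors the "at least one element" rule mentioned above and uses serde/serde_json:

// Simplified illustration of the conflict; not the exact iceberg-rust types.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(try_from = "Vec<String>")]
struct NamespaceIdent(Vec<String>);

impl TryFrom<Vec<String>> for NamespaceIdent {
    type Error = String;

    fn try_from(parts: Vec<String>) -> Result<Self, Self::Error> {
        // The constructor-level validation described above:
        if parts.is_empty() {
            return Err("namespace must have at least one element".to_string());
        }
        Ok(NamespaceIdent(parts))
    }
}

fn main() {
    // "[]" is what Spark sends for "default-namespace".
    let parsed: Result<NamespaceIdent, _> = serde_json::from_str("[]");
    assert!(parsed.is_err()); // the empty namespace is rejected, which is the problem described above
}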

c-thiel avatar Jul 02 '24 22:07 c-thiel

When creating views, Spark creates metadata objects that specify the "default-namespace" field, but as an empty vec

Is this with plain OSS Spark or are you using the iceberg-spark-runtime.jar from Iceberg? Just asking, because we're properly handling the namespace when creating a view in https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51. This code then calls https://github.com/apache/iceberg/blob/81b3310ab469408022cc14af51257b7e8b36614f/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L593.

It would be helpful to understand which Spark + Iceberg version you're using and how you're creating the view in Spark.

nastra avatar Jul 03 '24 06:07 nastra

Hi @nastra,

this script:

import pyspark
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pandas as pd

CATALOG_URL = "http://server:8080/catalog"
MANAGEMENT_URL = "http://server:8080/management"
DEMO_WAREHOUSE = "demo"

config = {
    "spark.sql.catalog.demo-catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo-catalog.type": "rest",
    "spark.sql.catalog.demo-catalog.uri": CATALOG_URL,
    "spark.sql.catalog.demo-catalog.warehouse": DEMO_WAREHOUSE,
    "spark.sql.catalog.demo-catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "demo-catalog",
    "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2",
}

spark_config = SparkConf().setMaster('local').setAppName("Iceberg-REST")
for k, v in config.items():
    spark_config = spark_config.set(k, v)

spark = SparkSession.builder.config(conf=spark_config).getOrCreate()
spark.sql("CREATE NAMESPACE IF NOT EXISTS spark_demo")
data = pd.DataFrame([[1, 'a-string', 2.2]], columns=['id', 'strings', 'floats'])
sdf = spark.createDataFrame(data)
sdf.writeTo("spark_demo.my_table").createOrReplace()
spark.sql("CREATE view spark_demo.vv as select * from spark_demo.my_table")

ends up sending this POST to /catalog/v1/df910b7e-3912-11ef-a048-6b2680efd54d/namespaces/spark_demo/views:

{
  "name": "vv",
  "schema": {
    "schema-id": 0,
    "type": "struct",
    "fields": [
      {
        "id": 0,
        "name": "id",
        "required": false,
        "type": "long"
      },
      {
        "id": 1,
        "name": "strings",
        "required": false,
        "type": "string"
      },
      {
        "id": 2,
        "name": "floats",
        "required": false,
        "type": "double"
      }
    ]
  },
  "view-version": {
    "version-id": 1,
    "schema-id": 0,
    "timestamp-ms": 1719994069884,
    "summary": {
      "app-id": "local-1719994018459",
      "engine-name": "spark",
      "iceberg-version": "Apache Iceberg 1.5.2 (commit cbb853073e681b4075d7c8707610dceecbee3a82)",
      "engine-version": "3.5.1"
    },
    "representations": [
      {
        "type": "sql",
        "sql": "select * from spark_demo.my_table",
        "dialect": "spark"
      }
    ],
    "default-namespace": []
  },
  "properties": {
    "create_engine_version": "Spark 3.5.1",
    "spark.query-column-names": "id,strings,floats",
    "engine_version": "Spark 3.5.1"
  }
}

twuebi avatar Jul 03 '24 08:07 twuebi

Setting the namespace via:

spark.sql("USE spark_demo")
spark.sql("CREATE view vv3 as select * from spark_demo.mytable")

ends up setting default-namespace:

{
  "name": "vv3",
  "schema": {
    "schema-id": 0,
    "type": "struct",
    "fields": [
      {
        "id": 0,
        "name": "id",
        "required": false,
        "type": "long"
      },
      {
        "id": 1,
        "name": "strings",
        "required": false,
        "type": "string"
      },
      {
        "id": 2,
        "name": "floats",
        "required": false,
        "type": "double"
      }
    ]
  },
  "view-version": {
    "version-id": 1,
    "schema-id": 0,
    "timestamp-ms": 1719994988301,
    "summary": {
      "engine-version": "3.5.1",
      "app-id": "local-1719994018459",
      "engine-name": "spark",
      "iceberg-version": "Apache Iceberg 1.5.2 (commit cbb853073e681b4075d7c8707610dceecbee3a82)"
    },
    "representations": [
      {
        "type": "sql",
        "sql": "select * from spark_demo.my_table",
        "dialect": "spark"
      }
    ],
    "default-namespace": [
      "spark_demo"
    ]
  },
  "properties": {
    "spark.query-column-names": "id,strings,floats",
    "engine_version": "Spark 3.5.1",
    "create_engine_version": "Spark 3.5.1"
  }
}

twuebi avatar Jul 03 '24 08:07 twuebi

"default-namespace": [] indicates an empty namespace, not a null one, which is a valid case and there are tests for that here: https://github.com/apache/iceberg/blob/42a2c19cec31c626cbff6cc2dfafb86cdf223bd0/core/src/test/java/org/apache/iceberg/view/TestViewMetadataParser.java#L56. I've also opened https://github.com/apache/iceberg/pull/9890 a while ago to properly test empty namespaces with catalogs that support it.

However, why you're getting an empty namespace when creating the view with a qualified namespace is a different thing, which I need to look into and reproduce on my end - I'll try to do that before the end of the week.

nastra avatar Jul 03 '24 09:07 nastra

Ok, I checked the surrounding code and the handling is correct. In https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51 we always determine the current namespace (which can be empty), and that is what ViewVersion#defaultNamespace is set to. As I mentioned earlier, there's a difference between a null namespace and an empty one.

Spark later uses this info in https://github.com/apache/iceberg/blob/6bbf70a52ebccfaba4e7e08facd72b84b571e2a6/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkView.java#L75 in case there's no namespace in the underlying SQL; in your case you fully qualified select * from spark_demo.my_table with a namespace.

In your second example you configured the current namespace via USE <namespace>, which was then evaluated in https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51 and passed to ViewVersion#defaultNamespace (but not used when actually querying the view).

nastra avatar Jul 03 '24 10:07 nastra

Thanks for checking this @nastra. I understand then that we should relax the constraints of NamespaceIdent, since an empty default-namespace is expected.
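
A sketch of one way that relaxation could look, purely illustrative and not the actual change in the crate: deserialize the field as a plain list so an empty namespace round-trips, while the field itself stays required. ViewVersionSketch is a made-up name, and serde/serde_json are assumed:

// Illustrative only; not the actual iceberg-rust change.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ViewVersionSketch {
    #[serde(rename = "version-id")]
    version_id: i32,
    // [] means "empty namespace" (valid), while a missing field is still an
    // error because the spec marks default-namespace as required.
    #[serde(rename = "default-namespace")]
    default_namespace: Vec<String>,
}

fn main() {
    let json = r#"{ "version-id": 1, "default-namespace": [] }"#;
    let v: ViewVersionSketch = serde_json::from_str(json).unwrap();
    assert!(v.default_namespace.is_empty()); // accepted, unlike the strict model earlier

    let missing = r#"{ "version-id": 1 }"#;
    assert!(serde_json::from_str::<ViewVersionSketch>(missing).is_err()); // still required
}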

twuebi avatar Jul 04 '24 08:07 twuebi

@nastra I hope I addressed all your issues regarding functionality. If you are happy with the implementations, feel free to close our discussions :). @ZENOTME it would now be great to get some feedback on the Rust side of things. Please check my discussions with Eduard above. There are a few open points regarding naming and also the introduction of types. It would be great to get a review from you or some other Rust maintainer for this.

c-thiel avatar Jul 13 '24 10:07 c-thiel

@nastra just following up on my last comment - could you close our discussions where you agree with my fix? @liurenjie1024 could you maybe have a look at the Rust side of things? Please check my discussions with Eduard above. There are a few open points regarding naming and also the introduction of types. It would be great to get a review from you or some other Rust maintainer for this.

c-thiel avatar Jul 17 '24 10:07 c-thiel

@nastra just following up on my last comment - could you close our discussions where you agree with my fix? @liurenjie1024 could you maybe have a look at the Rust side of things? Please check my discussions with Eduard above. There are a few open points regarding naming and also the introduction of types. It would be great to get a review from you or some other Rust maintainer for this.

Sure, I'll find time this week to go through this.

liurenjie1024 avatar Jul 17 '24 14:07 liurenjie1024

cc @Xuanwo @nastra @Fokko to take another look

liurenjie1024 avatar Jul 26 '24 07:07 liurenjie1024

cc @Xuanwo @nastra @Fokko to take another look

Hi, thanks for the invitation, but I'm not familiar with this topic. Also, apologies for not being involved from the start. I will delegate my decision-making to @Fokko instead.

Xuanwo avatar Jul 26 '24 10:07 Xuanwo

Rebased on current master

c-thiel avatar Jul 27 '24 12:07 c-thiel

I think this PR is good enough, so I will merge it since it has been open for a long time. Thanks @c-thiel for your effort on this, and thanks @Xuanwo @nastra @ZENOTME for the reviews!

liurenjie1024 avatar Jul 28 '24 14:07 liurenjie1024