
Making Comet Common Module Engine Independent

huaxingao opened this issue 1 year ago · 12 comments

I'd like to initiate a discussion about making the Comet common module independent of Spark.

This issue emerged during the Comet/Iceberg integration. The Iceberg community is seeking a more generic column reader that takes an Arrow data type and reads data into an Arrow vector. We are considering separating out all Spark-dependent code, keeping only the generic reading logic in the common module, and adapting to Spark's ColumnVector in the Spark module. This change would make Comet more generic and applicable to other engines as well. While this shift is promising, it involves a significant amount of work. I'd like to gather opinions on this proposed direction and understand if there are any concerns or issues that you foresee.
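For illustration only, here is a minimal sketch of what such an engine-agnostic reader contract might look like. The name `CometColumnReader` and its methods are hypothetical, not an existing Comet API:

```scala
// Hypothetical sketch (not an existing Comet API) of an engine-agnostic
// column reader in comet-common. It speaks only Arrow: the requested column
// is described by an Arrow Field and the output is an Arrow ValueVector,
// with no Spark classes involved.
import org.apache.arrow.vector.ValueVector
import org.apache.arrow.vector.types.pojo.Field

trait CometColumnReader extends AutoCloseable {
  /** Arrow description of the column this reader produces. */
  def arrowField: Field

  /** Read up to `batchSize` values and return them as an Arrow vector. */
  def readBatch(batchSize: Int): ValueVector
}
```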

huaxingao avatar Apr 25 '24 21:04 huaxingao

cc @sunchao

viirya avatar Apr 25 '24 21:04 viirya

also cc @andygrove @parthchandra @snmvaughan @kazuyukitanimura

huaxingao avatar Apr 25 '24 21:04 huaxingao

Seems to me it would be a step in the right direction. The idea that comet-common should be independent of any engine is sound, and it would be a necessary first step towards integration with other engines (e.g. Presto/Trino). The Parquet file reader already looks mostly independent of Spark. The ColumnReaders and CometVector itself could be made more generic using only Arrow types and Arrow vectors, while the adaptation to operate as a Spark columnar batch could be moved into comet-spark (see the sketch below). This may involve a fair amount of difficult refactoring, but, imho, it would be worth it.
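As a rough illustration of that split, a Spark-side adapter in comet-spark could wrap an Arrow vector as a Spark ColumnVector. This is a hedged sketch under the assumptions above: `ArrowSparkColumnVector` is a hypothetical name, and only a few of ColumnVector's many getters are shown:

```scala
// Hypothetical adapter living in comet-spark (not comet-common): wraps an
// Arrow vector produced by the engine-agnostic reader as a Spark
// ColumnVector. Left abstract because a real implementation would need to
// cover every getter ColumnVector declares.
import org.apache.arrow.vector.{IntVector, ValueVector}
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.vectorized.ColumnVector

abstract class ArrowSparkColumnVector(sparkType: DataType, vector: ValueVector)
    extends ColumnVector(sparkType) {

  // Null handling delegates directly to the Arrow validity buffer.
  override def hasNull: Boolean = vector.getNullCount > 0
  override def numNulls: Int = vector.getNullCount
  override def isNullAt(rowId: Int): Boolean = vector.isNull(rowId)

  // Example getter: delegate to the concrete Arrow vector type.
  override def getInt(rowId: Int): Int =
    vector.asInstanceOf[IntVector].get(rowId)

  // Releases the underlying Arrow buffers.
  override def close(): Unit = vector.close()
}
```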

parthchandra avatar Apr 25 '24 21:04 parthchandra

I am +1 for making comet-common Arrow-native and easier to integrate with other engines. Let me know how I can help.

andygrove avatar Apr 25 '24 23:04 andygrove

This sounds like a good direction to go. In the short term it might add some additional work, though, since it requires us to refactor the common and spark modules.

Currently I'm still not sure about integrations with other engines. It is a great target, I think, although to me it seems a little far beyond the current project status and bandwidth. 😄

I would like to focus on the Spark integration at the current stage. But if this refactoring is necessary to move the Iceberg integration forward now, I will support it.

viirya avatar Apr 26 '24 00:04 viirya

+1 for this direction.

We can start migrating in this direction by moving the subset of Utils.scala that is specific to the mapping between Spark and Arrow, as sketched below.
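For a sense of what that subset covers, here is a simplified stand-in for the Spark-to-Arrow type mapping. This is illustrative only, not the actual contents of Utils.scala; the real mapping handles many more types (decimals, timestamps with time zones, nested types, ...):

```scala
// Simplified, illustrative sketch of the Spark -> Arrow type mapping that
// could be pulled out of Utils.scala into an engine-specific module.
import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.types._

object SparkArrowTypeMapping {
  def toArrowType(dt: DataType): ArrowType = dt match {
    case BooleanType => ArrowType.Bool.INSTANCE
    case IntegerType => new ArrowType.Int(32, true)
    case LongType    => new ArrowType.Int(64, true)
    case FloatType   => new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)
    case DoubleType  => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
    case StringType  => ArrowType.Utf8.INSTANCE
    case other       => throw new UnsupportedOperationException(s"Unsupported type: $other")
  }
}
```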

snmvaughan avatar Apr 26 '24 14:04 snmvaughan

I'm +1 for this direction in the long term and I can help review the Iceberg integration if needed.

In the short term, though, I think Iceberg could integrate Comet in its iceberg-spark module, which doesn't require Comet's common module to be engine independent. So it would be great if we could make this work incrementally, for example:

  1. release Comet 0.1 (or whatever the first version is) first
  2. integrate Comet in Iceberg's spark module
  3. refactor and make the Comet common module engine independent incrementally over the next release or several releases
  4. integrate Comet in Iceberg's arrow/comet module and make the vectorized reader generally available to other engines in the iceberg repo

advancedxy avatar Apr 29 '24 07:04 advancedxy

@advancedxy Good suggestions. I believe this Issue is to address point 3 above while 1 and 2 are in progress.

parthchandra avatar Apr 29 '24 17:04 parthchandra

> @advancedxy Good suggestions. I believe this Issue is to address point 3 above while 1 and 2 are in progress.

Thanks for the clarification, it makes total sense then.

advancedxy avatar Apr 30 '24 09:04 advancedxy

The original purpose of the comet-common module is to be engine-agnostic so it can be used for other use cases like Iceberg. Unfortunately we didn't have time to make it completely isolated, so it is still tightly coupled with Spark in several ways, such as the Parquet -> catalyst schema conversion, ColumnVector, and, later on, a bunch of shuffle-related stuff that is all closely tied to Spark.

If necessary, we can perhaps consider splitting the module further into comet-parquet, comet-spark-shuffle, etc. For the Parquet part, we may need to define something like CometDataType, which gets converted from the Parquet schema and from which we can derive the Spark catalyst data type or the Iceberg data type.
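To make the idea concrete, a hypothetical CometDataType could be a small engine-neutral ADT, with per-engine converters kept out of comet-common. All names here are illustrative, not an existing API:

```scala
// Hypothetical sketch of an engine-neutral CometDataType, derived from the
// Parquet schema, with per-engine conversion kept in the engine modules.
sealed trait CometDataType
object CometDataType {
  case object CometBoolean extends CometDataType
  case object CometInt32 extends CometDataType
  case object CometInt64 extends CometDataType
  case object CometString extends CometDataType
  final case class CometDecimal(precision: Int, scale: Int) extends CometDataType
}

// Would live in comet-spark, not comet-common: maps the neutral type to
// Spark's catalyst types (an Iceberg module could have its own converter).
object SparkTypeConverter {
  import org.apache.spark.sql.types._
  import CometDataType._

  def toCatalyst(dt: CometDataType): DataType = dt match {
    case CometBoolean       => BooleanType
    case CometInt32         => IntegerType
    case CometInt64         => LongType
    case CometString        => StringType
    case CometDecimal(p, s) => DecimalType(p, s)
  }
}
```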

sunchao avatar May 02 '24 04:05 sunchao

> For the Parquet part, we may need to define something like CometDataType which gets converted from the Parquet schema, and from which we can derive Spark catalyst data type or Iceberg data type.

How about Arrow types as the canonical data types for Comet? org.apache.spark.sql.util.ArrowUtils has conversions between Arrow and Spark schema/types.
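For reference, a sketch of what leaning on those conversions could look like. Assumptions worth flagging: `ArrowTypeBridge` is a hypothetical name; ArrowUtils is `private[sql]`, so any outside caller would have to live under the `org.apache.spark.sql` package tree (or behind a shim), and its signatures vary slightly across Spark versions:

```scala
// Sketch of bridging types through Spark's built-in ArrowUtils.
// Assumption: because ArrowUtils is private[sql], this shim is placed under
// the org.apache.spark.sql package tree so that it compiles.
package org.apache.spark.sql.comet.shims

import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.util.ArrowUtils

object ArrowTypeBridge {
  // Spark type -> Arrow type; timeZoneId only matters for timestamp types.
  def toArrow(dt: DataType, timeZoneId: String = "UTC"): ArrowType =
    ArrowUtils.toArrowType(dt, timeZoneId)

  // Arrow type -> Spark catalyst type.
  def fromArrow(at: ArrowType): DataType =
    ArrowUtils.fromArrowType(at)
}
```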

parthchandra avatar May 02 '24 23:05 parthchandra

I think it makes more sense to use Arrow types as the bridge between Comet and the Parquet reader in Iceberg.

viirya avatar May 03 '24 01:05 viirya