
Decouple xskipper from Spark

Open · guykhazma opened this issue 4 years ago · 0 comments

This issue will track the progress of decoupling xskipper from Spark.

Xskipper currently works solely with Spark and cannot be used out of the box with other engines. This dependency manifests itself in the following areas:

  • Metadata format - The metadata is saved in Parquet, with the index details (index metadata) stored in the schema metadata of the DataFrame (see here). This metadata ends up in the Parquet key-value store under Spark's schema key org.apache.spark.sql.parquet.row.metadata. To decouple Xskipper metadata from Spark, we should store the metadata under a separate key in the Parquet key-value store (a rough sketch of reading such a key appears after this list). Note that an alternative is to maintain the metadata in a separate file (such as JSON), but the advantage of the approach above is that it makes each Parquet file a standalone index file which can be used as a lookup for the indexes on the objects it contains.

  • Bloom filter representation - The current implementation uses Spark's bloom filter. To make the bloom filter usable across engines, we should switch to a more standard, engine-neutral implementation (a sketch of a portable option appears after this list).

  • Filtering code path - 

    • The entry point to the filtering code path is the DataSkippingFileFilter class. This class handles the skipping logic by exposing two methods:

      • init - takes as input the filters, along with (optional) metadataFilterFactories and clauseTranslators, and initializes the state of the filter (typically by retrieving all indexed and required files)
      • isRequired - called on each file and returns false if the file can be skipped.

      Currently the class takes a SparkSession as one of its constructor parameters. In order to enable pluggability with multiple engines, the class should be changed to have an empty constructor, and the methods above should be reworked to be engine-agnostic (one example can be seen in the prototype for the Iceberg integration). A rough sketch of such an interface appears after this list.

    • Expression Tree representation - The DataSkippingFileFilter class receives the expression tree of the data/partition filters that were applied on the table and tries to construct an abstract query as described here. Currently we use Spark's Expression to represent the applied filters; this should change to an internal Expression Tree so that each integration translates its own expressions to the internal Expression Tree classes. This means changing all of the filters to use the new Expression Tree classes - for example, translating Spark's Expression to these representations, or doing a translation similar to the one in the prototype for Iceberg, where we currently translate Iceberg's expressions to Spark Expressions as a quick way to demonstrate xskipper integration with Iceberg. A minimal sketch of such an internal expression tree appears after this list.
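
Below are a few rough sketches for the items above. For the metadata format, this is a minimal sketch (in Scala, assuming parquet-hadoop) of reading an xskipper-specific key from a Parquet footer; the key name com.xskipper.index.metadata is purely illustrative and not a decided format:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object IndexMetadataReader {
  // Hypothetical dedicated key for xskipper metadata, separate from
  // Spark's org.apache.spark.sql.parquet.row.metadata schema key.
  val XskipperMetadataKey = "com.xskipper.index.metadata"

  /** Reads the xskipper index metadata (if any) from a Parquet footer. */
  def readIndexMetadata(path: String, conf: Configuration): Option[String] = {
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), conf))
    try {
      // The footer key-value metadata is a plain Map[String, String],
      // so any engine with a Parquet reader can access this key.
      Option(reader.getFooter.getFileMetaData.getKeyValueMetaData.get(XskipperMetadataKey))
    } finally {
      reader.close()
    }
  }
}
```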
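For the bloom filter representation, one possible direction (purely as an illustration, not a decided choice) is a self-contained implementation such as Guava's BloomFilter, whose serialized form does not depend on Spark and can be read back by any JVM-based engine:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

object PortableBloomFilter {
  /** Builds a bloom filter over string values and serializes it to bytes. */
  def serialize(values: Seq[String], expectedItems: Long, fpp: Double): Array[Byte] = {
    val filter = BloomFilter.create(
      Funnels.stringFunnel(StandardCharsets.UTF_8), expectedItems, fpp)
    values.foreach(v => filter.put(v))
    val out = new ByteArrayOutputStream()
    filter.writeTo(out) // byte format is independent of the query engine
    out.toByteArray
  }

  /** Deserializes the filter and checks whether a value might be present. */
  def mightContain(bytes: Array[Byte], value: String): Boolean = {
    val filter = BloomFilter.readFrom(
      new ByteArrayInputStream(bytes), Funnels.stringFunnel(StandardCharsets.UTF_8))
    filter.mightContain(value)
  }
}
```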
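For the DataSkippingFileFilter changes, a minimal sketch of what an engine-agnostic interface could look like; all type names here (XskipperExpression, FileDescriptor, etc.) are hypothetical placeholders, and the method shapes simply mirror the existing init/isRequired pair:

```scala
// Hypothetical internal types; names are illustrative only.
trait XskipperExpression // engine-neutral expression tree (a fuller sketch follows below)
trait MetadataFilterFactory
trait ClauseTranslator

/** Minimal description of a data file, independent of any engine's file listing API. */
case class FileDescriptor(path: String, length: Long, modificationTime: Long)

/**
 * Sketch of an engine-agnostic skipping filter: no SparkSession in the
 * constructor, state is set up in init and queried per file in isRequired.
 */
trait EngineAgnosticFileFilter {
  /** Initializes the filter state (e.g. resolves the indexed and required files). */
  def init(filters: Seq[XskipperExpression],
           metadataFilterFactories: Seq[MetadataFilterFactory] = Seq.empty,
           clauseTranslators: Seq[ClauseTranslator] = Seq.empty): Unit

  /** Returns false if the given file can safely be skipped. */
  def isRequired(file: FileDescriptor): Boolean
}
```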
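For the Expression Tree representation, a minimal sketch of an internal expression tree together with a partial translation from Spark's Expression; the case class names are hypothetical and only a couple of node types are covered, just to show the shape of the change:

```scala
import org.apache.spark.sql.catalyst.expressions.{And => SparkAnd, Attribute,
  EqualTo => SparkEqualTo, Expression => SparkExpression, Literal => SparkLiteral}

// Hypothetical engine-neutral expression tree.
sealed trait XskipperExpression
case class ColumnRef(name: String) extends XskipperExpression
case class LiteralValue(value: Any) extends XskipperExpression
case class EqualTo(left: XskipperExpression, right: XskipperExpression) extends XskipperExpression
case class And(left: XskipperExpression, right: XskipperExpression) extends XskipperExpression

object SparkExpressionTranslator {
  /** Translates the subset of Spark expressions we handle; None means "not supported". */
  def translate(expr: SparkExpression): Option[XskipperExpression] = expr match {
    case a: Attribute       => Some(ColumnRef(a.name))
    case l: SparkLiteral    => Some(LiteralValue(l.value))
    case SparkEqualTo(l, r) => for (lt <- translate(l); rt <- translate(r)) yield EqualTo(lt, rt)
    case SparkAnd(l, r)     => for (lt <- translate(l); rt <- translate(r)) yield And(lt, rt)
    case _                  => None // unsupported nodes simply disable skipping for this clause
  }
}
```

Each integration (Iceberg, etc.) would provide its own translator into these classes, so the filtering logic itself never sees an engine-specific expression type.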

In addition, we use Spark's Logging trait for log writing in the code; this dependency should be eliminated as well (for example, by logging through SLF4J directly, as sketched below).
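
As an illustration, a class that currently extends Spark's Logging trait could hold a plain SLF4J logger instead (the class name below is hypothetical):

```scala
import org.slf4j.LoggerFactory

class SomeXskipperComponent {
  // Direct SLF4J logger instead of extending Spark's internal Logging trait.
  private val logger = LoggerFactory.getLogger(this.getClass)

  def refresh(): Unit = {
    logger.info("Refreshing index metadata")
  }
}
```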

guykhazma · Mar 03 '21 21:03