xskipper

An Extensible Data Skipping Framework

Results 11 xskipper issues

Need to verify why the log says `Dataset is not indexed => no skipping`. This log line appears with both `enableDynamicDataSkipping` and `disableDynamicDataSkipping`, despite the dataset being indexed,...

### What changes are proposed in this pull request?
Compile against Spark 3.4.1
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

Need to verify whether the metadata is being read twice when using datasource v2. The tests currently handle this by multiplying the expected number of files by 2...
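The workaround described above can be sketched as follows. This is a minimal illustration under stated assumptions, not xskipper's actual test code: the helper name `expected_files_scanned` and the `metadata_reads` parameter are hypothetical.

```python
def expected_files_scanned(num_files: int, metadata_reads: int = 1) -> int:
    """Return the file count a test would expect to see in the statistics.

    `metadata_reads` is a hypothetical knob: 1 when the metadata is read
    once, 2 if the datasource v2 path does read the metadata twice.
    """
    if num_files < 0 or metadata_reads < 1:
        raise ValueError("num_files must be >= 0 and metadata_reads >= 1")
    return num_files * metadata_reads
```

If the double read turns out not to happen, a test written this way only needs `metadata_reads` changed back to 1 rather than every expected count edited by hand.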

Currently the catalog table tests run only for the Parquet format. We should add tests that run for CSV, JSON, Avro, etc.

This issue will track the progress of decoupling xskipper from Spark. Xskipper currently works solely with Spark and cannot be used out of the box with other engines. This...

[SPARK-7768](https://issues.apache.org/jira/browse/SPARK-7768?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22udt%20public%22) finally made the UDT API public. We can use it instead of the workaround we have relied on so far, which uses `ParquetMetadataStoreUDTRegistrator` to expose the `UDTRegistration` class.

The current API does not allow adding a new index without first dropping the existing indexes and then collecting all of the indexes again. This API will be useful for...
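The difference between the two behaviours can be illustrated with a toy model. This is only a sketch: `ToyIndexStore`, `rebuild_with`, and `add_index` are hypothetical names invented for the illustration and are not part of the xskipper API.

```python
from typing import Callable, Dict, List

# A "collector" computes one index's metadata for a single file.
Collector = Callable[[str], object]


class ToyIndexStore:
    """Hypothetical in-memory stand-in for an index metadata store."""

    def __init__(self, files: List[str]) -> None:
        self.files = list(files)
        self.collectors: Dict[str, Collector] = {}
        # metadata[index_name][file_name] -> collected value
        self.metadata: Dict[str, Dict[str, object]] = {}
        self.collections = 0  # number of per-file collection calls so far

    def _collect(self, name: str) -> None:
        # Run one index's collector over every file.
        self.metadata[name] = {}
        for f in self.files:
            self.metadata[name][f] = self.collectors[name](f)
            self.collections += 1

    def rebuild_with(self, name: str, collector: Collector) -> None:
        """Behaviour described in the issue: drop all, then re-collect all."""
        self.collectors[name] = collector
        self.metadata.clear()
        for existing in list(self.collectors):
            self._collect(existing)

    def add_index(self, name: str, collector: Collector) -> None:
        """Requested behaviour: collect metadata only for the new index."""
        if name in self.collectors:
            raise ValueError(f"index {name!r} already exists")
        self.collectors[name] = collector
        self._collect(name)
```

In this model, adding an index with `add_index` costs one collection pass over the files, while `rebuild_with` costs one pass per index that exists after the call, so its cost grows with the number of indexes already defined.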

[Hudi's Copy-on-Write](https://hudi.apache.org/docs/querying_data.html#spark-datasource) optimized mode uses Spark's standard Parquet read code path. This means we should be able to support it, as we already support skipping over Parquet datasets...

The [Delta Lake](https://delta.io) implementation uses a custom `FileIndex` implementation called [`TahoeFileIndex`](https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/files/TahoeFileIndex.scala). Therefore, we can support skipping for Delta Lake in much the same way that we support skipping for the built...

An initial proposal for supporting Iceberg is detailed [here](https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit#).