Support for `predicateHints` option in `delta_sharing.load_as_spark`

Open ycrouin opened this issue 2 years ago • 4 comments

Hello team :wave:

From my understanding, the `predicateHints` option lets a table query filter on its S3 partitions so that the server returns fewer signed URLs: https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#read-data-from-a-table

Right now, we are loading our data through a Spark job using the `delta_sharing.load_as_spark(url: str, version: Optional[int] = None) -> "PySparkDataFrame"` function. However, we don't see any argument for the `predicateHints` option.

Are you planning to add this option? If not, what should we do? I was thinking about making the POST query to the Delta Sharing server ourselves and then passing the list of S3 signed URLs from the response to `spark.read.parquet`. Would that work?
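For reference, a minimal sketch of the manual approach described above, assuming the query endpoint and the newline-delimited JSON response described in the linked PROTOCOL.md. The helper names `build_query_request` and `extract_file_urls` are hypothetical and not part of the `delta_sharing` package; the endpoint, share, schema, and table values are placeholders.

```python
import json
from typing import List, Optional, Sequence, Tuple


def build_query_request(
    endpoint: str,
    share: str,
    schema: str,
    table: str,
    predicate_hints: Sequence[str],
    limit_hint: Optional[int] = None,
) -> Tuple[str, str]:
    """Return (url, json_body) for the protocol's table query call.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    url = (
        f"{endpoint.rstrip('/')}/shares/{share}"
        f"/schemas/{schema}/tables/{table}/query"
    )
    # predicateHints is a list of SQL boolean expressions as strings.
    body = {"predicateHints": list(predicate_hints)}
    if limit_hint is not None:
        body["limitHint"] = limit_hint
    return url, json.dumps(body)


def extract_file_urls(ndjson_text: str) -> List[str]:
    """Collect the pre-signed URLs from a newline-delimited JSON response."""
    urls = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between entries
        entry = json.loads(line)
        if "file" in entry:  # skip the "protocol" and "metaData" lines
            urls.append(entry["file"]["url"])
    return urls
```

The request would be sent as a POST with the profile's bearer token in the `Authorization` header, and the resulting URLs handed to `spark.read.parquet`. Whether Spark can read the pre-signed HTTPS URLs directly depends on the cluster setup. The hints are best-effort on the server side, so the same filter should still be applied to the loaded data.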

ycrouin · Jun 30 '22 16:06

Hi @ycrouin, sorry for the late response.

Do you still need this feature? We could prioritize it in Q4, making it an argument in the Spark query that is passed from the client to the server.

Also, making the POST query yourselves should work. Have you tried that?

linzhou-db · Oct 21 '22 22:10

Hello @linzhou-db, thanks for getting back to me! No, we don't really need this feature anymore, but it might be useful to others.

In the end, we were hitting timeouts with the Delta Sharing POST query, probably because of the large number of Parquet files to pre-sign (up to millions). To work around this, we're now using Delta Core on Spark directly, with S3 cross-account permissions.

ycrouin · Oct 23 '22 17:10

I would like support for `predicateHints` in the `load_as_pandas` function. Can it be done? As of now, `load_as_pandas` loads the whole table, and there is no way to pass a filter.

desingho · Jan 19 '23 11:01

@desingho I figured this out recently:

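```python
import delta_sharing

# Construction of the REST client was left elided in the original post.
rest_client = ...

# Identify the shared table to read.
table = delta_sharing.Table(
  share='share-name',
  schema='schema-name',
  name='table-name')

# SQL-style predicates sent to the server as filtering hints.
predicateHints = ('a >= 7', 'a < 10')

# Build the reader directly so that predicateHints can be passed through.
data = delta_sharing.delta_sharing.DeltaSharingReader(
  table=table, predicateHints=predicateHints, rest_client=rest_client,
).to_pandas()
```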

jacobmarble · Mar 09 '23 00:03