delta-sharing
delta-sharing copied to clipboard
Support for `predicateHints` option in `delta_sharing.load_as_spark`
Hello team :wave:
From my understanding, the predicateHints
option allows querying a table using S3 partitions as filters https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#read-data-from-a-table to retrieve fewer signed URLs.
Right now, we are loading our data through a Spark job using the delta_sharing.load_as_spark(url: str, version: Optional[int] = None) -> "PySparkDataFrame"
function. However we don't see any argument for the predicateHints
option.
Are you planning to add this option? If not, what should we do? I was thinking about making the POST query to the DeltaShare server ourselves and then providing the list of S3 signed URLs from the response to spark.read.parquet
. Would that work?
Hi @ycrouin , sorry for the late response.
Do you still need this feature, we could prioritize this in Q4, to make it an argument in the spark query, and pass it from the client to the server.
Also making the POST query your selves should work, have you tried that?
Hello @linzhou-db, thanks for reaching back to me! No, we don't really need this feature anymore, but it might be useful to others.
Eventually, we were facing timeouts with the Delta Sharing POST query, probably because of the large number of parquet files to pre-sign (up to millions). To overcome this issue, we're now using Delta Core on Spark directly, with s3 cross-account permissions.
I want support for predicateHints on load_as_pandas function, can it be done ? As of now load_as_pandas will load whole table, there is no way to pass filter.
@desingho I figured this out recently:
import delta_sharing
rest_client = ...
table = delta_sharing.Table(
share='share-name',
schema='schema-name',
name='table-name')
predicateHints = ('a >= 7', 'a < 10')
data = delta_sharing.delta_sharing.DeltaSharingReader(
table=table, predicateHints=predicateHints, rest_client=rest_client,
).to_pandas()