
Protocol spec - Read table data can have large responses for tables with many files.

Open rustyconover opened this issue 2 years ago • 4 comments

In the Delta sharing protocol spec there doesn't appear to be any pagination for the:

{prefix}/shares/{share}/schemas/{schema}/tables/{table}/query

API endpoint.

I have some tables that contain 10,000+ files, so if no predicate conditions are applied the API response will be very large. If per-file statistics are returned, the total response size could be tens of megabytes.

Responses larger than 6 MB can be a problem for various API gateways and AWS Lambda functions.
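To get a rough feel for how quickly an unpaginated response passes that limit, here is a minimal sketch that serializes synthetic newline-delimited JSON file entries and sums their size. Every value in it is made up for illustration (the URL, the signature length, the column count, and the statistics values are all assumptions), and the helper names `make_file_line` and `estimate_response_bytes` are hypothetical. Real entries will be larger or smaller depending on the table.

```python
import json

# Synchronous AWS Lambda response payload limit (6 MB), per the linked AWS docs.
LAMBDA_RESPONSE_LIMIT = 6 * 1024 * 1024


def make_file_line(i, stats_columns=20):
    """Build one synthetic 'file' line; all values are placeholders."""
    stats = {
        "numRecords": 100000,
        "minValues": {f"col{c}": 0 for c in range(stats_columns)},
        "maxValues": {f"col{c}": 10**9 for c in range(stats_columns)},
        "nullCount": {f"col{c}": 0 for c in range(stats_columns)},
    }
    entry = {
        "file": {
            # Assumed shape of a pre-signed URL; the signature is padding.
            "url": f"https://example.com/bucket/part-{i:05d}.parquet?sig=" + "x" * 200,
            "id": f"{i:032x}",
            "partitionValues": {},
            "size": 1_000_000,
            # Statistics carried as a JSON-encoded string.
            "stats": json.dumps(stats),
        }
    }
    return json.dumps(entry) + "\n"


def estimate_response_bytes(num_files, stats_columns=20):
    """Sum the UTF-8 size of num_files synthetic file lines."""
    return sum(
        len(make_file_line(i, stats_columns).encode("utf-8"))
        for i in range(num_files)
    )


if __name__ == "__main__":
    size = estimate_response_bytes(10_000)
    print(f"{size / 1024 / 1024:.1f} MB, over limit: {size > LAMBDA_RESPONSE_LIMIT}")
```

With these assumed entry sizes, 10,000 files already exceed the limit, and wider tables with more statistics columns would push the total higher.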

Would you please consider adding pagination to this API method, so that Delta Sharing responses could be processed efficiently without single large responses to API calls?
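As a minimal sketch of what such pagination could look like, assuming a page-size limit and an opaque continuation token: the names `max_files`, `page_token`, `paginate_files`, and `fetch_all` are hypothetical and are not taken from the protocol spec. The token here simply encodes an offset into the file list; a real server would probably also need to pin the table version so that pages stay consistent.

```python
import base64
import json


def paginate_files(files, max_files, page_token=None):
    """Return (page, next_page_token) for a list of file entries.

    The token is an opaque, URL-safe string that encodes the next offset.
    next_page_token is None when there are no more pages.
    """
    if max_files <= 0:
        raise ValueError("max_files must be positive")

    offset = 0
    if page_token is not None:
        offset = json.loads(base64.urlsafe_b64decode(page_token))["offset"]

    page = files[offset:offset + max_files]
    next_offset = offset + len(page)
    if next_offset >= len(files):
        return page, None

    token = base64.urlsafe_b64encode(
        json.dumps({"offset": next_offset}).encode("utf-8")
    ).decode("ascii")
    return page, token


def fetch_all(files, max_files):
    """Client-side loop: keep requesting pages until no token is returned."""
    collected = []
    token = None
    while True:
        page, token = paginate_files(files, max_files, token)
        collected.extend(page)
        if token is None:
            return collected
```

Each response then stays bounded by `max_files` entries, which lets a server behind a size-limited gateway choose a page size that fits under its limit.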

rustyconover avatar Feb 14 '23 00:02 rustyconover

Hi @rustyconover, yes, pagination is on our roadmap. Feel free to make changes to the open source server if you are interested.

Where does the "6 MB" come from, though? It's quite small.

linzhou-db avatar Feb 15 '23 00:02 linzhou-db

Hi, @linzhou-db

6 MB is the maximum response payload size of a synchronously invoked AWS Lambda function.

rustyconover avatar Feb 15 '23 03:02 rustyconover

You can see the limits here:

https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html

rustyconover avatar Feb 15 '23 03:02 rustyconover

The AWS Lambda limit is a good one to know. Thanks @rustyconover.

In InfluxDB, the quantity of Parquet files required to satisfy a query varies widely.

  • the write pipeline creates many tiny (<100 KB) Parquet files
  • the compaction pipeline rewrites that data as fewer, larger (>1 GB) Parquet files
  • most queries are served by a mix of both
  • occasionally the compaction pipeline gets backed up, which has obvious consequences on that mix

jacobmarble avatar Mar 07 '23 17:03 jacobmarble