
Support for sampleHint

Open: polsm91 opened this issue 3 years ago • 1 comment

The proposal is to add the sampling pushdown to the protocol.

Context

The /query endpoint reduces the number of files that are exposed to and fetched by a client, which cuts data transfer and speeds up data loading. The protocol describes two mechanisms for this:

  1. Predicate Hints: Allows filtering which files are needed taking advantage of the data layout, e.g., if data is partitioned, we may access only a few partitions.
  2. Limit Hints: Thanks to the delta log, the backend knows how many records each file contains and can return only the files needed to satisfy the limit.

Both are hints, meaning the server may not fully satisfy them, so the client may still need to apply the filter or the limit in memory.
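
To make the limit hint concrete, here is a minimal sketch of how a backend might pick files and how a client would still truncate locally. The function names and the flat `numRecords` key on each file entry are assumptions made for illustration; they are not taken from the protocol text.

```python
from typing import Dict, List


def select_files_for_limit(files: List[Dict], limit: int) -> List[Dict]:
    """Return a prefix of `files` whose record counts add up to at least `limit`."""
    selected, total = [], 0
    for f in files:
        if total >= limit:
            break
        selected.append(f)
        total += f["numRecords"]  # assumed per-file record count from the log
    return selected


def apply_limit_in_memory(rows: List[Dict], limit: int) -> List[Dict]:
    """Client-side fallback: the server may return extra rows, so truncate."""
    return rows[:limit]


files = [
    {"path": "part-0.parquet", "numRecords": 600},
    {"path": "part-1.parquet", "numRecords": 600},
    {"path": "part-2.parquet", "numRecords": 600},
]
picked = select_files_for_limit(files, 1000)
print([f["path"] for f in picked])
```

With a limit of 1000, the first two files already hold 1200 records, so the third is never returned; the client then trims the 1200 rows down to 1000 itself.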

Proposal

Another way to reduce the number of files to be read is to introduce sampling pushdown. The idea is that if the client wants to access only a sample of the data, the sampling predicate is sent to the backend, which then filters which files need to be accessed.

My proposal is to extend the protocol with an optional predicate on the query endpoint called "sampleHint" which would look like this:

{
  "predicateHints": [
    "date >= '2021-01-01'",
    "date <= '2021-01-31'"
  ],
  "limitHint": 1000,
  "sampleHint": 0.30,
  "version": 123
}

The server would try to select the minimum number of files needed to satisfy the request. The sampleHint would be a double in the range 0 to 1.0, as is commonly done in the ecosystem.
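
One possible reading of "minimum number of files" can be sketched as follows. This is only an illustration under stated assumptions: the proposal does not fix a selection algorithm, and the function name, the `numRecords` key, and the largest-files-first strategy are all hypothetical choices.

```python
from typing import Dict, List


def select_files_for_sample(files: List[Dict], sample_hint: float) -> List[Dict]:
    """Pick the fewest files whose records cover `sample_hint` of the table."""
    if not 0.0 <= sample_hint <= 1.0:
        raise ValueError("sampleHint must be between 0 and 1.0")
    total = sum(f["numRecords"] for f in files)
    target = sample_hint * total
    selected, covered = [], 0
    # Taking the largest files first reaches the target with the fewest files.
    for f in sorted(files, key=lambda f: f["numRecords"], reverse=True):
        if covered >= target:
            break
        selected.append(f)
        covered += f["numRecords"]
    return selected


files = [
    {"path": "part-a.parquet", "numRecords": 500},
    {"path": "part-b.parquet", "numRecords": 300},
    {"path": "part-c.parquet", "numRecords": 200},
]
picked = select_files_for_sample(files, 0.30)
print([f["path"] for f in picked])
```

Selecting whole files is not the same as sampling rows uniformly, so a client that needs a statistically fair sample would still have to sample the returned rows itself, which is consistent with the field being only a hint.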

Order of operations

sampleHint is applied first, then predicateHints, and finally limitHint.
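
Because all three fields are hints, a client may end up re-applying them in memory. The sketch below follows the proposed order (sample, then predicates, then limit). The function name, the seeded per-row Bernoulli sampling, and the callable predicate are assumptions for illustration; the proposal does not specify a sampling method.

```python
import random
from typing import Callable, Dict, List, Optional


def apply_hints_in_memory(
    rows: List[Dict],
    sample_hint: Optional[float] = None,
    predicate: Optional[Callable[[Dict], bool]] = None,
    limit_hint: Optional[int] = None,
    seed: int = 0,
) -> List[Dict]:
    """Apply sample, then predicate, then limit, in the proposed order."""
    rng = random.Random(seed)  # seeded so repeated runs give the same sample
    out = rows
    if sample_hint is not None:
        # Keep each row independently with probability `sample_hint`.
        out = [r for r in out if rng.random() < sample_hint]
    if predicate is not None:
        out = [r for r in out if predicate(r)]
    if limit_hint is not None:
        out = out[:limit_hint]
    return out


rows = [{"date": f"2021-01-{d:02d}"} for d in range(1, 32)]
rows += [{"date": f"2021-02-{d:02d}"} for d in range(1, 29)]


def in_january(row: Dict) -> bool:
    return "2021-01-01" <= row["date"] <= "2021-01-31"


result = apply_hints_in_memory(rows, sample_hint=0.30, predicate=in_january, limit_hint=5)
print(len(result))
```

Sampling before filtering means the fraction refers to the whole table rather than to the rows that match the predicates, so the two orders generally return different result sizes.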

polsm91, Jun 15 '22 15:06

@polsm91 Sorry for the late response.

So could you share a specific use case where limitHint would not work and sampleHint could help? I feel sampleHint could satisfy most cases?

linzhou-db, Oct 21 '22 22:10