deltaSharing Spark source doesn't use statistics for count operation

Open alexott opened this issue 3 years ago • 3 comments

The simple code that I'm using to check the sizes of the shares times out because it tries to load the data, although this kind of operation could be answered directly from the stats returned when we query the table:

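import delta_sharing

# `spark` is assumed to be an active SparkSession (e.g. in a notebook) with the deltaSharing source available.
profile_file = "https://github.com/delta-io/delta-sharing/raw/main/examples/open-datasets.share"
client = delta_sharing.SharingClient(profile_file)
all_tables = client.list_all_tables()

for tbl in all_tables:
    # Table URL format: <profile-file>#<share>.<schema>.<table>
    table_name = f"{tbl.share}.{tbl.schema}.{tbl.name}"
    table_url = f"{profile_file}#{table_name}"
    print(f"Going to read {table_name}")
    df = spark.read.format("deltaSharing").load(table_url)
    # count() reads the shared data files rather than the stats, which is the step that times out.
    print(f"{table_name} count: {df.count()}")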
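
To illustrate what answering the count from stats could look like, here is a minimal sketch, not the connector's actual implementation. It assumes the table-query response is newline-delimited JSON in which each `file` line may carry a `stats` JSON string containing `numRecords`. The function name `count_from_stats` and the sample lines are hypothetical. Because stats can be missing, the sketch returns `None` in that case so a caller could fall back to a full scan. It also only makes sense for an unfiltered count.

```python
import json
from typing import Iterable, Optional


def count_from_stats(ndjson_lines: Iterable[str]) -> Optional[int]:
    """Sum numRecords over the file lines of a table-query response.

    Hypothetical helper: returns None as soon as a file line has no usable
    stats, so the caller can fall back to actually reading the data.
    """
    total = 0
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        file_action = action.get("file")
        if file_action is None:
            # Not a file line (e.g. protocol or metaData); nothing to count.
            continue
        stats_raw = file_action.get("stats")
        if not stats_raw:
            return None
        num_records = json.loads(stats_raw).get("numRecords")
        if num_records is None:
            return None
        total += num_records
    return total


# Made-up response lines, only to show the assumed shape.
sample_lines = [
    json.dumps({"protocol": {"minReaderVersion": 1}}),
    json.dumps({"metaData": {"id": "hypothetical-table-id"}}),
    json.dumps({"file": {"id": "a", "stats": json.dumps({"numRecords": 3})}}),
    json.dumps({"file": {"id": "b", "stats": json.dumps({"numRecords": 4})}}),
]
print(count_from_stats(sample_lines))
```

Returning `None` instead of a partial sum keeps the result from silently undercounting when some files were written without stats.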

alexott, Apr 02 '22 12:04

@alexott Sorry for the late response. Are you suggesting that we override df.count() to use stats.numRecords instead of loading the full table?

linzhou-db, Oct 24 '22 23:10

Yes, why not reuse the metadata instead of crunching files...

alexott, Oct 25 '22 06:10

Well, we could, if it's common for a lot of users to care only about the count. Two things I'm not quite sure about: a) whether numRecords is always accurate, b) how to override the count method.

Are you willing to send out a PR for this change?

linzhou-db, Oct 25 '22 22:10