delta-sharing deltaSharing Spark source doesn't use statistics for count operation

The simple code that I'm using to check the sizes of the shares times out because it tries to load the data, although this kind of operation could be answered directly from the stats returned when we query the table.

import delta_sharing
profile_file = "https://github.com/delta-io/delta-sharing/raw/main/examples/open-datasets.share"
client = delta_sharing.SharingClient(profile_file)
all_tables = client.list_all_tables()

for tbl in all_tables:
    table_name = f"{tbl.share}.{tbl.schema}.{tbl.name}"
    table_url = f"{profile_file}#{table_name}"
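    print(f"Going to read {table_name}")
    # `spark` is the active SparkSession (predefined in notebook environments)
    df = spark.read.format("deltaSharing").load(table_url)
    # count() currently reads the data files instead of using the per-file stats
    print(f"{table_name} count: {df.count()}")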

Apr 02 '22 12:04 alexott

@alexott Sorry for the late response. Are you suggesting that we override df.count() to use stats.numRecords instead of loading the full table?

Oct 24 '22 23:10 linzhou-db

Yes, why not reuse metadata instead of crunching files...

Oct 25 '22 06:10 alexott

Well, we could, if it's common for a lot of users to care only about the count. There are two things I'm not quite sure about: a) whether numRecords is always accurate, and b) how to override the count method.
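For a), one cautious option would be to use the stats only when every file has them and to fall back to a normal scan otherwise. A rough sketch of that idea follows. The function name `count_from_stats` is hypothetical, and it assumes each file action is available as a dict whose optional `stats` field is a JSON string containing `numRecords`. It does not cover b), and it ignores filters and limits, so treat it as an illustration rather than a proposed implementation.

```python
import json
from typing import Iterable, Optional


def count_from_stats(file_actions: Iterable[dict]) -> Optional[int]:
    """Sum numRecords across file actions.

    Returns None as soon as one file has missing or unusable stats,
    so the caller knows it must fall back to scanning the data.
    """
    total = 0
    for action in file_actions:
        raw_stats = action.get("stats")
        if not raw_stats:
            return None  # stats are optional, so this file tells us nothing
        try:
            num_records = json.loads(raw_stats).get("numRecords")
        except (ValueError, TypeError, AttributeError):
            return None  # malformed stats: do not trust them
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(num_records, bool) or not isinstance(num_records, int):
            return None
        total += num_records
    return total
```

A caller could then do something like `n = count_from_stats(files)` and run the regular `df.count()` only when `n` is `None`.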

Are you willing to send out a PR for this change?

Oct 25 '22 22:10 linzhou-db