deltaSharing Spark source doesn't use statistics for count operation
The simple code that I'm using to check the sizes of the shares times out because it tries to load the data, although that kind of operation could be answered directly from the stats returned when the table is queried:
import delta_sharing
profile_file = "https://github.com/delta-io/delta-sharing/raw/main/examples/open-datasets.share"
client = delta_sharing.SharingClient(profile_file)
all_tables = client.list_all_tables()
for tbl in all_tables:
    table_name = f"{tbl.share}.{tbl.schema}.{tbl.name}"
    table_url = f"{profile_file}#{table_name}"
    print(f"Going to read {table_name}")
    # `spark` is an existing SparkSession (e.g. in a notebook);
    # count() triggers a full scan of the shared table
    df = spark.read.format("deltaSharing").load(table_url)
    print(f"{table_name} count: {df.count()}")
@alexott Sorry for the late response.
Are you suggesting we override df.count() to leverage stats.numRecords instead of loading the full table?
Yes, why not reuse metadata instead of crunching files...
Well, we could, if it's common for a lot of users to care only about the count. There are two things I'm not sure about: a) whether numRecords is always accurate, and b) how to override the count method.
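On (a), a cheap one-off sanity check is to compare the metadata-derived count with a real scan, reusing the hypothetical stats_count helper sketched above (expensive, so only worth running as a validation pass):
for tbl in all_tables:
    meta_count = stats_count(tbl.share, tbl.schema, tbl.name)
    if meta_count is None:
        continue  # a file lacked stats, so there is nothing to validate
    table_url = f"{profile_file}#{tbl.share}.{tbl.schema}.{tbl.name}"
    scan_count = spark.read.format("deltaSharing").load(table_url).count()
    print(f"{tbl.name}: stats={meta_count}, scan={scan_count}, match={meta_count == scan_count}")
As for (b), short-circuiting count() itself would have to happen on the connector side, e.g. through Spark's aggregate pushdown for DataSource V2 sources (SupportsPushDownAggregates, available since Spark 3.2), assuming the connector can be moved onto that API.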
Are you willing to send out a PR for this change?