datachain icon indicating copy to clipboard operation
datachain copied to clipboard

dataset stats: include table size

Open dmpetrov opened this issue 1 year ago • 4 comments

We are storing number of records and file sizes to dataset records which is convenient to have. In addition to this, it would be very. convenient to have size of physical table for a dataset. It's needed because in some cases a table can take 90Gb (laion with 12M files).

dmpetrov avatar Aug 11 '24 20:08 dmpetrov

@iterative/datachain I'd appreciate your opinion on this. How realistic it is to get this information from the real DBs.

dmpetrov avatar Aug 11 '24 20:08 dmpetrov

How realistic it is to get this information from the real DBs.

Easy.

SQLite

We can use dbstat to get table size:

sqlite> SELECT SUM("pgsize") FROM "dbstat" WHERE name='ds_xxx';
5980180480
sqlite>

Dbstat is only available when SQLite is built using the SQLITE_ENABLE_DBSTAT_VTAB compile-time option. We can check if SQLITE_ENABLE_DBSTAT_VTAB is enabled using PRAGMA compile_options:

sqlite> PRAGMA compile_options;
...
ENABLE_DBSTAT_VTAB
...
sqlite>

ClickHouse

4d6e8a4704d9 :) SELECT sum(bytes) as size FROM system.parts WHERE table = 'ds_xxx';

SELECT sum(bytes) AS size
FROM system.parts
WHERE `table` = 'ds_xxx'

Query id: a8595e26-a522-4d7a-8155-bb3ba4b1c68a

   ┌────────size─┐
1. │ 10973571294 │ -- 10.97 billion
   └─────────────┘

1 row in set. Elapsed: 0.009 sec.

4d6e8a4704d9 :)

dreadatour avatar Aug 12 '24 13:08 dreadatour

how this will look like in the dataset schemas that user will get in API? You recently introduced these, right?

dmpetrov avatar Aug 12 '24 18:08 dmpetrov

So we do have num_objects and size fields in dataset_version table (source code). Here we have a method to update these fields. And here is the code for getting these stats from the DB.

Steps to implement physical table size is:

  1. Update dataset_version table and add size_bytes field (name is questionable). Optional is to have json field (stats?) with all stats in one field (similar to preview field).
  2. Update other methods with respect to new dataset stat field with respect to DB (implementation will be different).

Sounds quite easy to me.

dreadatour avatar Aug 13 '24 04:08 dreadatour