dataset stats: include table size
We are storing number of records and file sizes to dataset records which is convenient to have. In addition to this, it would be very. convenient to have size of physical table for a dataset. It's needed because in some cases a table can take 90Gb (laion with 12M files).
@iterative/datachain I'd appreciate your opinion on this. How realistic it is to get this information from the real DBs.
How realistic it is to get this information from the real DBs.
Easy.
SQLite
We can use dbstat to get table size:
sqlite> SELECT SUM("pgsize") FROM "dbstat" WHERE name='ds_xxx';
5980180480
sqlite>
Dbstat is only available when SQLite is built using the SQLITE_ENABLE_DBSTAT_VTAB compile-time option. We can check if SQLITE_ENABLE_DBSTAT_VTAB is enabled using PRAGMA compile_options:
sqlite> PRAGMA compile_options;
...
ENABLE_DBSTAT_VTAB
...
sqlite>
ClickHouse
4d6e8a4704d9 :) SELECT sum(bytes) as size FROM system.parts WHERE table = 'ds_xxx';
SELECT sum(bytes) AS size
FROM system.parts
WHERE `table` = 'ds_xxx'
Query id: a8595e26-a522-4d7a-8155-bb3ba4b1c68a
┌────────size─┐
1. │ 10973571294 │ -- 10.97 billion
└─────────────┘
1 row in set. Elapsed: 0.009 sec.
4d6e8a4704d9 :)
how this will look like in the dataset schemas that user will get in API? You recently introduced these, right?
So we do have num_objects and size fields in dataset_version table (source code). Here we have a method to update these fields. And here is the code for getting these stats from the DB.
Steps to implement physical table size is:
- Update
dataset_versiontable and addsize_bytesfield (name is questionable). Optional is to havejsonfield (stats?) with all stats in one field (similar topreviewfield). - Update other methods with respect to new dataset stat field with respect to DB (implementation will be different).
Sounds quite easy to me.