Improve Citus table size functions
Users currently have two sets of size functions:
- `pg_*_size`
- `citus_*_size`
The PostgreSQL functions give incorrect sizes for distributed tables & indexes on distributed tables. The Citus functions give errors for non-distributed tables & indexes.
We should fix the Citus size functions to handle both local & distributed tables as well as indexes.
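To illustrate the current split in behavior, a minimal sketch (function names as documented; the exact error message and version-specific behavior may vary):

```sql
-- A local (non-distributed) table and a distributed one:
CREATE TABLE local_events (id bigint, payload text);
CREATE TABLE dist_events (id bigint, payload text);
SELECT create_distributed_table('dist_events', 'id');

-- Works for both, but for dist_events it only measures the (empty)
-- coordinator-side shell table, not the shards on the workers:
SELECT pg_total_relation_size('dist_events');

-- Reports the real distributed size, but errors out for local_events
-- because it is not a Citus table:
SELECT citus_total_relation_size('dist_events');
SELECT citus_total_relation_size('local_events');  -- ERROR
```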
Is it expected to be in the Citus 12.1 release, or in 12.2 or later? Thanks.
I am working on this issue and I have some questions.
If I'm correct, while moving shards for example, it's possible that some shards exist at the same time on distinct workers, from a disk-usage point of view. Here the PostgreSQL design is to report "physical size", not "logical size".
- So do we want the `citus_*_size` functions to return the disk space used by active shards only, or also the disk space used by shards which are moving/pending deletion/"to be valid in the near future"?
My temptation is to have some `int64 citus_local_relation_size(Oid)` which is called on each worker and summed up in `int64 citus_relation_size(Oid)`. This way each worker inspects the related shards of a relation itself. I confess I don't know (yet) if information about "valid"/"not valid" is available on the worker node when needed...
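To make the shape of that concrete, a rough SQL-level sketch of the same idea (`citus_local_relation_size` is the hypothetical per-worker function from above; the actual proposal would implement both functions in C):

```sql
-- Hypothetical coordinator-side aggregation: ask every node for the
-- on-disk size of the shards it holds for the relation, then sum.
SELECT sum(result::bigint) AS citus_relation_size
FROM run_command_on_all_nodes(
       $$SELECT citus_local_relation_size('my_table'::regclass)$$)
WHERE success;
```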
I think that's a fair argument, though there are some challenges.
For shard moves, you could perhaps (pessimistically) get the size of the shard from every node. Splits are trickier since the shard IDs are not in the metadata.
Perhaps there should be size functions for pg_dist_cleanup relation records.
From a user point of view, we could just wrap all those size function calls in a `run_command_on_all_shards()`.
So why not just remove all the C code and write simple SQL UDFs? Apparently the current code is less efficient than a "run on all" approach (because it's serialized).
We might still need a C version for some internal requirements, but this should probably be revisited anyway (there are some reports about incorrect size estimation for shard moves/splits...)
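A minimal sketch of what such a SQL UDF could look like, built on the existing `run_command_on_shards()` helper, which substitutes `%s` with each shard's qualified name and runs the command on a node hosting that shard (the wrapper's name is hypothetical, and this ignores the partitioning and temporary-shard concerns discussed elsewhere in this thread):

```sql
CREATE OR REPLACE FUNCTION citus_relation_size_sql(rel regclass)
RETURNS bigint
LANGUAGE sql
AS $$
    -- One query per shard; sum the per-shard sizes reported by the workers.
    SELECT sum(result::bigint)
    FROM run_command_on_shards(rel, $cmd$SELECT pg_relation_size('%s')$cmd$)
    WHERE success;
$$;
```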
In general we try to avoid adding logic in SQL functions, because any bugfixes are much more work to backport to old versions.
A downside of using `run_command_on_shards` for this specific case is that it will create many queries, one for each shard. And that can be a lot for tables that are both distributed and use Postgres partitioning. The queries we generate to get the size these days have been optimized over multiple iterations and are now quite fast in all known cases. I do agree that it would be good to parallelize executing them, though.
Regarding incorrectness: looking at the code, the current implementation indeed ignores any "temporary" shards created during shard moves/splits when calculating the size. For reference, this piece of code makes sure the size query only includes shards to which a placement is assigned: https://github.com/citusdata/citus/blob/6801a1ed1ec901bb7bb1905aec8e43d2a7de0d8a/src/backend/distributed/metadata/metadata_utility.c#L617
However, I think that can be considered acceptable (and maybe even correct) behaviour. Postgres also has similar behaviour when DDL is incomplete. Some examples:
- When creating an index on an existing table in a transaction, the size of that index is only attributed to the table by pg_total_relation_size after the commit completes.
- When reindexing an index, a temporary index is created, and thus its size is not attributed to the original index (but it is attributed to the table).
From what I can tell, these functions in Postgres simply do whatever was simplest to implement. I think it's fine if we do the same.