Improvement ideas / backlog for single shard tables / tenant schema tables

Open onurctirtir opened this issue 2 years ago • 1 comments

Performance improvements that would be useful for both single-shard tables and tenant-schema tables:

[ ] Fix prepared statements / plan cache for single-shard tables.

DDL improvements that could provide a better user experience when creating / altering tenant-schema tables:

The following cannot be used to create a tenant table:
- [ ] CREATE TABLE tenant_546.users OF TYPE ..
- [ ] CREATE SCHEMA tenant_546 CREATE TABLE users ..
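
For reference, a minimal sketch of the statement forms in question. It assumes schema-based sharding is turned on through the `citus.enable_schema_based_sharding` setting. The `user_type` type, the `orders` table, the `tenant_547` schema and all column definitions are hypothetical placeholders for the parts elided above. Whether the last two statements error out or silently create a non-tenant table is not stated here, so the comments only repeat what the list says.

```sql
-- Assumption: schema-based sharding is enabled for this session.
SET citus.enable_schema_based_sharding TO on;

-- Hypothetical composite type, needed only for the typed-table form.
CREATE TYPE public.user_type AS (id bigint, name text);

-- Supported path: create the tenant schema, then use a plain CREATE TABLE.
CREATE SCHEMA tenant_546;
CREATE TABLE tenant_546.orders (id bigint, user_id bigint);

-- Per the list above, cannot be used to create a tenant table: typed table.
CREATE TABLE tenant_546.users OF public.user_type;

-- Per the list above, cannot be used either: table as a CREATE SCHEMA element.
-- (A second, hypothetical schema name is used because tenant_546 already exists.)
CREATE SCHEMA tenant_547 CREATE TABLE users (id bigint, name text);
```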

SQL / planner improvements that would be useful for both single-shard tables and tenant-schema tables:

[ ] (High priority) (Medium) Support INSERT with sublinks.
[ ] (High priority) (Medium) Support UPDATE with volatile functions.
[ ] "INSERT INTO single_shard_table SELECT .." cannot go through repartitioned insert-select. Though I'm not sure if this is easily doable because the code-path essentially expects the target table to have a shard key.
[ ] Planner unnecessarily decides that an outer join of the form < recurring_rel LEFT JOIN single_shard_table > would result in recurring tuples. This is not the case for single-shard tables, so we unnecessarily go through the recursive planner for such joins.
[ ] Support non-router MERGE with single-shard tables with / without distributed tables.

UX Improvements for tenant-schema tables that we might want to do depending on user feedback:

[ ] Allow having usual distributed / reference tables etc. in tenant schemas, via alter_distributed_table() / create_distributed_table() / create_distributed_table_concurrently() / create_reference_table() / undistribute_table() (somewhat bigger item).
[ ] Allow colocating tenant schemas.
[x] Support routing in pgbouncer based on search_path.
[ ] Enable foreign keys from reference tables to tenant tables without on update/delete cascade (should be fairly easy).

Technical / non-user-facing improvements for tenant-schema tables:

[ ] Generalize Citus local tables into a single-shard group that’s pinned to the coordinator (somewhat bigger item).
[ ] Evaluate if it's possible to combine ConvertNewTableIfNecessary() logic with Postprocess_CreateTable.

Operation improvements that are only about the single-shard tables (i.e., those that are not associated with a tenant schema):

[ ] create_distributed_table_concurrently() doesn't support creating a single shard table.
[ ] alter_distributed_table() doesn't support altering a single-shard table.
[ ] alter_distributed_table() doesn't support colocating a random table with a single-shard table.
[ ] split_shards() could allow splitting the shard of a single-shard table by accepting a distribution column argument. Alternatively, allowing create_distributed_table_concurrently() to accept a single-shard table would help with the same scenario without requiring a syntax change.
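
The first three items above can be sketched as follows. It assumes that a single-shard table is created by passing a NULL distribution column to create_distributed_table(). All table names and columns are hypothetical, and the comments only restate the limitations listed above rather than exact error messages. Run the statements one at a time, since some of them are expected to fail.

```sql
-- Hypothetical tables used only for illustration.
CREATE TABLE single_shard_tbl (id bigint, payload text);
CREATE TABLE other_tbl (id bigint, payload text);
CREATE TABLE dist_tbl (id bigint, payload text);

-- Assumption: a NULL distribution column yields a single-shard table.
SELECT create_distributed_table('single_shard_tbl', NULL);

-- Listed above as unsupported: the concurrent variant with a NULL distribution column.
SELECT create_distributed_table_concurrently('other_tbl', NULL);

-- Listed above as unsupported: altering a single-shard table.
SELECT alter_distributed_table('single_shard_tbl', shard_count := 4);

-- Listed above as unsupported: colocating another table with a single-shard table.
SELECT create_distributed_table('dist_tbl', 'id');
SELECT alter_distributed_table('dist_tbl', colocate_with := 'single_shard_tbl');
```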

Jun 09 '23 10:06 onurctirtir

Usability issues observed when doing basic testing with django-tenants:

Django sends add/drop constraint commands together with set constraint commands in a single statement, as in:
- SET CONSTRAINTS fkey IMMEDIATE; ALTER TABLE referencing_tbl DROP CONSTRAINT fkey;
- ALTER TABLE referencing_tbl ADD COLUMN id integer DEFAULT 1 NOT NULL CONSTRAINT fkey REFERENCES referenced_tbl(id) DEFERRABLE INITIALLY DEFERRED; SET CONSTRAINTS fkey IMMEDIATE
This results in the following error due to the command that we send to workers: "cannot insert multiple commands into a prepared statement"

To fix those errors, we need to properly deparse the following commands without relying on the original DDL statement:
- [x] ALTER TABLE DROP CONSTRAINT (Fixed by https://github.com/citusdata/citus/pull/7012)
- [x] ALTER TABLE ADD COLUMN CONSTRAINT (Fixing this by https://github.com/citusdata/citus/pull/7032)

By default, django (and I believe some other frameworks) adds an "int"-based "id" column as a generated identity column to the model tables.

With https://github.com/citusdata/citus/pull/7008, we allowed using such generated identity columns in distributed tables to avoid breaking django migrations. However, any nextval() call made for the underlying sequence of such a column results in the following error on workers: "nextval: reached maximum value of sequence"

Altering such a column to a bigint-based one later on is not possible either: "cannot execute ALTER COLUMN command involving identity column".
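
A minimal sketch of the two identity-column problems described above, assuming a table shaped like the ones django generates. The table name, the `name` column and the choice of `id` as distribution column are hypothetical. The error texts in the comments are the ones quoted above, not independently reproduced output.

```sql
-- Hypothetical model table with an int-based identity column.
CREATE TABLE model_tbl (
    id int GENERATED BY DEFAULT AS IDENTITY,
    name text
);

-- Allowed since https://github.com/citusdata/citus/pull/7008.
SELECT create_distributed_table('model_tbl', 'id');

-- When run on a worker node, the implicit nextval() call is reported to fail with:
--   "nextval: reached maximum value of sequence"
INSERT INTO model_tbl (name) VALUES ('first row');

-- Widening the column afterwards is reported to fail with:
--   "cannot execute ALTER COLUMN command involving identity column"
ALTER TABLE model_tbl ALTER COLUMN id TYPE bigint;
```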

Moreover, undistributing a table that uses an identity column is not allowed either, which breaks some table-type-conversion operations such as creating a reference table from a Citus local table. This becomes an important problem, e.g., when creating reference tables for the shared data stored in the public schema. Shared tables might have foreign keys to each other (as is the case for some built-in django applications), and calling create_reference_table() for such tables might yield an error if one of those shared tables has automatically been converted to a Citus local table due to a foreign key to a shared Citus reference table.
- [ ] (High priority) (Medium) Enable altering identity columns (at least the underlying datatype).
- [x] (High priority) (Medium) Allow undistributing tables that have an identity column. If that's hard, at least have a smarter logic to convert Citus local tables to single-shard tables / reference tables without undistributing the table first. (https://github.com/citusdata/citus/pull/7131)
- [ ] (High priority) (Hard) Ultimately, get rid of the limitations regarding the usage of identity columns in Citus tables.

Jun 16 '23 14:06 onurctirtir