risinglight icon indicating copy to clipboard operation
risinglight copied to clipboard

catalog: support composite primary key

Open skyzh opened this issue 2 years ago • 18 comments

Currently, our catalog cannot handle the case for composite primary key. For example,

create table t1 (x int not null, y int not null, primary key(x, y));
create table t2 (x int not null, y int not null, primary key(y, x));

These two tables will have exactly the same catalog entry, despite that the first one should have (x, y) as composite sort key, and the second one have (y, x). The order is different!

Therefore, we will need to refactor the catalog to correctly record such information. For TableCatalog, we should have a field primary_key_indices, and remove the is_primary field from ColumnDesc.

This task might involve some refactors and some regressions in functionalities. Need plan first.

skyzh avatar Apr 14 '22 05:04 skyzh

... may also be done together with https://github.com/risinglightdb/risinglight/issues/590

skyzh avatar Apr 14 '22 05:04 skyzh

assigned to @Shmiwy

skyzh avatar Apr 14 '22 05:04 skyzh

@skyzh It seems that we don't support primary key? when I type this sql

create table t (a int not null, b int not null, c int, d int, primary key(a, b));

the physicalCreateTable says that neither a nor b is primary key🤣 image

shmiwy avatar Apr 18 '22 09:04 shmiwy

Then our binder doesn't support such syntax now 🤣 Please also help refactor the binder to support primary key (a, b) syntax.

https://github.com/risinglightdb/risinglight/blob/a5644b13621063b4f38f45155defaf8157498739/src/binder/statement/create_table.rs#L20-L22

For example, the first statement produces AST like:

        CreateTable {
            or_replace: false,
            temporary: false,
            external: false,
            if_not_exists: false,
            name: ObjectName([Ident {
                value: "t1",
                quote_style: None,
            }]),
            columns: [
                ColumnDef {
                    name: Ident {
                        value: "x",
                        quote_style: None,
                    },
                    data_type: Int(None),
                    collation: None,
                    options: [ColumnOptionDef {
                        name: None,
                        option: NotNull,
                    }],
                },
                ColumnDef {
                    name: Ident {
                        value: "y",
                        quote_style: None,
                    },
                    data_type: Int(None),
                    collation: None,
                    options: [ColumnOptionDef {
                        name: None,
                        option: NotNull,
                    }],
                },
            ],
            constraints: [Unique {
                name: None,
                columns: [
                    Ident {
                        value: "x",
                        quote_style: None,
                    },
                    Ident {
                        value: "y",
                        quote_style: None,
                    },
                ],
                is_primary: true,
            }],
            hive_distribution: NONE,
            hive_formats: Some(HiveFormat {
                row_format: None,
                storage: None,
                location: None,
            }),
            table_properties: [],
            with_options: [],
            file_format: None,
            location: None,
            query: None,
            without_rowid: false,
            like: None,
            engine: None,
            default_charset: None,
            collation: None,
        };

We can read the constraints variable and assign the primary keys.

skyzh avatar Apr 18 '22 09:04 skyzh

create table t (a int not null primary key, b int not null, c int, d int);

By the way, currently the binder only supports primary key at this location 🤣

skyzh avatar Apr 18 '22 09:04 skyzh

Can we add a field called primary_key_indices in TableCatalog without removng is_primary field from ColumnDesc? I think we can just change the new function for TableCatalog(src/ catalog/ table.rs) around these line, https://github.com/risinglightdb/risinglight/blob/a5644b13621063b4f38f45155defaf8157498739/src/catalog/table.rs#L43-L46 add a judgment about col_catalog.is_primary() before line 44. If it is true, we can record the corresponding col_id in somewhere (maybe a vector). When we end up the for loop, we can change the primary_key_indices.

After we have created the a table, it seems that subsequent queries will not modify the catalog data like primary keys' id of the table. So it seems acceptable that we store redundant data about primary keys.🤣

shmiwy avatar Apr 18 '22 15:04 shmiwy

By the way, changing the binder to supports this kind of sql query seems easy.

create table t(a int not null, b int not null, c int, d int, primary key(a, b));

(emmm, also I still have one weird mistake I haven't figured out, one sql logic test falled????)🤪

shmiwy avatar Apr 18 '22 15:04 shmiwy

(emmm, also I still have one weird mistake I haven't figured out, one sql logic test falled????)🤪

You may send a draft PR and we can investigate it...

Can we add a field called primary_key_indices in TableCatalog without removng is_primary field from ColumnDesc? I think we can just change the new function for TableCatalog(src/ catalog/ table.rs) around these line,

Of course we can :) But from my perspective, is_primary won't be used after we have refactored all occurrences. We can have such intermediate state, and refactor it later.

skyzh avatar Apr 18 '22 15:04 skyzh

But from my perspective, is_primary won't be used after we have refactored all occurrences. We can have such intermediate state, and refactor it later.

I think it just make some thing like is_primary function for ColumnCatalog easier hhhh(I I find this function used in many places). if we remove it, maybe we need a more complex logic to Implement this function(maybe a extra filed in ColumnDesc to indicate which table the column belongs to, or a complex logic to scan travel from database to schema to table to column just to find whether the column is primary

shmiwy avatar Apr 18 '22 16:04 shmiwy

Or we need to refactor everything that uses the function (more bugs can emerge hhhh🤩)

shmiwy avatar Apr 18 '22 16:04 shmiwy

I think it just make some thing like is_primary function for ColumnCatalog easier

Or we need to refactor everything that uses the function

Exactly. If you want to refactor, we can do this. But storing is_primary inside ColumnDesc is more efficient -- we can do an O(1) hash lookup to determine whether a key is a primary key, and we don't need to refactor everywhere. For the first step, I'd like to retain it.

I find this function used in many places

For most of the places using is_primary (especially in planner and optimizer), it's used to determine the sort key of storage. However, most of the usages only consider single primary key. If we want to extend the case to multiple sort keys, a refactor is unavoidable. For example, BoundColumnRef's is_primary IMO should be removed in the future. We can do this later when we have a more detailed plan. Just keep a single PR concise and only solving one problem.

skyzh avatar Apr 18 '22 16:04 skyzh

still not tell difference between primary key(a, b) and primary key(b, a), i will do this int the not-too-distant future.

shmiwy avatar Apr 19 '22 03:04 shmiwy

on the first step, I'm going to add two filed in TableCatalog.Inner, one is pk_ids which is a HashSet to indicate which columns are primary, the other one is extra_ipk_ids. extra_pk_ids is a vector and will be used to record which columns are declared by "primary key(c1, c2, ..)" syntax.

for example, if we have a sql like

create table t (a int primary key, b int not null, c int not null, d int , primary key(b, c));

the pk_ids will record the column id for a b c, and the extra_pk_ids will record the column id for b c in order b followed by c.

I'm trying to maintain the data of the two filed without using any of the ColumnDesc information except column id, because we will remove is_primary filed in ColumnDesc in the future and refactor some code. So I will not use any of them.

To achieve this goal, I will refactor some codes when creating a table, add two filed in BoundCreateTable like 'TableCatalog,Inner', get the two filed data from stmt after parsing, pass them through binder, logical_plan .. until I call create_table on storage.

shmiwy avatar Apr 19 '22 08:04 shmiwy

create table t (a int primary key, b int not null, c int not null, d int , primary key(b, c));

I think we should forbid this in binder. We should only allow single primary key if a int primary key, or multiple primary keys using primary key(b, c) syntax.

skyzh avatar Apr 19 '22 08:04 skyzh

So, if

create table t (a int, b int not null, c int not null, d int , primary key(b, c));

We should have hashset to set b -> true, c -> true, and ordered_pk_ids to id of b, id of c.

If

create table t (a int primary key, b int not null, c int not null, d int);

We should set a -> true, while ordered_pk_ids only contain id of a.

I believe in the future, the HashSet won't be used, as all inquiries about pks will consider order. But we can keep it for now, to refactor everything little by little.

skyzh avatar Apr 19 '22 08:04 skyzh

To achieve this goal, I will refactor some codes when creating a table, add two filed in BoundCreateTable like 'TableCatalog,Inner', get the two filed data from stmt after parsing, pass them through binder, logical_plan .. until I call create_table on storage.

This plan looks great! We can proceed on this.

skyzh avatar Apr 19 '22 08:04 skyzh

ok, if so, I will only keep one vector in tableCatalog, and forbiden that syntax from binder

Sounds good to me.

skyzh avatar Apr 19 '22 08:04 skyzh

ok🥳

shmiwy avatar Apr 19 '22 08:04 shmiwy