risinglight
risinglight copied to clipboard
catalog: support composite primary key
Currently, our catalog cannot handle the case for composite primary key. For example,
create table t1 (x int not null, y int not null, primary key(x, y));
create table t2 (x int not null, y int not null, primary key(y, x));
These two tables will have exactly the same catalog entry, despite that the first one should have (x, y) as composite sort key, and the second one have (y, x). The order is different!
Therefore, we will need to refactor the catalog to correctly record such information. For TableCatalog
, we should have a field primary_key_indices
, and remove the is_primary
field from ColumnDesc
.
This task might involve some refactors and some regressions in functionalities. Need plan first.
... may also be done together with https://github.com/risinglightdb/risinglight/issues/590
assigned to @Shmiwy
@skyzh It seems that we don't support primary key? when I type this sql
create table t (a int not null, b int not null, c int, d int, primary key(a, b));
the physicalCreateTable says that neither a nor b is primary key🤣
Then our binder doesn't support such syntax now 🤣 Please also help refactor the binder to support primary key (a, b) syntax.
https://github.com/risinglightdb/risinglight/blob/a5644b13621063b4f38f45155defaf8157498739/src/binder/statement/create_table.rs#L20-L22
For example, the first statement produces AST like:
CreateTable {
or_replace: false,
temporary: false,
external: false,
if_not_exists: false,
name: ObjectName([Ident {
value: "t1",
quote_style: None,
}]),
columns: [
ColumnDef {
name: Ident {
value: "x",
quote_style: None,
},
data_type: Int(None),
collation: None,
options: [ColumnOptionDef {
name: None,
option: NotNull,
}],
},
ColumnDef {
name: Ident {
value: "y",
quote_style: None,
},
data_type: Int(None),
collation: None,
options: [ColumnOptionDef {
name: None,
option: NotNull,
}],
},
],
constraints: [Unique {
name: None,
columns: [
Ident {
value: "x",
quote_style: None,
},
Ident {
value: "y",
quote_style: None,
},
],
is_primary: true,
}],
hive_distribution: NONE,
hive_formats: Some(HiveFormat {
row_format: None,
storage: None,
location: None,
}),
table_properties: [],
with_options: [],
file_format: None,
location: None,
query: None,
without_rowid: false,
like: None,
engine: None,
default_charset: None,
collation: None,
};
We can read the constraints
variable and assign the primary keys.
create table t (a int not null primary key, b int not null, c int, d int);
By the way, currently the binder only supports primary key at this location 🤣
Can we add a field called primary_key_indices
in TableCatalog
without removng is_primary
field from ColumnDesc
?
I think we can just change the new
function for TableCatalog
(src/ catalog/ table.rs) around these line,
https://github.com/risinglightdb/risinglight/blob/a5644b13621063b4f38f45155defaf8157498739/src/catalog/table.rs#L43-L46
add a judgment about col_catalog.is_primary()
before line 44. If it is true
, we can record the corresponding col_id in somewhere (maybe a vector). When we end up the for loop, we can change the primary_key_indices
.
After we have created the a table, it seems that subsequent queries will not modify the catalog data like primary keys' id of the table. So it seems acceptable that we store redundant data about primary keys.🤣
By the way, changing the binder to supports this kind of sql query seems easy.
create table t(a int not null, b int not null, c int, d int, primary key(a, b));
(emmm, also I still have one weird mistake I haven't figured out, one sql logic test falled????)🤪
(emmm, also I still have one weird mistake I haven't figured out, one sql logic test falled????)🤪
You may send a draft PR and we can investigate it...
Can we add a field called primary_key_indices in TableCatalog without removng is_primary field from ColumnDesc? I think we can just change the new function for TableCatalog(src/ catalog/ table.rs) around these line,
Of course we can :) But from my perspective, is_primary
won't be used after we have refactored all occurrences. We can have such intermediate state, and refactor it later.
But from my perspective,
is_primary
won't be used after we have refactored all occurrences. We can have such intermediate state, and refactor it later.
I think it just make some thing like is_primary
function for ColumnCatalog
easier hhhh(I I find this function used in many places). if we remove it, maybe we need a more complex logic to Implement this function(maybe a extra filed in ColumnDesc
to indicate which table the column belongs to, or a complex logic to scan travel from database to schema to table to column just to find whether the column is primary
Or we need to refactor everything that uses the function (more bugs can emerge hhhh🤩)
I think it just make some thing like is_primary function for ColumnCatalog easier
Or we need to refactor everything that uses the function
Exactly. If you want to refactor, we can do this. But storing is_primary
inside ColumnDesc
is more efficient -- we can do an O(1) hash lookup to determine whether a key is a primary key, and we don't need to refactor everywhere. For the first step, I'd like to retain it.
I find this function used in many places
For most of the places using is_primary
(especially in planner and optimizer), it's used to determine the sort key of storage. However, most of the usages only consider single primary key. If we want to extend the case to multiple sort keys, a refactor is unavoidable. For example, BoundColumnRef
's is_primary
IMO should be removed in the future. We can do this later when we have a more detailed plan. Just keep a single PR concise and only solving one problem.
still not tell difference between primary key(a, b) and primary key(b, a), i will do this int the not-too-distant future.
on the first step, I'm going to add two filed in TableCatalog.Inner
, one is pk_ids
which is a HashSet to indicate which columns are primary, the other one is extra_ipk_ids
. extra_pk_ids
is a vector and will be used to record which columns are declared by "primary key(c1, c2, ..)" syntax.
for example, if we have a sql like
create table t (a int primary key, b int not null, c int not null, d int , primary key(b, c));
the pk_ids
will record the column id for a b c, and the extra_pk_ids
will record the column id for b c in order b followed by c.
I'm trying to maintain the data of the two filed without using any of the ColumnDesc
information except column id
, because we will remove is_primary
filed in ColumnDesc
in the future and refactor some code. So I will not use any of them.
To achieve this goal, I will refactor some codes when creating a table, add two filed in BoundCreateTable
like 'TableCatalog,Inner', get the two filed data from stmt
after parsing, pass them through binder, logical_plan .. until I call create_table
on storage.
create table t (a int primary key, b int not null, c int not null, d int , primary key(b, c));
I think we should forbid this in binder. We should only allow single primary key if a int primary key
, or multiple primary keys using primary key(b, c)
syntax.
So, if
create table t (a int, b int not null, c int not null, d int , primary key(b, c));
We should have hashset to set b -> true, c -> true, and ordered_pk_ids
to id of b
, id of c
.
If
create table t (a int primary key, b int not null, c int not null, d int);
We should set a -> true, while ordered_pk_ids only contain id of a.
I believe in the future, the HashSet won't be used, as all inquiries about pks will consider order. But we can keep it for now, to refactor everything little by little.
To achieve this goal, I will refactor some codes when creating a table, add two filed in BoundCreateTable like 'TableCatalog,Inner', get the two filed data from stmt after parsing, pass them through binder, logical_plan .. until I call create_table on storage.
This plan looks great! We can proceed on this.
ok, if so, I will only keep one vector in tableCatalog, and forbiden that syntax from binder
Sounds good to me.
ok🥳