gpdb
gpdb copied to clipboard
ao/co: Support missing-mode ADD COLUMN for AOCO tables
NOTE: The initial design has been largely changed. Please read this comment and onwards for discussion about the new design. I'll update the this description section later.
This commit optimizes AT ADD COLUMN for ao_column tables in the similar way as ao_row tables: the new column will be regarded as "missing" for all existing rows at the time of ADD COLUMN. We record the "last row numbers" of each segment file for the new column. At table scanning time, we will use that information to decide if a certain row is missing. If so, we'll fetch the default value from pg_attribute.attmissingval instead of the file.
However, different than ao_row tables, for ao_column tables we cannot completely avoid writing: columns files in ao_column tables are scanned individually. So they have to have information about what rows exist in them. Therefore, during ADD COLUMN, we still need to scan an existing column, and write same number of rows in the new column. Currently, we are writing a NULL in place of the missing value. Therefore, this optimization is more useful when the column's default value is large. E.g.:
-- prep
create table co(a int) using ao_column;
insert into co select * from generate_series(1,10000000);
-- w/o the optimization
postgres=# alter table co add column b text default repeat('abc',100);
ALTER TABLE
ime: 11229.041 ms (00:11.229)
-- w/ the optimization
postgres=# alter table co add column b text default repeat('abc',100);
ALTER TABLE
Time: 477.533 ms
-- but much smaller benefit when default value size is small (like an integer)
-- w/o the optimization
postgres=# alter table co add column b int default 10;
ALTER TABLE
Time: 638.668 ms
-- w/ the optimization
postgres=# alter table co add column b int default 10;
ALTER TABLE
Time: 475.520 ms
We will also see some observable improvement over the SELECT performance since we avoid reading from disk for the missing columns. This particularly helpful for CO tables where the column is scanned individually.
-- prep
drop table if exists co;
create table co(a int) using ao_column;
insert into co select * from generate_series(1,10000000);
alter table co add column b int default 1;
alter table co add column c text default repeat('abc', 100);
\timing on
-- w/o the optimization
postgres=# select count(b) from co;
count
----------
10000000
(1 row)
Time: 1313.305 ms (00:01.313)
postgres=# select count(c) from co;
count
----------
10000000
(1 row)
Time: 3399.913 ms (00:03.400)
-- w/ the optimization
postgres=# select count(b) from co;
count
----------
10000000
(1 row)
Time: 1281.460 ms (00:01.281)
postgres=# select count(c) from co;
count
----------
10000000
(1 row)
Time: 3275.436 ms (00:03.275)
We do not simply write nothing (i.e. not even NULL) because it would be a far more invasive change as there are too many expectations in the code for the block header to be consistent with the block content. It would be hard to get it right and the potential benefit is not obvious (NULL is stored as bitmap in the AOCO storage and is processed fast). So we would want to wait and see if there's a real need to invest the time on it.
Also fixed an issue that we should allow alter column constraint to be done with alter column optimization, and related test issues.
Co-authored-by: Divyesh Vanjare [email protected] Co-authored-by: Huansong Fu [email protected]
Dev pipeline: https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/aoco_addcol_no_write
Here are some reminders before you submit the pull request
- [ ] Add tests for the change
- [ ] Document changes
- [ ] Communicate in the mailing list if needed
- [ ] Pass
make installcheck - [ ] Review a PR in return to support the community