gpdb
gpdb copied to clipboard
Enable index only scan on ao/aocs table.
The index scan has disabled on ao/aocs table for that random access performance on the ao/aocs table is extremely low. But index only scan does not access data files. Its performance should be as good as the heap table. So index only scan should be enable on ao/aocs table.
The san node use eof in aoseg table and visibility map to exam the visibility of tuples in index. The member variables lastRowNum and firstRowNum are introduced into AOCSFetchDesc / AppendOnlyFetchDesc to describe the range of visible row numbers. Then use the visibility map to check the visibility of each tuple.
Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.
Author: Jinbao Chen [email protected] Co-authored-by: Haolin Wang [email protected]
Here are some reminders before you submit the pull request
- [ ] Add tests for the change
- [ ] Document changes
- [ ] Communicate in the mailing list if needed
- [ ] Pass
make installcheck - [ ] Review a PR in return to support the community
This is trying to revise PR https://github.com/greenplum-db/gpdb/pull/10963
We introduced another Bitmap to improve the efficiency of checking the visibility of a tuple, can we apply this to the normal scan? Such as SeqScan (in
aocs_getnext()).If we can do that, can we get benefit from this?
That's in common AppendOnlyVisimap_IsVisible() function, could be reached at need. Will check it for sure.
Hi @haolinw , can you rebase your branch, squash and finalize the commits? I will take a look at this for sure starting Monday.
Thanks @soumyadeep2007 for following up this, I rebased it and will try to catchup your later comments.
Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.
Would be helpful to describe the problem and design of this more and possibly have separate commit for it as is pure performance optimization which is separate from rest of the work in this PR.
Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.
Would be helpful to describe the problem and design of this more and possibly have separate commit for it as is pure performance optimization which is separate from rest of the work in this PR.
Yeah, that makes sense. Separated it.
Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.
Would be helpful to describe the problem and design of this more and possibly have separate commit for it as is pure performance optimization which is separate from rest of the work in this PR.
Yeah, that makes sense. Separated it.
Can you squash the changes or remove the commit "Use a fixed-length FileSegInfo* Array in AppendOnlyFetchDesc" ? I can't quite follow what is going on.
Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.
Would be helpful to describe the problem and design of this more and possibly have separate commit for it as is pure performance optimization which is separate from rest of the work in this PR.
Yeah, that makes sense. Separated it.
Can you squash the changes or remove the commit "Use a fixed-length FileSegInfo* Array in AppendOnlyFetchDesc" ? I can't quite follow what is going on.
Done. I removed "Use a fixed-length FileSegInfo* Array" in my refactor version, instead, introducing two int arrays for the segno<->idx mapping purpose, refer to src/include/access/appendonlytid.h
Need to add a case to test a question from @soumyadeep2007: if the first insert operation was aborted after creating table, currently no data in visimap but has dead tuples in table, indexonlyscan still works?
Need to add a case to test a question from @soumyadeep2007: if the first insert operation was aborted after creating table, currently no data in visimap but has dead tuples in table, indexonlyscan still works?
I added a sq me commit with a targeted uao test. This scenario passes and other visibility scenarios added passes for both AO and CO.
The index scan has disabled on ao/aocs table for that random access performance on the ao/aocs table is extremely low. But index only scan does not access data files. Its performance should be as good as the heap table. So index only scan should be enable on ao/aocs table.
I would like to reflect on this aspect, just access blocks on disk being random isn't the reason for disabling index scans. Even visimap lookups in random order could yield to poor performance. We have seen this for vacuum task where index cleanup was performed based on visibility map and performance was not good at all. Thanks to PR https://github.com/greenplum-db/gpdb/pull/13255 which removed dependency on visimap to make that decision during vacuum's. Hence, need to quantify this aspect to possible to factor in when index-only scans should be picked for append-optimized tables.
Would be also helpful to understand the costing or factors based on which index-only scan will be picked for append-optimized tables (decision used for heap may not be relevant to make plan choice for AO) after this PR.
Also, lets make sure to update README with all the design for this feature
Please defer the segno optimization to a completely separate PR and remove from this PR entirely. Seg files don't have to be consulted for index only scans.
This is very important aspect to pay close attention too. In past (before 258ec96), pg_aoseg/pg_aocsseg stored EOF (or checking physical file for tuple presence) was the only source to perform transaction visibility checks for append-optimized tables (blockdirectory was only used for optimization).
With the change made by commit 258ec96 to align block directory reflect the reality of block information on-disk, now the design is - this is something we should very clearly mention in README if not already
- for non-indexed append-optimized tables EOF acts as source (along with visimap)
- for indexed append-optimized tables blockdirectory acts as source (along with visimap)
Commit 258ec96 changed the format version to AORelationVersion_PG12, so only tables with this format version should enable index-only scan using the above logic. So, tables in-place upgraded from GPDB6->GPDB7 with old block directory format should not pick index-only scans.
Also, lets make sure to update README with all the design for this feature
Yes, let's create a section in the readme titled "Index Only Scan", similar to the one created for unique indexes.
Commit 258ec96 changed the format version to
AORelationVersion_PG12, so only tables with this format version should enable index-only scan using the above logic. So, tables in-place upgraded from GPDB6->GPDB7 with old block directory format should not pick index-only scans.
You might be able to reuse ValidateRelationVersionForUniqueIndex for this. Please rename the function appropriately.
Commit 258ec96 changed the format version to
AORelationVersion_PG12, so only tables with this format version should enable index-only scan using the above logic. So, tables in-place upgraded from GPDB6->GPDB7 with old block directory format should not pick index-only scans.You might be able to reuse
ValidateRelationVersionForUniqueIndexfor this. Please rename the function appropriately.
Got it. Thanks Deep.
(1) Please defer the segno optimization to a completely separate PR and remove from this PR entirely. Seg files don't have to be consulted for index only scans.
Separate segno lookup optimization to a new PR https://github.com/greenplum-db/gpdb/pull/14541.
Commit 258ec96 changed the format version to
AORelationVersion_PG12, so only tables with this format version should enable index-only scan using the above logic. So, tables in-place upgraded from GPDB6->GPDB7 with old block directory format should not pick index-only scans.
Have a question on this: Seems there is no straightforward place to inject code to determine whether select indexonlyscan or not based on the format version of AO blkdir during planning phase (which needs I/O to access pg_aoseg). Here lists possible ways I can see to move forward to determine the final implementation:
- Inject code to estimate_rel_size() which seems the only place where could(or already) call GetFileSegInfo() to get the formatversion in the planer layer, then the later paths selection might be able to determine indexonlyscan based on that. This seems to be the best option if it can be implemented properly.
- If above option is not feasible, the indexonlyscan might be selected without knowing the underlying blkdir format, then error out if indexonlyscan in a pre-AORelationVersion_PG12 format, just like unique index does (
ValidateRelationVersionForUniqueIndex). This is the simplest option, but might introduce failure to upgrade hence documentation and workaround needs to be considered clearly. - More complicated, not error out in above case, instead, following the original logic to do physical scanning (call
scanToFetchTuple) after visimap check to get the final visibility of the tuple. This is similar like Heap behavior and seems more friendly than error. I saw we have already ruled out this option from previous comments, right ? List here to confirm.
I would try to implement the 1st option, if not success, then fall back to 2nd. Please let me know your advice or concerns. @ashwinstar @soumyadeep2007
- Inject code to estimate_rel_size() which seems the only place where could(or already) call GetFileSegInfo() to get the formatversion in the planer layer, then the later paths selection might be able to determine indexonlyscan based on that. This seems to be the best option if it can be implemented properly.
Failed to implement option 1 as no chance to read segment pg_aoseg info during query planning in coordinator, and don't think it is a good idea to introduce segment I/O here to determine whether picking indexonlyscan path or not. If I understand correctly, seems this option is not feasible.
Also, lets make sure to update README with all the design for this feature
Yes, let's create a section in the readme titled "Index Only Scan", similar to the one created for unique indexes.
Will do that once finalize all designing aspects.
- We talked about borrowing some of the code or creating similar function to
blkdir_entry_exists, which was introduced for uniqueness checks. We can't directly use it since it assumes a dirty snapshot and starts a scan -> scans -> stops the scan on every call.- AppendOnlyBlockDirectory_GetEntry should not be used as it can sometimes return true even though a block directory doesn't exist (See how we have to call
scanToFetchTupleadditionally inappendonly_fetchto determine visibility)
Using AppendOnlyBlockDirectory_GetEntry + AppendOnlyBlockDirectoryEntry_RangeHasRow (+ visimap) to achieve post-AORelationVersion_PG12 format visibility check. So current workflow still does sysscan (begin-end) per AppendOnlyBlockDirectory_GetEntry() call. Since current new code didn't call sysscan (begin-end) directly, so didn't do the optimization of calling syscan (begin-end) only once for entire indexonlyscan. Maybe consider it later?
@soumyadeep2007 Let's move forward :-). I think the pg_appendonly part should be ready to merge. Next, I'm going to work on block directory refactor. Some questions come up after reviewing your previous comments below:
We need a way to store the blkdir sysscan desc in the AppendOnlyBlockDirectory struct (AppendOnlyBlockDirectory->scan). We should refactor AppendOnlyBlockDirectory_GetEntry() to not set up the sysscan desc multiple times, and use this struct member. We can initialize it in AppendOnlyBlockDirectory_Init_forSearch(). This struct member will be used later on in our IndexOnlyScan code.
Don't quite understand why we need to cache the sysscan for indexscan on blkdir as the scan key is not always same per call which is not like seqscan (which I introdced and cache that scan in fast-analyze).
If it is not necessary, I would not touch GetEntry and try to refactor AO indexonlyscan by using blkdir_entry_exists. Sounds OK ?
Also, if there's no work on GetEntry, is that OK to put the other two renaming comments(item 1&3 above) in a separate commit in this PR (not a separate PR) ?
Table AM API: bool table_index_fetch_tuple_visible(Relation rel, ItemPointer tid, Snapshot snapshot) We shouldn't need to call *_fetch_init() as that does too much initialization. Those functions are meant to perform tuple fetches from the actual table, and are used for regular index scans.
If we don't "fetch" tuple, is table_index_tuple_visible() a better name ?
Don't quite understand why we need to cache the sysscan for indexscan on blkdir as the scan key is not always same per call which is not like seqscan (which I introdced and cache that scan in fast-analyze).
You are right its not the same. I don't know how I overlooked that.
If it is not necessary, I would not touch GetEntry and try to refactor AO indexonlyscan by using blkdir_entry_exists. Sounds OK ?
Sounds good!
Also, if there's no work on GetEntry, is that OK to put the other two renaming comments(item 1&3 https://github.com/greenplum-db/gpdb/pull/13766#pullrequestreview-1266599546) in a separate commit in this PR (not a separate PR) ?
If you prefer, no problem!
If we don't "fetch" tuple, is table_index_tuple_visible() a better name ?
Yes, but if you see IndexOnlyNext(), we call index_beginscan() which calls table_index_fetch_begin() and so on. So I wanted to add table_index_fetch_tuple_visible() to that family by keeping the table_index_fetch prefix. We can make things clear with comments.
Yes, but if you see IndexOnlyNext(), we call index_beginscan() which calls table_index_fetch_begin() and so on. So I wanted to add table_index_fetch_tuple_visible() to that family by keeping the table_index_fetch prefix. We can make things clear with comments.
I see. That works for me for using table_index_fetch_tuple_visible.
I think I have addressed most of the previous comments, mainly including:
- ban plan based on pg_appendonly.version
- refactor indexonlyscan tableam based on
blkdir_entry_exists
Please continue to review, thanks. cc @soumyadeep2007 @ashwinstar.