gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

Enable index only scan on ao/aocs table.

Open haolinw opened this issue 3 years ago • 2 comments

The index scan has disabled on ao/aocs table for that random access performance on the ao/aocs table is extremely low. But index only scan does not access data files. Its performance should be as good as the heap table. So index only scan should be enable on ao/aocs table.

The san node use eof in aoseg table and visibility map to exam the visibility of tuples in index. The member variables lastRowNum and firstRowNum are introduced into AOCSFetchDesc / AppendOnlyFetchDesc to describe the range of visible row numbers. Then use the visibility map to check the visibility of each tuple.

Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.

Author: Jinbao Chen [email protected] Co-authored-by: Haolin Wang [email protected]

Here are some reminders before you submit the pull request

  • [ ] Add tests for the change
  • [ ] Document changes
  • [ ] Communicate in the mailing list if needed
  • [ ] Pass make installcheck
  • [ ] Review a PR in return to support the community

haolinw avatar Jul 06 '22 02:07 haolinw

This is trying to revise PR https://github.com/greenplum-db/gpdb/pull/10963

haolinw avatar Jul 06 '22 02:07 haolinw

We introduced another Bitmap to improve the efficiency of checking the visibility of a tuple, can we apply this to the normal scan? Such as SeqScan (in aocs_getnext()).

If we can do that, can we get benefit from this?

That's in common AppendOnlyVisimap_IsVisible() function, could be reached at need. Will check it for sure.

haolinw avatar Aug 17 '22 08:08 haolinw

Hi @haolinw , can you rebase your branch, squash and finalize the commits? I will take a look at this for sure starting Monday.

soumyadeep2007 avatar Nov 18 '22 17:11 soumyadeep2007

Thanks @soumyadeep2007 for following up this, I rebased it and will try to catchup your later comments.

haolinw avatar Nov 21 '22 08:11 haolinw

Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.

Would be helpful to describe the problem and design of this more and possibly have separate commit for it as is pure performance optimization which is separate from rest of the work in this PR.

ashwinstar avatar Nov 22 '22 01:11 ashwinstar

Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.

Would be helpful to describe the problem and design of this more and possibly have separate commit for it as is pure performance optimization which is separate from rest of the work in this PR.

Yeah, that makes sense. Separated it.

haolinw avatar Nov 22 '22 09:11 haolinw

Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.

Would be helpful to describe the problem and design of this more and possibly have separate commit for it as is pure performance optimization which is separate from rest of the work in this PR.

Yeah, that makes sense. Separated it.

Can you squash the changes or remove the commit "Use a fixed-length FileSegInfo* Array in AppendOnlyFetchDesc" ? I can't quite follow what is going on.

soumyadeep2007 avatar Nov 22 '22 22:11 soumyadeep2007

Another problem we encountered was the poor random access performance of the visibility map. So we added a bitmap to record the tuple id range of all the tuples visible.

Would be helpful to describe the problem and design of this more and possibly have separate commit for it as is pure performance optimization which is separate from rest of the work in this PR.

Yeah, that makes sense. Separated it.

Can you squash the changes or remove the commit "Use a fixed-length FileSegInfo* Array in AppendOnlyFetchDesc" ? I can't quite follow what is going on.

Done. I removed "Use a fixed-length FileSegInfo* Array" in my refactor version, instead, introducing two int arrays for the segno<->idx mapping purpose, refer to src/include/access/appendonlytid.h

haolinw avatar Nov 23 '22 00:11 haolinw

Need to add a case to test a question from @soumyadeep2007: if the first insert operation was aborted after creating table, currently no data in visimap but has dead tuples in table, indexonlyscan still works?

haolinw avatar Nov 23 '22 00:11 haolinw

Need to add a case to test a question from @soumyadeep2007: if the first insert operation was aborted after creating table, currently no data in visimap but has dead tuples in table, indexonlyscan still works?

I added a sq me commit with a targeted uao test. This scenario passes and other visibility scenarios added passes for both AO and CO.

soumyadeep2007 avatar Nov 23 '22 00:11 soumyadeep2007

The index scan has disabled on ao/aocs table for that random access performance on the ao/aocs table is extremely low. But index only scan does not access data files. Its performance should be as good as the heap table. So index only scan should be enable on ao/aocs table.

I would like to reflect on this aspect, just access blocks on disk being random isn't the reason for disabling index scans. Even visimap lookups in random order could yield to poor performance. We have seen this for vacuum task where index cleanup was performed based on visibility map and performance was not good at all. Thanks to PR https://github.com/greenplum-db/gpdb/pull/13255 which removed dependency on visimap to make that decision during vacuum's. Hence, need to quantify this aspect to possible to factor in when index-only scans should be picked for append-optimized tables.

ashwinstar avatar Nov 23 '22 02:11 ashwinstar

Would be also helpful to understand the costing or factors based on which index-only scan will be picked for append-optimized tables (decision used for heap may not be relevant to make plan choice for AO) after this PR.

ashwinstar avatar Nov 23 '22 02:11 ashwinstar

Also, lets make sure to update README with all the design for this feature

ashwinstar avatar Nov 23 '22 02:11 ashwinstar

Please defer the segno optimization to a completely separate PR and remove from this PR entirely. Seg files don't have to be consulted for index only scans.

This is very important aspect to pay close attention too. In past (before 258ec96), pg_aoseg/pg_aocsseg stored EOF (or checking physical file for tuple presence) was the only source to perform transaction visibility checks for append-optimized tables (blockdirectory was only used for optimization).

With the change made by commit 258ec96 to align block directory reflect the reality of block information on-disk, now the design is - this is something we should very clearly mention in README if not already

  • for non-indexed append-optimized tables EOF acts as source (along with visimap)
  • for indexed append-optimized tables blockdirectory acts as source (along with visimap)

Commit 258ec96 changed the format version to AORelationVersion_PG12, so only tables with this format version should enable index-only scan using the above logic. So, tables in-place upgraded from GPDB6->GPDB7 with old block directory format should not pick index-only scans.

ashwinstar avatar Nov 23 '22 19:11 ashwinstar

Also, lets make sure to update README with all the design for this feature

Yes, let's create a section in the readme titled "Index Only Scan", similar to the one created for unique indexes.

soumyadeep2007 avatar Nov 23 '22 19:11 soumyadeep2007

Commit 258ec96 changed the format version to AORelationVersion_PG12, so only tables with this format version should enable index-only scan using the above logic. So, tables in-place upgraded from GPDB6->GPDB7 with old block directory format should not pick index-only scans.

You might be able to reuse ValidateRelationVersionForUniqueIndex for this. Please rename the function appropriately.

soumyadeep2007 avatar Nov 24 '22 21:11 soumyadeep2007

Commit 258ec96 changed the format version to AORelationVersion_PG12, so only tables with this format version should enable index-only scan using the above logic. So, tables in-place upgraded from GPDB6->GPDB7 with old block directory format should not pick index-only scans.

You might be able to reuse ValidateRelationVersionForUniqueIndex for this. Please rename the function appropriately.

Got it. Thanks Deep.

haolinw avatar Nov 25 '22 00:11 haolinw

(1) Please defer the segno optimization to a completely separate PR and remove from this PR entirely. Seg files don't have to be consulted for index only scans.

Separate segno lookup optimization to a new PR https://github.com/greenplum-db/gpdb/pull/14541.

haolinw avatar Nov 28 '22 03:11 haolinw

Commit 258ec96 changed the format version to AORelationVersion_PG12, so only tables with this format version should enable index-only scan using the above logic. So, tables in-place upgraded from GPDB6->GPDB7 with old block directory format should not pick index-only scans.

Have a question on this: Seems there is no straightforward place to inject code to determine whether select indexonlyscan or not based on the format version of AO blkdir during planning phase (which needs I/O to access pg_aoseg). Here lists possible ways I can see to move forward to determine the final implementation:

  1. Inject code to estimate_rel_size() which seems the only place where could(or already) call GetFileSegInfo() to get the formatversion in the planer layer, then the later paths selection might be able to determine indexonlyscan based on that. This seems to be the best option if it can be implemented properly.
  2. If above option is not feasible, the indexonlyscan might be selected without knowing the underlying blkdir format, then error out if indexonlyscan in a pre-AORelationVersion_PG12 format, just like unique index does (ValidateRelationVersionForUniqueIndex). This is the simplest option, but might introduce failure to upgrade hence documentation and workaround needs to be considered clearly.
  3. More complicated, not error out in above case, instead, following the original logic to do physical scanning (call scanToFetchTuple) after visimap check to get the final visibility of the tuple. This is similar like Heap behavior and seems more friendly than error. I saw we have already ruled out this option from previous comments, right ? List here to confirm.

I would try to implement the 1st option, if not success, then fall back to 2nd. Please let me know your advice or concerns. @ashwinstar @soumyadeep2007

haolinw avatar Nov 28 '22 12:11 haolinw

  1. Inject code to estimate_rel_size() which seems the only place where could(or already) call GetFileSegInfo() to get the formatversion in the planer layer, then the later paths selection might be able to determine indexonlyscan based on that. This seems to be the best option if it can be implemented properly.

Failed to implement option 1 as no chance to read segment pg_aoseg info during query planning in coordinator, and don't think it is a good idea to introduce segment I/O here to determine whether picking indexonlyscan path or not. If I understand correctly, seems this option is not feasible.

haolinw avatar Nov 29 '22 06:11 haolinw

Also, lets make sure to update README with all the design for this feature

Yes, let's create a section in the readme titled "Index Only Scan", similar to the one created for unique indexes.

Will do that once finalize all designing aspects.

haolinw avatar Nov 29 '22 06:11 haolinw

  • We talked about borrowing some of the code or creating similar function to blkdir_entry_exists, which was introduced for uniqueness checks. We can't directly use it since it assumes a dirty snapshot and starts a scan -> scans -> stops the scan on every call.
  • AppendOnlyBlockDirectory_GetEntry should not be used as it can sometimes return true even though a block directory doesn't exist (See how we have to call scanToFetchTuple additionally in appendonly_fetch to determine visibility)

Using AppendOnlyBlockDirectory_GetEntry + AppendOnlyBlockDirectoryEntry_RangeHasRow (+ visimap) to achieve post-AORelationVersion_PG12 format visibility check. So current workflow still does sysscan (begin-end) per AppendOnlyBlockDirectory_GetEntry() call. Since current new code didn't call sysscan (begin-end) directly, so didn't do the optimization of calling syscan (begin-end) only once for entire indexonlyscan. Maybe consider it later?

haolinw avatar Nov 29 '22 08:11 haolinw

@soumyadeep2007 Let's move forward :-). I think the pg_appendonly part should be ready to merge. Next, I'm going to work on block directory refactor. Some questions come up after reviewing your previous comments below:

We need a way to store the blkdir sysscan desc in the AppendOnlyBlockDirectory struct (AppendOnlyBlockDirectory->scan). We should refactor AppendOnlyBlockDirectory_GetEntry() to not set up the sysscan desc multiple times, and use this struct member. We can initialize it in AppendOnlyBlockDirectory_Init_forSearch(). This struct member will be used later on in our IndexOnlyScan code.

Don't quite understand why we need to cache the sysscan for indexscan on blkdir as the scan key is not always same per call which is not like seqscan (which I introdced and cache that scan in fast-analyze).

If it is not necessary, I would not touch GetEntry and try to refactor AO indexonlyscan by using blkdir_entry_exists. Sounds OK ?

Also, if there's no work on GetEntry, is that OK to put the other two renaming comments(item 1&3 above) in a separate commit in this PR (not a separate PR) ?

Table AM API: bool table_index_fetch_tuple_visible(Relation rel, ItemPointer tid, Snapshot snapshot) We shouldn't need to call *_fetch_init() as that does too much initialization. Those functions are meant to perform tuple fetches from the actual table, and are used for regular index scans.

If we don't "fetch" tuple, is table_index_tuple_visible() a better name ?

haolinw avatar Mar 04 '23 01:03 haolinw

Don't quite understand why we need to cache the sysscan for indexscan on blkdir as the scan key is not always same per call which is not like seqscan (which I introdced and cache that scan in fast-analyze).

You are right its not the same. I don't know how I overlooked that.

If it is not necessary, I would not touch GetEntry and try to refactor AO indexonlyscan by using blkdir_entry_exists. Sounds OK ?

Sounds good!

Also, if there's no work on GetEntry, is that OK to put the other two renaming comments(item 1&3 https://github.com/greenplum-db/gpdb/pull/13766#pullrequestreview-1266599546) in a separate commit in this PR (not a separate PR) ?

If you prefer, no problem!

If we don't "fetch" tuple, is table_index_tuple_visible() a better name ?

Yes, but if you see IndexOnlyNext(), we call index_beginscan() which calls table_index_fetch_begin() and so on. So I wanted to add table_index_fetch_tuple_visible() to that family by keeping the table_index_fetch prefix. We can make things clear with comments.

soumyadeep2007 avatar Mar 06 '23 06:03 soumyadeep2007

Yes, but if you see IndexOnlyNext(), we call index_beginscan() which calls table_index_fetch_begin() and so on. So I wanted to add table_index_fetch_tuple_visible() to that family by keeping the table_index_fetch prefix. We can make things clear with comments.

I see. That works for me for using table_index_fetch_tuple_visible.

haolinw avatar Mar 06 '23 06:03 haolinw

I think I have addressed most of the previous comments, mainly including:

  • ban plan based on pg_appendonly.version
  • refactor indexonlyscan tableam based on blkdir_entry_exists

Please continue to review, thanks. cc @soumyadeep2007 @ashwinstar.

haolinw avatar Mar 07 '23 12:03 haolinw