gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

Implement multi-key sort on GPDB7

Open yaowangm opened this issue 1 year ago • 9 comments

Overview

Mksort (multi-key sort) is feature of GPDB6. When we merged code of pg12 to GP7, the code path of SORT in pg12 had been rewritten in a large degree (including some basic data structures), so it's impossible to simply keep the code of mksort. The only possible solution to support mksort on GP7 is to re-implement it.

The implementation is based on the paper: Jon L. Bentley and Robert Sedgewick, "Fast Algorithms for Sorting and Searching Strings", Jan 1997 https://www.cs.tufts.edu/~nr/cs257/archive/bob-sedgewick/fast-strings.pdf

Besides the implementation of core algorithm, the architectures of mksort are very different between GPDB6 and GPDB7. On GPDB6, mksort implemented a completely separated code path from ExecSort(), with custom data structures. On GPDB7, benefited by more clean code path of SORT, we can simply replace qsort with mksort in process of in-memory sort, without caring others.

An important improvement is: for IndexTuple, when all the columns are equal, we need to perform an additional comparing based on ItemPointer to determine the order. mksort on GPDB6 doesn't have the behavior, but it's needed on GPDB7 to make the result consistent to qsort.

Scope

Referring the implementation on GPDB6, the scope of mksort consists of three parts:

  1. Core algorithm to sort SortTuple
  2. Some call-back functions to handle different tuple types
  3. Mksort motion (not implemented on GPDB7 yet)

A summary of tuple types supported by mksort:

  • HeapTuple: supported
  • IndexTuple(btree): supported
  • IndexTuple(hash): not supported because there is only one key
  • DatumTuple: not supported because there is only one key
  • HeapTuple(for cluster): not supported because it's not supported by mksort on GPDB6 (see switcheroo_tuplesort_begin_cluster()), but may be supported as future work
  • HeapTuple(for repack): not supported because it doesn't exist on GPDB6, but may be supported as future work

Note when "Sorting with abbreviated keys" is enabled, each datum has two values: abbreviated key and "full" datum data. Special handling is needed for comparing datum.

A reference of "Sorting with abbreviated keys": https://brandur.org/sortsupport

Test

Generally, all test cases involving qsort_tuple() should be performed with/without gp_enable_mk_sort, but they are too much. I updated only the ones which got different result with/without gp_enable_mk_sort, and I suppose the changes are enough to verify the code.

The changes of test cases include:

  1. Generally, mksort should generate result exactly same to qsort. However some test cases don't. The reason is that SQL doesn't specify order on all possible columns, e.g. "select c1, c2 from t1 order by c1" will generate different results between mksort/qsort when c1 values are equal, and the solution is to order c2 as well ("select c1, c2 from t1 order by c1, c2").
  2. Something added to cover particular scenarios of mksort code path (see regress/stats_ext).
  3. Some cases need to be updated to display the new sort method "multi-key sort".
  4. regress/gp_sort was updated with new cases to cover some scenarios of mksort.

The pipeline: https://dev.ci.gpdb.pivotal.io/teams/main/pipelines/wayao_dev2

Performance

The tests were under release build, and GPDB6 was built on branch 6X_STABLE. The time means total time for executing query, but not the executing time of mksort itself. Note that there may still be something work to do for performance tuning, so the result is just for reference.

I used the script for some very basic tests:

\timing on
drop table t1;
create table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
insert into t1 values (generate_series(1,499999), 0, 0, 0, 0, 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb');
update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3;
update t1 set c6 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbb' || (c1 % 5)::text;
show gp_enable_mk_sort;
set statement_mem='1GB';
set Optimizer=off;
set gp_enable_mk_sort=off;
explain analyze select c1 from t1 order by c6, c5, c4, c3, c2, c1;

Enlarging statement_mem to ensure the sort happens on memory rather than spilling to disk.

GPDB7

Disable Mksort on gpdb7:

1035.364 ms
1026.386 ms
1030.415 ms

Enable Mksort on gpdb7:

723.907 ms
716.320 ms
762.511 ms

GPDB6

Disable Mksort on gpdb6:

gp_enable_motion_mk_sort=off:
5937.555 ms
5671.934 ms
5864.845 ms

gp_enable_motion_mk_sort=on:
5151.260 ms
5382.456 ms
5359.935 ms

Enable Mksort on gpdb6:

gp_enable_motion_mk_sort=off:
2346.139 ms
2347.470 ms
2321.878 ms

gp_enable_motion_mk_sort=on:
1734.077 ms
1732.768 ms
1736.268 ms

As expected, the performance is: gpdb7 w/ mksort > gpdb7 wo/ mksort > gpdb6 w/ mksort > gpdb6 wo/ mksort

With enabling mksort on gpdb7, we got about 42% perf improvement. If gp_enable_motion_mk_sort is implemented in future, we can expect extra improvement (maybe 10~30%, not sure for now).

With enabling mksort on gpdb6, we got about 150% perf improvement. With enabling gp_enable_motion_mk_sort, additional 53% improvement was achieved.

Basically, we got significant perf improvement on GPDB7 with mksort, but not much as GPDB6. As my understanding, the difference can be mainly explained by the fact that the code path of sort has been rewritten on GPDB7. We can also investigate more to GPDB6 code for anything missed on GPDB7 mksort implementation.

Another pert test by @interma : (TPCH-Q1): https://github.com/greenplum-db/gpdb/pull/17200#issuecomment-2011078881

TODO

These should be treated as separated works in future:

  1. Support mksort motion
  2. Support more tuple types

Co-authored-by: Hongxu Ma [email protected]

yaowangm avatar Mar 13 '24 11:03 yaowangm