
Add more performance tests for branching

Open aome510 opened this issue 1 year ago • 13 comments

Resolves #1901

aome510 avatar Jul 21 '22 20:07 aome510

I wasn't able to come up with more tests to measure our branching performance. Open to others' suggestions.

aome510 avatar Jul 21 '22 20:07 aome510

Results from the CI:

test_compare_child_and_root_pgbench_perf.root.number_of_clients: 10 
test_compare_child_and_root_pgbench_perf.root.number_of_threads: 1 
test_compare_child_and_root_pgbench_perf.root.number_of_transactions_actually_processed: 31198 
test_compare_child_and_root_pgbench_perf.root.latency_average: 3.201 ms
test_compare_child_and_root_pgbench_perf.root.tps: 3,124.4611 
test_compare_child_and_root_pgbench_perf.root.run_duration: 10.034 s
test_compare_child_and_root_pgbench_perf.root.run_start_timestamp: 1658436843 
test_compare_child_and_root_pgbench_perf.root.run_end_timestamp: 1658436853 
test_compare_child_and_root_pgbench_perf.child.number_of_clients: 10 
test_compare_child_and_root_pgbench_perf.child.number_of_threads: 1 
test_compare_child_and_root_pgbench_perf.child.number_of_transactions_actually_processed: 30 
test_compare_child_and_root_pgbench_perf.child.latency_average: 3,524.126 ms
test_compare_child_and_root_pgbench_perf.child.tps: 2.8376 
test_compare_child_and_root_pgbench_perf.child.run_duration: 10.633 s
test_compare_child_and_root_pgbench_perf.child.run_start_timestamp: 1658436853 
test_compare_child_and_root_pgbench_perf.child.run_end_timestamp: 1658436863 
test_compare_child_and_root_write_perf.root_run_duration: 6.704 s
test_compare_child_and_root_write_perf.child_run_duration: 7.078 s
test_compare_child_and_root_read_perf.root_run_duration: 1.187 s
test_compare_child_and_root_read_perf.child_run_duration: 2.281 s

Huge difference in the tps. Is that repeatable? If yes, let's fix it. If not, this test is too noisy.

hlinnaka avatar Jul 22 '22 07:07 hlinnaka

Huge difference in the tps. Is that repeatable? If yes, let's fix it. If not, this test is too noisy.

I also experienced that locally, so it wasn't because of slow disk I/O on Mac. I created a flamegraph yesterday; it looks like the slowdown is caused by slow WAL redo.

aome510 avatar Jul 22 '22 14:07 aome510

@hlinnaka I found the reason why there is such a huge difference. It's quite interesting that the child branch does a sequential scan to find an indexed column:

postgres=# explain analyze select * from pgbench_accounts where aid=10;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..22602.43 rows=1 width=97) (actual time=344.235..344.976 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..21602.33 rows=1 width=97) (actual time=267.328..341.883 rows=0 loops=3)
         Filter: (aid = 10)
         Rows Removed by Filter: 333333
 Planning Time: 1.367 ms
 Execution Time: 344.989 ms
(8 rows)

while the parent does not:

postgres=# explain analyze select * from pgbench_accounts where aid=10;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.42..8.44 rows=1 width=97) (actual time=0.113..0.113 rows=1 loops=1)
   Index Cond: (aid = 10)
 Planning Time: 1.743 ms
 Execution Time: 0.171 ms
(4 rows)

The reason is that we determine the child's starting LSN based on the parent's last_record_lsn. In other words, if we create the child before the WAL records that initialize the indexes have been processed on the parent, the child will not have any information about the indexes.

To prove this is the cause, I added a sleep right before the branch creation in test_compare_child_and_root_pgbench_perf:

    pg_bin.run_capture(['pgbench', '-i', pg_root.connstr(), '-s10'])

+    time.sleep(5)

    env.neon_cli.create_branch('child', 'root')
    pg_child = env.postgres.create_start('child')

Now, the child and root branches have comparable results.

Not sure how to fix this. Do you have any suggestions?

aome510 avatar Jul 25 '22 20:07 aome510

The reason is that we determine the child's starting LSN based on the parent's last_record_lsn. In other words, if we create the child before the WAL records that initialize the indexes have been processed on the parent, the child will not have any information about the indexes.

A-ha! So it's a similar case to https://github.com/neondatabase/neon/pull/2131. The fix could be similar here: get the latest LSN from Postgres (select pg_current_wal_lsn()) and pass that to the create_branch call.

This seems to happen quite frequently with our tests. Almost always when you call create_branch(), you would want to wait for all the in-flight WAL to be processed by the pageserver first. Could we do that in a more wholesale fashion in the python tests, instead of adding a wait before every create_branch() call? Or make the 'lsn' argument mandatory to force the test writer to do it.
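
A minimal sketch of what passing an explicit LSN could look like in a test (using psycopg2; the explicit-LSN keyword argument shown here is hypothetical, not the real fixture API):

import psycopg2

# Capture the parent's current WAL position so the branch point covers all
# the WAL the test has generated so far.
with psycopg2.connect(pg_root.connstr()) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")
        current_lsn = cur.fetchone()[0]        # e.g. '0/16BA018'

# Branch at that explicit LSN instead of the pageserver's last_record_lsn;
# the keyword name below is illustrative only.
env.neon_cli.create_branch('child', 'root', ancestor_start_lsn=current_lsn)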

hlinnaka avatar Jul 25 '22 21:07 hlinnaka

Could we do that in a more wholesale fashion in the python tests, instead of adding a wait before every create_branch() call? Or make the 'lsn' argument mandatory to force the test writer to do it.

Making the lsn argument required in create_branch is not a trivial change: a go-to-references on that function returns 90 results. A quicker way would be to calculate the LSN using select pg_current_wal_lsn() in create_branch if no LSN is specified. I'll try to do that in this PR if the change is simple.
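
Roughly, the idea is a wrapper like the sketch below; the helper names are illustrative, not the actual fixture code:

from typing import Optional

def create_branch(self, new_branch: str, ancestor_branch: str,
                  ancestor_lsn: Optional[str] = None) -> None:
    # Illustrative sketch: if the caller didn't pin a branch point, default it
    # to the ancestor's current WAL LSN instead of the pageserver's last_record_lsn.
    if ancestor_lsn is None:
        with self.connect_to_branch(ancestor_branch) as conn:      # hypothetical helper
            with conn.cursor() as cur:
                cur.execute("SELECT pg_current_wal_lsn()")
                ancestor_lsn = cur.fetchone()[0]
    # Delegate to the existing branch-creation call with an explicit LSN.
    self.raw_create_branch(new_branch, ancestor_branch, ancestor_lsn)  # hypothetical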

aome510 avatar Jul 25 '22 21:07 aome510

A quicker way would be to calculate the LSN using select pg_current_wal_lsn() in create_branch if no LSN is specified. I'll try to do that in this PR if the change is simple.

Updated in https://github.com/neondatabase/neon/pull/2143/commits/b45f09109d9517cd7c1b81a29747e7d94e260fb1

aome510 avatar Jul 26 '22 16:07 aome510

There are a few different things going on here, with the "sequentialization" of the pgbench init steps and the addition of new tests. I'd suggest splitting this up into multiple PRs so that you can commit some parts while discussion and investigation continue on the others.

Agreed. Will do the separation. At first, I thought the create_branch change was small enough to fit into this patch, but it turned out to cause some issues.

aome510 avatar Jul 29 '22 14:07 aome510

Agreed. Will do the separation.

Created #2189. I also added more comments explaining the new tests. I think we can either continue reviewing this patch while I investigate the "pgbench init" issue in the other PR, or wait for the other PR to land before continuing this one.

The former, IMO, is more efficient, but I'm not sure it's OK to merge test_compare_child_and_root_pgbench_perf despite the issue described in https://github.com/neondatabase/neon/pull/2143#issuecomment-1192283904

aome510 avatar Jul 29 '22 14:07 aome510

Hmm, it would be good to dig deeper into that. Do you suspect it was a race condition in the pageserver? That sounds like a bug we should find and fix. I don't see anything inherently wrong with this test as it was.

Or maybe running multiple pgbench initializations just overloads the server, so that each initialization takes a long time? In that case, a longer statement_timeout would help. Or making things faster, if possible. The slowest SQL command that pgbench -i runs is probably a CREATE INDEX (or ALTER TABLE ADD PRIMARY KEY, which is essentially the same thing). That can legitimately take a long time, proportional to the scale factor. How long does it take in this test? How close is it to statement_timeout?

I did some investigation. It looks like it was caused by an error when creating a branch from an "invalid" starting LSN. To recap what I found while debugging this issue:

  • run a workload (pgbench init) on a branch b0
  • create a branch b1 based on b0's latest flush LSN 0/16BA000; the ancestor's last record LSN at the time of creation was 0/16B8DC0
  • start b1; the safekeeper refuses to start WAL streaming for b1:
Requested streaming from 0/16BA000, which is before the start of the timeline 0/16BA018
  • b1 cannot receive WAL because of the safekeeper, but the flush LSN stored inside b1's compute node still advances: it starts at 0/16BA018 and increases to 0/16BA050 -> 0/16C0000 -> 0/16C01C8 (maybe because of the background writer)
  • As a result, creating b2 from b1 based on the latest flush LSN times out: no new WAL is processed, hence the last_record_lsn of b2 never exceeds the requested starting LSN

This issue should be reproducible with the changes in https://github.com/neondatabase/neon/tree/use-latest-lsn-on-create-branch by running this command:

RUST_LOG=debug ./scripts/pytest -k test_branching_with_pgbench\[ca

My wild guess is that the starting LSN used to create b1 is invalid. There is probably a mismatch between what is stored on disk by neon and what the compute node perceives as flushed. 0/16BA000 should probably be rounded up/down to the nearest valid LSN.

Tbh, I currently have no idea about the actual reason behind this, so I will need to dig a bit deeper.

I only know how the branch is created on the neon side (populating metadata, creating the timeline directory). However, I have no idea how it is initialized on the postgres side, as I'm not too familiar with that code. Looking at the log for branch b1,

2022-07-29 21:04:37.641 GMT [88384] LOG:  starting with zenith basebackup at LSN 0/16BA018, prev 0/0

It seems to determine the timeline's start LSN as 0/16BA018, but I have no idea where that number comes from. I would appreciate it if anyone could point me to instructions/places to better understand the PG initialization process for a new branch.

cc @hlinnaka

aome510 avatar Jul 29 '22 21:07 aome510

  • start b1; the safekeeper refuses to start WAL streaming for b1:
Requested streaming from 0/16BA000, which is before the start of the timeline 0/16BA018

I was able to find the reason behind this error. When generating a pg_control file during basebackup, we normalize the checkpoint.redo LSN:

        // Generate new pg_control needed for bootstrap
        checkpoint.redo = normalize_lsn(self.lsn, pg_constants::WAL_SEGMENT_SIZE).0;
/// If LSN points to the beginning of the page, then shift it to first record,
/// otherwise align on 8-bytes boundary (required for WAL records)
pub fn normalize_lsn(lsn: Lsn, seg_sz: usize) -> Lsn {
    if lsn.0 % XLOG_BLCKSZ as u64 == 0 {
        let hdr_size = if lsn.0 % seg_sz as u64 == 0 {
            XLOG_SIZE_OF_XLOG_LONG_PHD
        } else {
            XLOG_SIZE_OF_XLOG_SHORT_PHD
        };
        lsn + hdr_size as u64
    } else {
        lsn.align()
    }
}

So if the specified starting LSN is divisible by XLOG_BLCKSZ (as 0/16BA000 is in the above example), the checkpoint redo LSN will be larger than the starting LSN. Because the compute node relies on checkpoint.redo as the WAL beginning LSN, this results in the error above.
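
For a concrete check of the arithmetic, here is a small Python mirror of normalize_lsn (a sketch on my part; the constants are the standard Postgres WAL values, assuming the default 16 MB segment size). It shows how the page-aligned 0/16BA000 becomes the 0/16BA018 seen in the log:

XLOG_BLCKSZ = 8192                     # WAL page size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024    # default WAL segment size
XLOG_SIZE_OF_XLOG_SHORT_PHD = 24       # short page header
XLOG_SIZE_OF_XLOG_LONG_PHD = 40        # long page header (first page of a segment)

def normalize_lsn(lsn: int, seg_sz: int = WAL_SEGMENT_SIZE) -> int:
    # Mirrors the Rust normalize_lsn above.
    if lsn % XLOG_BLCKSZ == 0:
        hdr = XLOG_SIZE_OF_XLOG_LONG_PHD if lsn % seg_sz == 0 else XLOG_SIZE_OF_XLOG_SHORT_PHD
        return lsn + hdr
    return (lsn + 7) & ~7              # round up to an 8-byte boundary

lsn = 0x016BA000                       # '0/16BA000': page-aligned, not segment-aligned
print(hex(normalize_lsn(lsn)))         # 0x16ba018, i.e. '0/16BA018', past the requested start LSN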

I was able to consistently reproduce the above scenario using a simpler test in https://github.com/neondatabase/neon/pull/2189/files#diff-b8262ab057cd69a7bb4f457b8f0191c3337cb1571c6b659cc5afe4378c2a3880R96.

Should we fix this by calling normalize_lsn before branching?

cc @hlinnaka

aome510 avatar Aug 09 '22 08:08 aome510

Should we fix this by calling normalize_lsn before branching?

Updated in https://github.com/neondatabase/neon/pull/2189/commits/e4c3a3309349b4260780fb12e8241c217896e478

aome510 avatar Aug 10 '22 02:08 aome510

Rebased. This PR looks good to me, but I believe we still need to resolve https://github.com/neondatabase/neon/pull/2189 before this can be merged.

hlinnaka avatar Sep 05 '22 11:09 hlinnaka

Rebased again. https://github.com/neondatabase/neon/pull/2189 has now been resolved, so unless some new blockers arise for this PR, this is ready to be committed.

hlinnaka avatar Oct 07 '22 13:10 hlinnaka