neon icon indicating copy to clipboard operation
neon copied to clipboard

There should be no way of creating branch without specifying LSN, otherwise there is a race

Open yeputons opened this issue 2 years ago • 3 comments

Follow-up of the last failures in https://github.com/neondatabase/neon/issues/1667#issuecomment-1179001544, see also #2062.

Looking at our neon_local timeline branch command, it works entirely with pageserver: https://github.com/neondatabase/neon/blob/39d86ed29e9a2887e6dc339fbb0eeb6040c12cc8/neon_local/src/main.rs#L690-L710

Moreover, the test_tenant_relocation test runs it with --ancestor-branch-name parameter only, no --ancestor-start-lsn. So the pageserver decides what LSN to create the branch at. Hence, if the pageserver is not fully caught up with safekeepers, the branch starts a little earlier than it should, truncating parts of the last transaction (effectively erasing it from the second timeline, I guess).

For example, it may result in test_tenant_relocation failure even before there are relocations: it creates two branches, and the second branch is sometimes created slightly before the first is fully on the pageserver, resulting in divergence: https://app.circleci.com/pipelines/github/neondatabase/neon/7828/workflows/bce7b9dc-e390-4f42-8beb-1a2e81bf2b9d/jobs/80367

assert (500500,) == (1001000,)
  At index 0 diff: 500500 != 1001000
  Full diff:
  - (1001000,)
  + (500500,)
test_runner/batch_others/test_tenant_relocation.py:259: in test_tenant_relocation
    timeline_id_second, current_lsn_second = populate_branch(pg_second, create_table=False, expected_sum=1001000)
test_runner/batch_others/test_tenant_relocation.py:133: in populate_branch
    assert cur.fetchone() == (expected_sum, )
E   assert (500500,) == (1001000,)
E     At index 0 diff: 500500 != 1001000
E     Full diff:
E     - (1001000,)
E     + (500500,)

If so, I argue it's not a flaky test; it's a design bug. We should have precise semantics of what "branching off a named branch" is (probably in terms of commit_lsn/flush_lsn/whatever on Safekeepers), and there should be no way to violate it by misusing the API. For example, there should be no way to call the pageserver branch creation API without specifying the exact parent timeline and the exact LSN in it.

I suspect there may be other similar bugs lurking around; the pageserver is probably the source of truth in multiple places (see, e.g., #809). I suspect it would be impossible for our end users to debug unless we somehow expose both safekeepers/pageserver's LSNs, and it would be very hard if we do.

yeputons avatar Jul 08 '22 22:07 yeputons

This problem was mentioned before, see https://github.com/neondatabase/neon/issues/1371 proper solution requires rather big tweaks in console. E g when compute for the branch is running and you want to create a new branch you need to go to the compute, pick the current lsn here, and pass it to pageserver. I think this needs to be part of a console branching support epic

cc @ololobus

LizardWizzard avatar Jul 11 '22 08:07 LizardWizzard

when compute for the branch is running and you want to create a new branch you need to go to the compute, pick the current lsn here

That is racy anyway if compute is under load. Not sure how the API and UI should look like to the end user. Maybe we should always ask for the timestamp to branch? After that we will re-map it to LSN using the pageserver API and finally create a branch

Thanks for the reminder

ololobus avatar Jul 11 '22 09:07 ololobus

For example, it may result in test_tenant_relocation failure even before there are relocations: it creates two branches, and the second branch is sometimes created slightly before the first is fully on the pageserver, resulting in divergence: https://app.circleci.com/pipelines/github/neondatabase/neon/7828/workflows/bce7b9dc-e390-4f42-8beb-1a2e81bf2b9d/jobs/80367

This just happened again here: https://github.com/neondatabase/neon/runs/7413000700

hlinnaka avatar Jul 20 '22 11:07 hlinnaka

Current status:

  • test_tenant_relocations was fixed in #2131
  • We allow creating a "Head" branch in our web interface. It calls the createProjectBranch endpoint that uses Safekeeper's commit_lsn to figure out the LSN for branch creation: https://github.com/neondatabase/cloud/blob/e9936c1c62c41b1416f9f4a8f134a712b7d8a38b/goapp/internal/controllers/publicapiv2/branch_create.go#L483-L492 . No race condition here since https://github.com/neondatabase/cloud/pull/3003/files, which is good.
  • neon_local still goes directly to pageserver, race condition is possible: a user can do some SQL queries, then neon_local branch, and then see old data in the branch.

I think only these things need addressing:

  • [x] Our internal APIs can be better in not allowing branching without a specific LSN: https://github.com/neondatabase/neon/pull/3140
  • [x] neon_local branch should go to safekeepers for the branching LSN, just like the cloud: #3222
  • [x] Maybe some kind of communication to the users about race conditions. They should not encounter them unless they branch a database which has some active writes. It's an expected race condition, but it may be nice to mention in endpoint documents anyways: https://github.com/neondatabase/neon/issues/3141

yeputons avatar Dec 17 '22 02:12 yeputons

All PRs from the checklist above are created. As this is more of a tech debt issue, closing it now, it requires no further tracking.

yeputons avatar Dec 30 '22 18:12 yeputons