neon Collapse WAL records done inside one transaction into page image

Sep 13 '22 20:09 knizhnik

Results at my laptop (default settings but wal_log_hints=off):

pgbench -i -s 100:

main:
done in 52.28 s (drop tables 0.00 s, create tables 0.05 s, client-side generate 16.41 s, vacuum 20.43 s, primary keys 15.38 s).
collapase_wal_records:
done in 45.74 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 15.57 s, vacuum 14.86 s, primary keys 15.29 s).

So the difference is about 30% speed-up of vacuum (because of elimination of page reconstruction). Index building time (at my computer) is mostly limited by disk write speed and backpressure. So there is almost no difference.

Sep 14 '22 05:09 knizhnik

More performance comparison results:

pgbench -s -s 1000:

main:
done in 569.47 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 184.40 s, vacuum 192.11 s, primary keys 192.95 s).
collapase_wal_records:
done in 598.68 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 236.35 s, vacuum 164.52 s, primary keys 197.80 s).

Not using COPY (multiinsert):

create table t(x integer);
insert into t values (generate_series(1,10000000));
select sum(x) from t;

main:
   insert: 7806,240 ms
   select: 7562,439 ms
   tenant size: 976M
collapase_wal_records:
   insert: 8899,540 ms
   select: 2432,606 ms
   tenant size: 424M

So in case of individual inserts effect is much noticeable: two times reduced storage size and more than 3 time reduced time of table traversal.

Sep 14 '22 06:09 knizhnik

The safe way to do this is to have a global view of the whole WAL stream, and only coalesce records between commit records. For example, imagine this WAL stream:

1: HEAP_INSERT 2: BTREE_INSERT 3: HEAP_INSERT 4: BTREE_INSERT 5: COMMIT 6: HEAP_INSERT 7: BTREE_INSERT

You could rewrite the LSNs on records 1-4, without affecting what snapshots see, because there are no commit records in that range. For example, you could rewrite the above into:

1: HEAP_INSERT 1: BTREE_INSERT 1: HEAP_INSERT 1: BTREE_INSERT 5: COMMIT 6: HEAP_INSERT 6: BTREE_INSERT

If you do that, you can coalesce records 1. and 3., and requests 2 and 4. That's safe, because they all now have the same LSN, so any PITR will see all of the records or none of them.

Sep 14 '22 07:09 hlinnaka

The safe way to do this is to have a global view of the whole WAL stream, and only coalesce records between commit records.

But I still do not understand why it is necessary to avoid commits of other transactions. The problem of enforcing heap-index consistency is clear to me. But why do we need to check original WAL stream for presence of commits? Assume that there are two concurrent transactions A and B which AAL records are intermixed in WAL.

If them do not update the same pages, then we can collapsed records of both of them independently, doesn't matter when any of them is committed. If them are updating the same pages, then we have checks that all collpased records beongs to to one transaction. So if we have:

A: INSERT R1 A: INSERT R2 B: INSERT R3 A: INSERT R4

then we will collapse R1 and R2 but not R3 and R4.

Sep 14 '22 07:09 knizhnik

The safe way to do this is to have a global view of the whole WAL stream, and only coalesce records between commit records.

But I still do not understand why it is necessary to avoid commits of other transactions. The problem of enforcing heap-index consistency is clear to me. But why do we need to check original WAL stream for presence of commits?

Well, maintaining the heap-index consistency is hard for starters. The simplest safe way to do that that I can see is what I described in the earlier comment. Maybe you can do better, and coalesce records even across commit records, but it seems hard to get it right.

Sep 14 '22 08:09 hlinnaka

I guess, but makes it pretty limited.

I have only one scenario in my mind where such optimization can be useful: bulk loading. Index build is stored in WAL using FPI records which we already stored as images in pagestore. So we need to optimize only appending data to heap. And it is done only by INSERT/MULTIINSERT records, isn't it?

Hmm, I think it could cause confusion if have a read-only replica that fetches the page between those records.

I still do not completely understand what do we mean by read-only replicas. Is it just read-only snapshot (branch). Or is it master's follower which receives WAL from master node for invalidation? If replica is receiving pages from pageserver, then it can not get different content than master. Just because there is single instance of the page stored in pageserver which can be delivered both to master and replica compute nodes. And the fact that master really has different page image at the particular LSN than we sent to replica seems to be not important/critical.

The simplest safe way to do that that I can see is what I described in the earlier comment

I do not see role of commits in this heap-index consistency problem. Let look at your example:

HEAP_INSERT row 'foo' (transaction A)
BTREE_INSERT ptr for row 'foo' (transaction A)
Commit transaction B
HEAP_INSERT row 'bar' (transaction A)
BTREE_INSERT ptr for row 'bar'

The problem with heap-index consistency is present despite to whether we commit transaction B at step 3 and even presence of transaction B at all.

Sep 14 '22 08:09 knizhnik

I am sorry that my explanation are so large and may be unclear. I try to enumerate them briefly:

Presence of other transaction is not important because we can reply on postgres isolation mechanism.
Heap-index consistency problem can be solved by placing heap inserts before index updates.
If we collapse only subsequent WAL records affecting the same page, then we will use many possibile cases were heap updated are intermixed for example with FSM updates or independent updates performed by concurrent transactions (assume that we are inserting data in some large table while vacuum is processing other tables).
Page reconstruction can only be called in some background task and should not slow down wal receiver.

Sep 14 '22 08:09 knizhnik

Hmm, I think it could cause confusion if have a read-only replica that fetches the page between those records.

I still do not completely understand what do we mean by read-only replicas. Is it just read-only snapshot (branch). Or is it master's follower which receives WAL from master node for invalidation? If replica is receiving pages from pageserver, then it can not get different content than master. Just because there is single instance of the page stored in pageserver which can be delivered both to master and replica compute nodes.

I was thinking of a replica that follows the primary.

And the fact that master really has different page image at the particular LSN than we sent to replica seems to be not important/critical.

It gives me the jeebies. When the transaction later commits, you need to see the newer version of the page, so you've got to be careful to invalidate it. Yes, you can do that, but it gets harder to do that correctly, if there can be two different versions of a page with the same LSN in the system, such that one of them is actually later than the other one. I bet we will get confused sooner or later.

Let me walk through another example with a branch. Let's say you have this WAL stream:

LSN: record

0x01: HEAP_INSERT 'foo' at tuple slot 1 (transaction A) 0x02: BTREE_INSERT 1 0x03: COMMIT (transaction B) 0x04: HEAP_INSERT 'bar' at tuple slot 2 (transaction A) 0x05: BTREE_INSERT 2 0x06: COMMIT (transaction A)

Now create a branch at LSN 0x03. On the branch, you insert an unrelated tuple to slot 2. So from the branch, the history looks like this:

0x01: HEAP_INSERT 'foo' at tuple slot 1 (transaction A) [parent branch] 0x02: HEAP_INSERT 'FOOBAR' at tuple slot 2 [child branch]

Next, you coalesce records 0x01 and 0x04 on the parent branch, so that the parent history now looks like this:

0x01: HEAP_INSERT 'foo' at slot 1 AND 'bar' at slot 2 (transaction A) 0x02: BTREE_INSERT 1 0x03: COMMIT (transaction B) 0x05: BTREE_INSERT 2 0x06: COMMIT (transaction A)

As you said, that works OK on the parent branch, because PostgreSQL MVCC ensures that tuple 2 is invisible before LSN 0x06. However, since you "rewrote history", it doesn't work right from the child branch anymore. From the point of view of the child branch, the history now looks like this:

0x01: HEAP_INSERT 'foo' at slot 1 AND 'bar' at slot 2 (transaction A) [parent branch] 0x02: HEAP_INSERT 'FOOBAR' at tuple slot 2 [child branch]

Replaying the second record will fail, because there's already a tuple at slot 2.

Yeah, you could probably plug that hole too, by not doing the coalescing if a branch was created in between. But this feels like a game of whack-a-mole. I don't want to plug bugs like this one by one, by adding more conditions on when we can or cannot coalesce. This rewriting of history is too dangerous.

We can coalesce records when they arrive, before we make them visible to any branches or getpage requests. But as soon as we let someone see a page version with a particular LSN, we must always return the same image of the page to subsequent requests with the same LSN. Otherwise we will get confused and will experience be hard-to-find bugs.

Sep 14 '22 09:09 hlinnaka

I was thinking of a replica that follows the primary.

Frankly speaking I still do not sure how such replica can be maintained and do we need such replica at all. Should it receive WAL stream from master compute or from safekeepers? Should this WAL stream is used only for invalidation orwe can also use it to update local cache? Receiving data both as wAL stream and pages reconstructed by pageserver seems to be really strange: network traffic is increased twice! But it is separate topic. I hope we can discuss it at offsite and may be come to soe better understanding of purposes and maintenance of read-only replicas in Neon architecture. So lets better discuss possible problems with branching and collapsed WAL records.

Sep 14 '22 10:09 knizhnik

I understand your example with branching. But what s the role of "0x03: COMMIT (transaction B)" in this story? Looks like the some problem persists even if transaction B is not present al all.

As far as we can create branch post-factum, it is not possible to check if branching is done at the place where we check is records can be collapsed.

The actual source of the problem is that generated image is assigned first LSN, not last one as it was done originally) to preserve index-heap consistency. Right now I can not propose any solution: need to think more about it.

Sep 14 '22 10:09 knizhnik

Actually I wonder how we managed to insert record at slot 2 in child branch? Isn't marked as occupied in this history so we have to choose different slot for inserted record?

Sep 14 '22 10:09 knizhnik

Compute node at branch with LSN 3 starts with page image where slot 2 is occupied. So it should not try to insert any new record in this slot. May be I missed something but it seems to be that scenario you have described is not possible.

Actually what is the analog of branching in vanilla postgres? It is standard crash-recover procedure. Assume that crash happens at LSN=3 and recovery is initiated. Correctness of recovery in Postgres is based on proper order of replaying WAL. But collapsing several WAL records into one page image we definitely breaks this order. The question is whether it is critical/fatal or not. I think that performing collapsing only for insert records and placing heap updates prior to any other updates makes this procedure correct. I can not prove it, but I do not see any scenario contradicting it.

Sep 14 '22 12:09 knizhnik

Compute node at branch with LSN 3 starts with page image where slot 2 is occupied.

Not necessarily. The compute node on the child branch can start up before LSN 0x04 on the parent branch has even been generated yet.

Sep 14 '22 12:09 hlinnaka

Not necessarily. The compute node on the child branch can start up before LSN 0x04 on the parent branch has even been generated yet.

Yes, it can. But what can get wrong then? So we create new branch at LSN=3. Pageserver waits until LSN=3 is arrived and spawns new compute node. This compute node tries to download this page. Assume that in-mem layer is not yet checkpointed, so no collapsing is performed. So we reconstruct page containing just fist insert (slot 1). Then we store 'FOOBAR' in slot 2. And then pageserver perform checkpoint, collapse records 1 and 4 and stores this page image in parent timeline. So we have two versions of this page: in parent timeline containing 'bar' in stol 2 and in child timeline containing 'foobar' in slot 2. When we request this page i parent timeline, we get first version of the page; when in child timeline - second version of the page. I do not see any contradiction or problem here.Child timeline will never try to download first version of the page because there is successor version of this page with greater LSN.

And if record collapsing happens before child timeline requests with page, then 'foobar' will be allocated i slot 3. and once again - I do not see any problems here.

Sep 14 '22 12:09 knizhnik

Child timeline will never try to download first version of the page because there is successor version of this page with greater LSN.

Replaying the WAL record on the child timeline requires fetching the previous full image of the page. From the parent timeline, in this case.

Sep 14 '22 12:09 hlinnaka

Replaying the WAL record on the child timeline requires fetching the previous full image of the page. From the parent timeline, in this case.

Sure. This page image may contain just first insert.Or it can contain both inserts depending on whether checkpoint happens before this page is requested. So content of the page will be different. But no error should happen. Or I am missing something?

Sep 14 '22 13:09 knizhnik

Child timeline will never try to download first version of the page because there is successor version of this page with greater LSN.

Replaying the WAL record on the child timeline requires fetching the previous full image of the page. From the parent timeline, in this case.

Now I understand what you mean:( Branch compute node may request page image from the parent node twice: first time it contains only slot 1 and after collapsed sot 1 and 2 are used. So it is not safe to assign to the generated image LSN of first collapsed record. Only last one. But in this case we may have problem with invid tids in index pages.

So now I try to understand your proposal:

The safe way to do this is to have a global view of the whole WAL stream, and only coalesce records between commit records.

But I still do not understand why do you concern about commit records only. Assume that there is just one transaction:

0x01: HEAP_INSERT 'foo' at tuple slot 1 0x02: BTREE_INSERT 1 0x03: HEAP_INSERT 'bar' at tuple slot 2 0x04: BTREE_INSERT 2

we still can not collapse 1 and 3. If created image will be assign LSN 1, then we can get potential problem with branch on LSN 3. If it is assigned LSN 3, then B-Tree page created at LSN 2 will contain invalid TID.

So looks like the rule is that can collapse only subsequent WAL records, isn't it? To check it we do not need to observe global WAL stream. We know length of WAL records and their LSNs, so we can check that there is no gap between them (taken in account record alignment and page/segment boundaries).

Another simple solution will be not to try to collapse WAL records, but just materialize MULTIINSERT records. It will be enough to handle case of COPY. With restriction that we can collapse only subsequent WAL records, there are actually not so much other scenarios when record collapsing can be done. If we just perform normal inserts in the table with indexes, then this inserts will be intermixed with index updates. And tables without index is not so frequent use case.

what do you think?

Sep 15 '22 09:09 knizhnik

But I still do not understand why do you concern about commit records only. Assume that there is just one transaction:

0x01: HEAP_INSERT 'foo' at tuple slot 1 0x02: BTREE_INSERT 1 0x03: HEAP_INSERT 'bar' at tuple slot 2 0x04: BTREE_INSERT 2

we still can not collapse 1 and 3. If created image will be assign LSN 1, then we can get potential problem with branch on LSN 3. If it is assigned LSN 3, then B-Tree page created at LSN 2 will contain invalid TID.

Since there are no commit records, you can move forward the LSNs on all of the above records, so that you have this:

0x04: HEAP_INSERT 'foo' at tuple slot 1 0x04: BTREE_INSERT 1 0x04: HEAP_INSERT 'bar' at tuple slot 2 0x04: BTREE_INSERT 2

If you do that early, when the records ingested, you can later coalesce all records with the same LSN that affect the same page.

Sep 15 '22 10:09 hlinnaka

Assigning same ID to WAL records seems to be tricky to me. And actually it doesn't solve the problem: we need to enforce that heap pages are updated prior to index pages.We can permutate LSNs in thisway to achive it:

0x01: HEAP_INSERT 'foo' at tuple slot 1 0x03: BTREE_INSERT 1 0x02: HEAP_INSERT 'bar' at tuple slot 2 0x03: BTREE_INSERT 2

But still I find this dangerous... You example with branching - how we will handle for example branch at LSN=3 ?

Sep 15 '22 12:09 knizhnik

I continue my experiments with collapsing records and page reconstruction cost. The results above shows that it has quite noticeable impact on query execution time: about 30% for bulk inserts (pgbench -i) and more than 3 times for simple inserts (generate_series). I expect that it will also have large influence on sequential scan time. Especially take n in the account that pgbench -i -s 100 doesn't create enough delta layers (6) to trigger compaction (and image layers generation). But it is not the case! This PR allows to replace MULTIINSERT wal records with page images. But there is almost the same number of "VISIBLE cutoff" records (one per page) which has to be replayed by wal redo. As a result from ~12 seconds of query execution time we spent ~4 seconds in walredo. And collapsing wal records can not help here: just because there is just one record per page. It means that we should really think about eager page materialization.

Sep 15 '22 15:09 knizhnik

Looks like cost of page reconstruction is not so high as I expected. I have implemented version that replaces pairs of MULTI_INSERT+INIT and VISIBLE records with page images. It is not so trivial because COPY in pgbench -i first inserts MULTI_INSERT+INIT records and only then vacuum adds VISIBLE records. So in the WAL there is long sequence of MULTI_INSERT+INIT followed by VISIBLE records.Them do not fit all in one layers so we can not "merge" them. The problem can be solved by using "COPY WITH FREEZE". In this case WAL really contains pairs of MULTI_INSERT+INIT and VISIBLE records and almost all of them can be replaced with pages images. Bur achieved results are not so exited. I first do pgbench -i -s 100 and then perform seqscan (select sum(abalance) from pgbench_accounts) and read-only pgbench with 10 clients. Results:

| pageserver | seqscan } pgbench -S | | main | 8056 msec | 12949 tps | | this PR | 5883 msec | 15457 tps |

So improvement is about 20%. Looks like reading page image from the disk is also quite expensive operation. Pef flamegraph is attached. pgbench

Sep 23 '22 06:09 knizhnik

As discussed moving to draft

Mar 21 '23 15:03 LizardWizzard

neon neon copied to clipboard

Collapse WAL records done inside one transaction into page image

neon
neon copied to clipboard