Epic: performance parity with vanilla on medium-sized databases (100-300GB)
- [ ] https://github.com/neondatabase/neon/issues/214
- [ ] fine-grained evicted LSN tracking https://github.com/neondatabase/postgres/pull/167
- [ ] https://github.com/neondatabase/neon/pull/1793
- [ ] https://github.com/neondatabase/neon/issues/1763
I tried to figure out where we spend most of the time during uploads of huge amounts of data. There are two obviously expected bottlenecks:
- compute<->pageserver communication overhead
- page reconstruction cost at the pageserver

The first one can be addressed using prefetch. For the second we need some smarter strategy for earlier page materialization.
But the results I got are a little bit confusing to me. First of all, it is necessary to note that I performed the experiments on my laptop, and the size of the database is ~10GB (pgbench -s 1000), not ~100GB. We have discussed several times that we need some infrastructure for such tests; there is even a separate epic for it. But I would prefer to have just some VM (or several VMs) where it is possible to launch Neon manually (without any deploy) and have easy access to any process with a debugger and profiler.
Maybe I want something strange, but at least for me it would be much more convenient to perform performance experiments if I had a dedicated environment for it.
In any case, right now I have performed only local experiments. And these results can be skewed compared with a real server environment because of network latency and SSD speed.
So I ran pgbench -i -s 1000 with wal_log_hints=off and shared_buffers=128MB.
Results are the following:
Vanilla:
done in 378.62 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 217.64 s, vacuum 79.04 s, primary keys 81.94 s).
Neon:
done in 571.23 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 206.72 s, vacuum 195.64 s, primary keys 168.86 s).
Neon without page reconstruction overhead (I just patch pageserver to take pages directly from files in vanilla pgdatadir):
done in 583.15 s (drop tables 0.01 s, create tables 0.03 s, client-side generate 274.95 s, vacuum 149.43 s, primary keys 158.73 s).
Neon without page reconstruction and with prefetch enabled:
done in 568.17 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 288.80 s, vacuum 116.83 s, primary keys 162.52 s).
What can we see here?
- Difference between vanilla and Neon is not so dramatic here.
- Looks like page reconstruction doesn't have significant impact on performance.
- Prefetching doesn't help much.
So the not-so-large (<2x) difference between vanilla and Neon cannot be reduced by a smarter page materialization algorithm or by prefetching. Certainly there is an open question concerning the efficiency of the prefetching implementation, as well as of this simulation of materialized pages. Profiling doesn't show anything interesting: the pageserver is spending time in flushing (fsync) and reading data (from files). The CPU is not fully used, simply because most of the time only two processes are working: the postgres backend and the pageserver. But index creation spawns an extra parallel worker to load and sort tuples by key.
Difference between vanilla and Neon is not so dramatic here.
Interesting. Can you try to reproduce the staging/perf settings in your setup? E.g., set precisely the same settings as on staging (you can start a node in the perf env and dump pg_settings() there) and probably restrict the compute node to one CPU -- I think it is possible to do it via cgroups (https://www.cybertec-postgresql.com/en/linux-cgroups-for-postgresql/)
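For reference, a minimal sketch of what that could look like with cgroup v2 (the mount point, the cgroup name compute-limit and the COMPUTE_PID variable are assumptions; on some systems the cpu controller first has to be enabled via the parent's cgroup.subtree_control):

```bash
# Create a cgroup and cap it at one full CPU (quota 100ms per 100ms period).
sudo mkdir /sys/fs/cgroup/compute-limit
echo "100000 100000" | sudo tee /sys/fs/cgroup/compute-limit/cpu.max

# Move the compute postmaster into the cgroup; backends forked afterwards
# inherit it, while already-running children would have to be moved as well.
echo "$COMPUTE_PID" | sudo tee /sys/fs/cgroup/compute-limit/cgroup.procs
```

A simpler alternative is pinning to a single core with `taskset -pc 0 "$COMPUTE_PID"`, though that limits CPU affinity rather than CPU time.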
Will do.
But I do not think that any parameters other than shared_buffers, wal_log_hints and max_replication_write_lag can significantly affect performance.
I tried to set no_sync=true for the safekeepers, but it doesn't significantly change the results.
Concerning limiting the number of CPUs: this workload in any case doesn't load all cores. Without the "avoid page reconstruction" hack there are three processes consuming CPU:
pageserver: 50-90% CPU
backend: 30-50% CPU
wal_redo: 20-30% CPU
Profiling with the "Monte Carlo" approach (randomly attaching gdb and inspecting the stack) shows that vacuum spends most of its time in read_page_at_lsn, while the pageserver is performing compaction and the compaction thread is blocked in fsync.
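This kind of sampling can be scripted; a rough sketch of the same "Monte Carlo" approach, assuming the pageserver process can be found by name:

```bash
# Poor man's profiler: periodically attach gdb, dump all thread stacks, detach.
# The most frequent stacks in stacks.txt show where the time goes.
PID=$(pgrep -o pageserver)
for i in $(seq 1 30); do
    gdb -p "$PID" -batch -ex 'thread apply all bt' 2>/dev/null
    sleep 1
done > stacks.txt
```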
I still think that a slow laptop SSD may give a different picture than a server would. This is why I wanted to run the test on tmpfs, to avoid disk write overhead. But the size of the data stored by pgbench -i -s 1000 (34GB) doesn't allow me to do that on my laptop.
Also, I failed to find a reference to the original results comparing Neon vs. vanilla on a large data set. Are they present in some ticket?
In any case, vacuum takes less than 1/3 of the total time of the pgbench initialization phase. Even if we completely eliminated it (which is certainly not possible), we would speed things up by no more than 30%.
Index construction uses parallel reads from logical tapes, and prefetch is currently not used there. In principle, it could also be added for logical tape reads, but parallel reads from multiple workers reduce the effect of prefetch.
I found a problem in my new prefetch implementation which caused redundant page loads. But prefetch is still not able to significantly reduce database population time. Let's see:
| Measurement | Time (sec) |
|---|---|
| vanilla vacuum time | 79 |
| neon vacuum time (w/o prefetch) | 258 |
| neon vacuum time (with prefetch) | 168 |
| get_page total time at pageserver | 271 |
| wal_redo total time | 129 |
| zenith_read total wait time | 298 |
As you can see, the difference in vacuum time with and without prefetch (258 - 168 = 90 s) is about the same as the vanilla vacuum time (79 s). It means that prefetch really works and allows page loading and page processing to be performed in parallel.
But it still doesn't significantly change the total time of pgbench -i, which now seems to be mostly determined by the backpressure settings.
Actually, we are in the same position here as two months ago, when I first tried to adjust the backpressure parameters.
I found that decreasing max_replication_write_lag from 500MB to 10MB almost doubles the time of pgbench database initialization. This is why I started looking toward a smarter throttling algorithm.
Then we implemented various tests and benchmarks for backpressure. And... these tests don't reproduce the problem: they do not show a correlation between max_replication_write_lag and insertion speed.
But right now, running pgbench -i -s 1000, I still observe the same problem.
The time of database initialization with max_replication_write_lag=15MB (as currently used on staging) is significantly larger (888 vs. 579 seconds) than with max_replication_write_lag=500MB.
Should I resurrect the topic of choosing a throttling algorithm?
More investigation of where time is spent. Now just consider a sequential scan: it is simpler than monitoring the vacuum stage of pgbench initialization:
| process | operation | time (sec) |
|---|---|---|
| compute backend | smgr read wait time | 58 |
| pageserver | get_page_at_lsn | 78 |
| wal_redo postgres | postgres | 37 |
Query execution time:
seqscan (neon): 99
seqscan (vanilla): 44
So it illustrates that the pageserver is just too slow: it is retrieving pages much more slowly than the compute is processing them. This is why prefetch doesn't help much: it allows the compute and the pageserver to run in parallel, but query execution time will still be the maximum of their times.
Why is the pageserver so slow? The profile is the following:
4,68% serving Page Se libc.so.6 [.] __memmove_avx_unaligned_erms
4,35% serving Page Se [kernel.kallsyms] [k] copy_user_enhanced_fast_string
4,34% serving Page Se pageserver [.] core::ops::range::Range<Idx>::contains
4,07% serving Page Se pageserver [.] pageserver::layered_repository::layer_map::LayerMap::search
2,85% serving Page Se pageserver [.] <pageserver::layered_repository::delta_layer::DeltaLayer as pageser…
2,69% serving Page Se pageserver [.] <pageserver::layered_repository::image_layer::ImageLayer as pageser…
1,77% serving Page Se pageserver [.] pageserver::page_cache::PageCache::try_lock_for_read
1,73% serving Page Se [kernel.kallsyms] [k] syscall_exit_to_user_mode
1,70% serving Page Se libc.so.6 [.] pthread_rwlock_rdlock@GLIBC_2.2.5
1,63% serving Page Se [kernel.kallsyms] [k] psi_group_change
1,47% serving Page Se libc.so.6 [.] _int_malloc
1,37% serving Page Se [kernel.kallsyms] [k] clear_page_erms
1,09% serving Page Se libc.so.6 [.] pthread_rwlock_unlock@@GLIBC_2.34
1,03% serving Page Se libc.so.6 [.] __memcmp_avx2_movbe
0,97% serving Page Se libc.so.6 [.] _int_free
Looks like most of the time we are copying data between the system cache and the user process. Also, lookup in the layer map is quite expensive for this number of layers (scale 1000 = ~10GB database).
Using an R-tree index doesn't help much.
Looks like the pageserver is limited not by CPU, but by the speed of reading data from disk. According to the Prometheus metrics, we spent about 9 seconds in read() - out of 72 seconds in get_page_at_lsn and ~150 seconds of query execution (this is without prefetch).
According to perf:
36% is spent in get_page_at_lsn, 20% in wal-redo, 12% in file read, 9% in socket read, 18% in socket write.
It is consistent with the Prometheus metrics if we take into account that the wal-redo time should also be added to get_page_at_lsn.
So for about half of the query execution time the pageserver is extracting pages, where 20% is spent in wal-redo and ~10% in disk IO. About ~30% is spent in network IO.
I wonder whether we can do some kind of brainstorming about future directions for Neon performance improvements? Right now I know of the following strategies (most of which I have already tested):
- Local file cache
  - provides performance comparable with vanilla
  - quite simple to implement and does not affect postgres code
  - does not solve the startup problem (when the cache is empty)
- Prefetch
  - up to 2x improvement in seqscan speed
  - uses the standard postgres prefetch mechanism
  - with parallel scan the improvement is much less: ~10%
  - requires an extension of the SMGR API
  - further improvement is limited by pageserver speed: the best an ideal prefetch can do is let compute and pageserver work in parallel, but if the pageserver is slower than the compute, query time will in any case be limited by the former
- Large or persistent disk page cache at the pageserver
  - quite simple to implement
  - doesn't solve the problem of the first access to a page: it will still be slow
- Some advanced strategies to force page materialization at the pageserver. For example, add a page eviction callback to the buffer manager and inform the pageserver when a page is thrown away from the compute cache. In this case we can start reconstruction of this page at the pageserver. The question is where to store the reconstructed page: the page cache? An image layer (we would need to materialize all pages)? Add the image to delta layers?
- Smarter strategy for compaction. Compaction is a very IO-intensive operation and may have a significant impact on overall system performance. Right now it is triggered just by the number of L0 layers. But if some page is updated in just one of these layers, then reshuffling them makes no sense. We could measure the "interaction" between layers and perform compaction only if it exceeds some threshold.
- Smarter throttling policy for backpressure. Right now it is just "stop and wait". A 500MB max_replication_write_lag is too large (it causes delays of minutes). And with a 15MB max_replication_write_lag the pgbench database population time is about 30-50% slower.
- Increase the chunk size of data exchange with the pageserver. It can be done just by increasing the Postgres page size from 8KB to something larger. Or preserve the original page size but use larger chunks (e.g. 1MB) for retrieving data from the pageserver; they can be stored in the local file cache. I tried the last option a long time ago, but it had almost no positive impact on performance.
Have I missed something? I didn't mention sharding here, because it seems to be a separate topic.
I wonder whether we can do some kind of brainstorming about future directions for Neon performance improvements?
+1, created a call tomorrow
Right now I know of the following strategies (most of which I have already tested):
Looks like we will need to implement prefetching anyway, regardless of pageserver changes. There are inherent network, context-switch, and CPU-scheduling latencies that we need to address anyway. So I would try to clean it up, review it and commit it.
The same with https://github.com/neondatabase/postgres/pull/191 I guess
Note/observation:
of the time spent in handle_get_rel_page_at_lsn (36.05%),
- 43.5% is spent in ImageLayer::get_value_reconstruct_data
- 19.1% is spent in tracing
- 13.6% is spent in LayerMap::search
- 10.3% is spent in DeltaLayer::get_value_reconstruct_data
The split in time spent between ImageLayer / DeltaLayer is very surprising to me when I consider the large amount of time we spend in the redo handler thread sending data to/from the PG redo process (20.17%).
The last-written LSN cache and a fast index for the layer map (#2310 - using an R-tree) are IMHO must-have features. Maybe not exactly the implementation I proposed (for example, I am not sure that using the rstar implementation of R-trees, with several hacks to make it work correctly with 128-bit integers, is the best solution), but something like this we definitely need to have.
Note/observation:
of the time spent in get_rel_page_at_lsn (36.05%),
* 43.5% is spent in ImageLayer::get_value_reconstruct_data
* 19.1% is spent in tracing
* 13.6% is spent in LayerMap::search
* 10.3% is spent in DeltaLayer::get_value_reconstruct_data
The split in time spent between ImageLayer / DeltaLayer is very surprising to me when I consider the large amount of time we spend in the redo handler thread sending data to/from the PG redo process (20.17%).
How and where were these results produced? 19% in tracing seems abnormal. Is it Prometheus metrics or just logging?
But my measurements also show that we spend a lot of time copying memory between system and user space, as well as within user space; plus memory allocation is also quite expensive.
For comparison, here's the flamegraph we're collecting as part of the 'benchmarks' run, from get_page_at_lsn() requests. This is a wallclock-time flamegraph, so it includes all the time spent from the moment a get_page_at_lsn() request is received from the client until the response has been sent, including all "non-CPU" time spent waiting on the WAL-redo process and time spent waiting on locks.
You can find this flamegraph from our CI runs, although it's pretty well hidden:
- Go to github actions and find the latest "Test and Deploy" workflow for 'main' branch. For example, https://github.com/neondatabase/neon/actions/runs/2903113533. (Note that this is not the Benchmarking workflow; that's different).
- Scroll down to Summaries, and expand "benchmarks (release, 1.58) summary"
- Download the python-test-performance-Linux-release-1.58-logs.tar.zst file
- Unarchive it; the flamegraph is in ./test_pgbench_flamegraph[45-10]/repo/flamegraph.svg
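From the command line the last two steps look roughly like this (assuming the artifact has already been downloaded, and that your tar has zstd support):

```bash
# unpack the logs artifact and open the flamegraph
tar --zstd -xf python-test-performance-Linux-release-1.58-logs.tar.zst
xdg-open './test_pgbench_flamegraph[45-10]/repo/flamegraph.svg'
```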
This flamegraph also shows "tracing" taking 1.40% of the overall time. That seems roughly similar to what @MMeent observed, when you account for the on-CPU vs wallclock nature.
That flamegraph we collect as part of the CI is for pgbench, including initialization but also running the actual test twice for 45 seconds, in simple-update mode and in select-only mode. So it's not the same workload as just the initialization step. We could easily add a flamegraph of just the initialization step there.
You can find this flamegraph from our CI runs, although it's pretty well hidden:
@bayandin, ideas on how we could make that more prominent? Can we add it to the allure report somehow?
How and where were these results produced?
These are pulled from your SVG file, by selecting handle_get_rel_page_at_lsn and then searching for ^.*ImageLayer.*get_value_reconstruct_data, tracing, ^.*LayerMap.*search and ^.*DeltaLayer.*get_value_reconstruct_data respectively; then noting down the percentage listed in the bottom right corner (which is the relative percentage of your search result within the current selection).
So these %s are %-time-spent-in-handle_get_rel_page_at_lsn, not overall runtime.
The 20.17% for the redo handler process was global, not within that selection.
This flamegraph also shows "tracing" taking 1.40% of the overall time. That seems roughly similar to what @MMeent observed, when you account for the on-CPU vs wallclock nature.
Sorry, but how can 19% of 36% be the same as 1.4%?
@bayandin, ideas on how we could make that more prominent? Can we add it to the allure report somehow?
Sure! I haven't added Allure reports to the performance tests yet. I will work on it this week and will make sure the flamegraphs are attached.
I'm getting 8.4% tracing from the flamegraph that you submitted here
Sorry, but how can 19% of 36% be the same as 1.4%?
That's the difference between perf flamegraphs, which usually count instructions, vs. the benchmark flamegraphs, which count wall-clock time spent (which accounts for memory stalls, IO ops, etc.).
As such, the shapes of the graphs are quite different between @knizhnik's graph and the one provided by @hlinnaka.
Meeting notes:
- use the captest env for tests -- @kelvich to provide instructions
- commit the LSN cache patch -- waits for tests fix -- @knizhnik
- polish the prefetch patch -- waits for a review -- @lubennikovaav
- try to load a few GB of pgbench init WAL into the pageserver and compare numbers -- @kelvich
- play with image layer materialization frequency -- @bojanserafimov
test:
time pgbench -i -s 100 'postgres://still-sunset-471531.cloud.neon.tech:5432/main'
time pgbench -i -s 100 localhost
I played around with this a bit today. Some observations:
pgbench roundtrips
pgbench -i uses COPY to initialize pgbench_accounts, which is the largest table. However, it uses single-row INSERT statements for all the smaller tables. With scale 100, the smaller tables contain 1100 rows in total.
If you run pgbench on your laptop, against a server in the cloud, that means 1100 roundtrips. If you're testing from Europe to a server that's in the US, the latency adds up. I'm seeing about 100ms latency from my home to the staging cluster. That accounts for 110 seconds of total time.
You can apply this patch: reduce-pgbench-init-roundtrips.patch.txt to fix that issue. It replaces the single-row INSERTs with just two statements that use generate_series() to generate the same data.
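For illustration, the two replacement statements presumably look something like this for scale 100 (a sketch based on pgbench's standard schema of one branch per scale unit and ten tellers per branch, not the literal patch contents; CONNSTR is a placeholder):

```bash
# populate the small tables with two statements instead of 1100 single-row INSERTs
psql "$CONNSTR" <<'SQL'
INSERT INTO pgbench_branches (bid, bbalance)
    SELECT bid, 0 FROM generate_series(1, 100) AS bid;
INSERT INTO pgbench_tellers (tid, bid, tbalance)
    SELECT tid, (tid - 1) / 10 + 1, 0 FROM generate_series(1, 1000) AS tid;
SQL
```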
logged vs unlogged tables
After fixing that, I compared initializing with unlogged tables vs logged. Unlogged tables in Neon are stored on compute-local disk, so that eliminates the safekeepers and pageservers from the picture.
$ pgbench --unlogged-tables -i -s100 postgres://[email protected]:5432/main
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 17.56 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 44.48 s (drop tables 0.11 s, create tables 0.47 s, client-side generate 18.41 s, vacuum 9.34 s, primary keys 16.15 s).
$ pgbench -i -s100 postgres://[email protected]:5432/main
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 24.80 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 171.29 s (drop tables 0.18 s, create tables 0.76 s, client-side generate 25.91 s, vacuum 41.54 s, primary keys 102.90 s).
So now there is not that much difference in the "client-side generate" phase, 25.91 s vs 18.41 s. That suggests that the main problem is not with the WAL digestion speed. I also looked at pg_backpressure_lsns()
while the COPY was running, and didn't see much lag.
Vacuum and index creation seem to be the slow-pokes now.
pgbench roundtrips
Sorry, maybe I missed something.
Yes, pgbench issues separate inserts to populate the small tables.
But all of them are performed in one transaction, aren't they?
If so, why do we need a separate roundtrip for each insert?
An insert just adds a record to the WAL; the WAL should be sent in 128KB chunks to the safekeepers and then to the pageserver. It should be almost the same as insert into ... select generate_series(...), which is slower than COPY but still fast enough.
@hlinnaka
You can apply this patch: reduce-pgbench-init-roundtrips.patch.txt to fix that issue.
Looks like this patch turns initGenerateDataClientSide into initGenerateDataServerSide. Am I correct? If so, it can be enabled without the patch by passing --init-steps dtGvp to pgbench directly. Capital G stands for server-side generate; the other letters match the default values. It helped me a lot to reduce total initialization time while running against a compute located in the US.
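For example (reusing the connection string from the test above):

```bash
# 'G' = server-side data generation; the other init steps are the defaults
# (drop tables, create tables, vacuum, primary keys)
time pgbench -i -s 100 --init-steps=dtGvp 'postgres://still-sunset-471531.cloud.neon.tech:5432/main'
```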
@knizhnik
If so, why do we need a separate roundtrip for each insert?
It is all about roundtrips between the client (pgbench) and the compute. Init is indeed performed in one transaction, but each query is sent separately with its own PQexec, which is blocking and waits for the query to complete. So there will be a separate roundtrip for each of the nbranches and ntellers rows. If the compute you're talking to is far away, the latency of these calls will take a significant part of the total initialization time.
Looks like this patch turns initGenerateDataClientSide into initGenerateDataServerSide. Am I correct?
Almost. Yes, it turns the initialization of the small tables to be equal to initGenerateDataServerSide. That's where I copy-pasted it from :-). But there's still a difference in how it initializes the big pgbench_accounts table. In initGenerateDataClientSide, it streams the data to the server with COPY. In initGenerateDataServerSide, it generates the data completely in the server.
I'm not sure if that difference is significant in our case. I think initGenerateDataServerSide might materialize the whole dataset in a temp file first, and load it into the table only after that, but I haven't double-checked that.
If so, it can be enabled without the patch by passing --init-steps dtGvp to pgbench directly. Capital G stands for server-side generate; the other letters match the default values. It helped me a lot to reduce total initialization time while running against a compute located in the US.
Yep, that's certainly easier than patching pgbench.
It is all about roundtrips between client and compute.
Sorry, that was a stupid comment from my side. Certainly, generating the data on the server side is the fastest solution. But it works mostly for synthetic data (like in pgbench). It is not applicable to real data coming from different data sources. In some cases COPY will help. But if the application logic requires issuing a separate SQL command for each row, then we have the same problem as with pgbench without the G init step.
The most obvious solution is to move the client to the same cloud too. But that is not always possible.
Another approach to reduce latency: support multiple regions.
The general solution for a slow network is to use pipeline mode: https://www.cybertec-postgresql.com/en/pipeline-mode-better-performance-on-slow-network/ but it requires changes to the application code.
That can be avoided by setting up a client-side proxy which simulates "pipeline mode" by immediately sending an OK response. I wonder if such proxies are already implemented and used by somebody...
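As a side note, pgbench itself can exercise pipeline mode in custom scripts (PostgreSQL 14+, extended query protocol required), which gives a feel for how much batching saves on a high-latency link. The script below is a hypothetical example, not part of the initialization workload:

```bash
cat > pipeline_demo.sql <<'EOF'
\set tid random(1, 10 * :scale)
\startpipeline
UPDATE pgbench_tellers SET tbalance = tbalance + 1 WHERE tid = :tid;
UPDATE pgbench_tellers SET tbalance = tbalance - 1 WHERE tid = :tid;
\endpipeline
EOF
# Both UPDATEs are sent before waiting for any result, so the batch costs a
# single roundtrip instead of two.
pgbench -n -M extended -f pipeline_demo.sql -t 100 "$CONNSTR"
```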
Vacuum and index creation seem to be the slow-pokes now.
It would be nice to try it with prefetch:
https://github.com/neondatabase/neon/pull/2369
Locally, on my laptop, I do not see any noticeable effect of prefetch on pgbench -i time.
It is a little bit strange that in my case the vacuum time is about two times larger than the index build time, while in your case the ratio is the opposite.
Please note that with the default parameters (wal_log_hints=on) and the last-written LSN cache, prefetch is almost useless, because frequent advances of last_written_lsn render prefetched pages unusable (their LSN is smaller than the requested one).
With wal_log_hints=off, the results of pgbench -i -s 100 on my laptop are the following:
| prefetch | pgbench -i -s 100 |
|---|---|
| off | done in 57.96 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 23.19 s, vacuum 20.95 s, primary keys 13.81 s) |
| on | done in 49.25 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 24.05 s, vacuum 15.09 s, primary keys 10.10 s) |
Another comparison:
Neon on staging environment:
~$ pgbench -i -s 500
dropping old tables...
creating tables...
generating data (client-side)...
50000000 of 50000000 tuples (100%) done (elapsed 193.72 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 598.78 s (drop tables 0.10 s, create tables 0.42 s, client-side generate 196.27 s, vacuum 24.22 s, primary keys 377.76 s).
On a different widely-used cloud provider, with smallest possible instance size:
~$ pgbench -i -s 500
dropping old tables...
creating tables...
generating data (client-side)...
50000000 of 50000000 tuples (100%) done (elapsed 176.07 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 211.84 s (drop tables 0.17 s, create tables 0.70 s, client-side generate 177.64 s, vacuum 0.87 s, primary keys 32.46 s).
Index creation is the elephant in the room here. And vacuum, to a lesser extent. I'd love to repeat this with the prefetching patch. Can we get that into staging with some feature flag, or to the cap test environment?