Epic: performance parity with vanilla on medium-sized databases (100-300GB)
- [ ] https://github.com/neondatabase/neon/issues/214
- [ ] fine-grained evicted LSN tracking https://github.com/neondatabase/postgres/pull/167
- [ ] https://github.com/neondatabase/neon/pull/1793
- [ ] https://github.com/neondatabase/neon/issues/1763
I tried to figure out where we spend most of the time during uploads of huge amounts of data. There are two obviously expected bottlenecks:
- compute<->pageserver communication overhead
- page reconstruction cost at the pageserver

The first one can be addressed using prefetch. For the second we need some smarter strategy for earlier page materialization.
But the results I got are a little bit confusing to me. First of all, it is necessary to note that I performed the experiments on my laptop, and the size of the database is ~10GB (pgbench -s 1000), not ~100GB. We have discussed several times that we need some infrastructure for such tests; there is even a separate epic for it. But I would prefer to have just some VM (or several VMs) where it is possible to launch Neon manually (without any deploy) and have easy access to any process with a debugger and profiler.
Maybe I want something strange, but at least for me it would be much more convenient to perform performance experiments if I had a dedicated environment for it.
In any case, right now I have performed only local experiments. And these results can be skewed compared with a real server environment because of network latency and SSD speed.
So I ran pgbench -i -s 1000 with wal_log_hints=off and shared_buffers=128MB.
Results are the following:
Vanilla:
done in 378.62 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 217.64 s, vacuum 79.04 s, primary keys 81.94 s).
Neon:
done in 571.23 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 206.72 s, vacuum 195.64 s, primary keys 168.86 s).
Neon without page reconstruction overhead (I just patch pageserver to take pages directly from files in vanilla pgdatadir):
done in 583.15 s (drop tables 0.01 s, create tables 0.03 s, client-side generate 274.95 s, vacuum 149.43 s, primary keys 158.73 s).
Neon without page reconstruction and with prefetch enabled:
done in 568.17 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 288.80 s, vacuum 116.83 s, primary keys 162.52 s).
What can we see here?
- Difference between vanilla and Neon is not so dramatic here.
- Looks like page reconstruction doesn't have significant impact on performance.
- Prefetching doesn't help much.
So the not-so-large (<2x) difference between vanilla and Neon cannot be reduced by a smarter page materialization algorithm or by prefetching. Certainly there is an open question concerning the efficiency of the prefetching implementation, as well as of this simulation of materialized pages. Profiling doesn't show anything interesting: the pageserver is spending time in flushing (fsync) and reading data (from files). The CPU is not fully used, simply because most of the time only two processes are working: the postgres backend and the pageserver. But index creation spawns an extra parallel worker to load and sort tuples by key.
Difference between vanilla and Neon is not so dramatic here.
Interesting. Can you try to reproduce the staging/perf settings in your setup? E.g., set precisely the same settings as on staging (you can start a node in the perf env and dump pg_settings() there) and probably restrict the compute node to one CPU -- I think it is possible to do it via cgroups (https://www.cybertec-postgresql.com/en/linux-cgroups-for-postgresql/)
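For reference, a minimal sketch of what that could look like with cgroup v2 (the mount point, the cgroup name compute-limit and the COMPUTE_PID variable are assumptions; on some systems the cpu controller first has to be enabled via the parent's cgroup.subtree_control):

```bash
# Create a cgroup and cap it at one full CPU (quota 100ms per 100ms period).
sudo mkdir /sys/fs/cgroup/compute-limit
echo "100000 100000" | sudo tee /sys/fs/cgroup/compute-limit/cpu.max

# Move the compute postmaster into the cgroup; backends forked afterwards
# inherit it, while already-running children would have to be moved as well.
echo "$COMPUTE_PID" | sudo tee /sys/fs/cgroup/compute-limit/cgroup.procs
```

A simpler alternative is pinning to a single core with `taskset -pc 0 "$COMPUTE_PID"`, though that limits CPU affinity rather than CPU time.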
Will do.
But I do not think that any parameters other than shared_buffers, wal_log_hints and max_replication_write_lag can significantly affect performance.
I tried to set no_sync=true for the safekeepers, but it doesn't significantly change the results.
Concerning limiting the number of CPUs: this workload in any case doesn't load all cores. Without the "avoid page reconstruction" hack there are three processes consuming CPU:
pageserver: 50-90% CPU
backend: 30-50% CPU
wal_redo: 20-30% CPU
Profiling with the "Monte Carlo" approach (randomly attaching gdb and inspecting the stack) shows that vacuum spends most of its time in read_page_at_lsn, while the pageserver is performing compaction and the compaction thread is blocked in fsync.
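This kind of sampling can be scripted; a rough sketch of the same "Monte Carlo" approach, assuming the pageserver process can be found by name:

```bash
# Poor man's profiler: periodically attach gdb, dump all thread stacks, detach.
# The most frequent stacks in stacks.txt show where the time goes.
PID=$(pgrep -o pageserver)
for i in $(seq 1 30); do
    gdb -p "$PID" -batch -ex 'thread apply all bt' 2>/dev/null
    sleep 1
done > stacks.txt
```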
I still think that a slow laptop SSD may give a different picture than a server would. This is why I wanted to run the test on tmpfs, to avoid disk write overhead. But the size of the data stored by pgbench -i -s 1000 (34GB) doesn't allow me to do that on my laptop.
Also, I failed to find a reference to the original results comparing Neon vs. vanilla on a large data set. Are they present in some ticket?
In any case, vacuum takes less than 1/3 of the total time of the pgbench initialization phase. Even if we completely eliminated it (which is certainly not possible), we would speed things up by no more than 30%.
Index construction uses parallel reads from logical tapes, and prefetch is currently not used there. In principle, it could also be added for logical tape reads, but parallel reads from multiple workers reduce the effect of prefetch.
I found a problem in my new prefetch implementation which caused redundant page loads. But prefetch is still not able to significantly reduce database population time. Let's see:
| Measurement | Time (sec) |
|---|---|
| vanilla vacuum time | 79 |
| neon vacuum time (w/o prefetch) | 258 |
| neon vacuum time (with prefetch) | 168 |
| get_page total time at pageserver | 271 |
| wal_redo total time | 129 |
| zenith_read total wait time | 298 |
As you can see, the difference in vacuum time with and without prefetch (258 - 168 = 90 s) is about the same as the vanilla vacuum time (79 s). It means that prefetch really works and allows page loading and page processing to be performed in parallel.
But it still doesn't significantly change the total time of pgbench -i, which now seems to be mostly determined by the backpressure settings.
Actually, we are in the same position here as two months ago, when I first tried to adjust the backpressure parameters.
I found that decreasing max_replication_write_lag from 500MB to 10MB almost doubles the time of pgbench database initialization. This is why I started looking toward a smarter throttling algorithm.
Then we implemented various tests and benchmarks for backpressure. And... these tests don't reproduce the problem: they do not show a correlation between max_replication_write_lag and insertion speed.
But right now, running pgbench -i -s 1000, I still observe the same problem.
The time of database initialization with max_replication_write_lag=15MB (as currently used on staging) is significantly larger (888 vs. 579 seconds) than with max_replication_write_lag=500MB.
Should I resurrect the topic of choosing a throttling algorithm?
More investigation of where time is spent. Now just consider a sequential scan: it is simpler than monitoring the vacuum stage of pgbench initialization:
| process | operation | time (sec) |
|---|---|---|
| compute backend | smgr read wait time | 58 |
| pageserver | get_page_at_lsn | 78 |
| wal_redo postgres | postgres | 37 |
Query execution time:
seqscan (neon): 99
seqscan (vanilla): 44
So it illustrates that the pageserver is just too slow: it is retrieving pages much more slowly than the compute is processing them. This is why prefetch doesn't help much: it allows the compute and the pageserver to run in parallel, but query execution time will still be the maximum of their times.
Why is the pageserver so slow? The profile is the following:
4,68% serving Page Se libc.so.6 [.] __memmove_avx_unaligned_erms
4,35% serving Page Se [kernel.kallsyms] [k] copy_user_enhanced_fast_string
4,34% serving Page Se pageserver [.] core::ops::range::Range<Idx>::contains
4,07% serving Page Se pageserver [.] pageserver::layered_repository::layer_map::LayerMap::search
2,85% serving Page Se pageserver [.] <pageserver::layered_repository::delta_layer::DeltaLayer as pageser…
2,69% serving Page Se pageserver [.] <pageserver::layered_repository::image_layer::ImageLayer as pageser…
1,77% serving Page Se pageserver [.] pageserver::page_cache::PageCache::try_lock_for_read
1,73% serving Page Se [kernel.kallsyms] [k] syscall_exit_to_user_mode
1,70% serving Page Se libc.so.6 [.] pthread_rwlock_rdlock@GLIBC_2.2.5
1,63% serving Page Se [kernel.kallsyms] [k] psi_group_change
1,47% serving Page Se libc.so.6 [.] _int_malloc
1,37% serving Page Se [kernel.kallsyms] [k] clear_page_erms
1,09% serving Page Se libc.so.6 [.] pthread_rwlock_unlock@@GLIBC_2.34
1,03% serving Page Se libc.so.6 [.] __memcmp_avx2_movbe
0,97% serving Page Se libc.so.6 [.] _int_free
Looks like most of the time we are copying data between the system cache and the user process. Also, lookup in the layer map is quite expensive for this number of layers (scale 1000 = ~10GB database).
Using an R-tree index doesn't help much.
Looks like the pageserver is limited not by CPU, but by the speed of reading data from disk. According to the Prometheus metrics, we spent about 9 seconds in read() - out of 72 seconds in get_page_at_lsn and ~150 seconds of query execution (this is without prefetch).
According to perf:
36% is spent in get_page_at_lsn, 20% in wal-redo, 12% in file read, 9% in socket read, 18% in socket write.
It is consistent with the Prometheus metrics if we take into account that the wal-redo time should also be added to get_page_at_lsn.
So for about half of the query execution time the pageserver is extracting pages, where 20% is spent in wal-redo and ~10% in disk IO. About ~30% is spent in network IO.
I wonder whether we can do some kind of brainstorming about future directions for Neon performance improvements? Right now I know of the following strategies (most of which I have already tested):
- Local file cache
  - provides performance comparable with vanilla
  - quite simple to implement and does not affect postgres code
  - does not solve the startup problem (when the cache is empty)
- Prefetch
  - up to 2x improvement in seqscan speed
  - uses the standard postgres prefetch mechanism
  - with parallel scan the improvement is much less: ~10%
  - requires an extension of the SMGR API
  - further improvement is limited by pageserver speed: the best an ideal prefetch can do is let compute and pageserver work in parallel, but if the pageserver is slower than the compute, query time will in any case be limited by the former
- Large or persistent disk page cache at the pageserver
  - quite simple to implement
  - doesn't solve the problem of the first access to a page: it will still be slow
- Some advanced strategies to force page materialization at the pageserver. For example, add a page eviction callback to the buffer manager and inform the pageserver when a page is thrown away from the compute cache. In this case we can start reconstruction of this page at the pageserver. The question is where to store the reconstructed page: the page cache? An image layer (we would need to materialize all pages)? Add the image to delta layers?
- Smarter strategy for compaction. Compaction is a very IO-intensive operation and may have a significant impact on overall system performance. Right now it is triggered just by the number of L0 layers. But if some page is updated in just one of these layers, then reshuffling them makes no sense. We could measure the "interaction" between layers and perform compaction only if it exceeds some threshold.
- Smarter throttling policy for backpressure. Right now it is just "stop and wait". A 500MB max_replication_write_lag is too large (it causes delays of minutes). And with a 15MB max_replication_write_lag the pgbench database population time is about 30-50% slower.
- Increase the chunk size of data exchange with the pageserver. It can be done just by increasing the Postgres page size from 8KB to something larger. Or preserve the original page size but use larger chunks (e.g. 1MB) for retrieving data from the pageserver; they can be stored in the local file cache. I tried the last option a long time ago, but it had almost no positive impact on performance.
Have I missed something? I didn't mention sharding here, because it seems to be a separate topic.
I wonder whether we can do some kind of brainstorming about future directions for Neon performance improvements?
+1, created a call tomorrow
Right now I know of the following strategies (most of which I have already tested):
Looks like we will need to implement prefetching anyway, regardless of pageserver changes. There are inherent network, context-switch, and CPU-scheduling latencies that we need to address anyway. So I would try to clean it up, review it and commit it.
The same with https://github.com/neondatabase/postgres/pull/191 I guess
Note/observation:
of the time spent in handle_get_rel_page_at_lsn (36.05%),
- 43.5% is spent in ImageLayer::get_value_reconstruct_data
- 19.1% is spent in tracing
- 13.6% is spent in LayerMap::search
- 10.3% is spent in DeltaLayer::get_value_reconstruct_data
The split in time spent between ImageLayer / DeltaLayer is very surprising to me when I consider the large amount of time we spend in the redo handler thread sending data to/from the PG redo process (20.17%).
The last-written LSN cache and a fast index for the layer map (#2310 - using an R-tree) are IMHO must-have features. Maybe not exactly the implementation I proposed (for example, I am not sure that using the rstar implementation of R-trees, with several hacks to make it work correctly with 128-bit integers, is the best solution), but something like this we definitely need to have.
Note/observation:
of the time spent in get_rel_page_at_lsn (36.05%),
* 43.5% is spent in ImageLayer::get_value_reconstruct_data
* 19.1% is spent in tracing
* 13.6% is spent in LayerMap::search
* 10.3% is spent in DeltaLayer::get_value_reconstruct_data
The split in time spent between ImageLayer / DeltaLayer is very surprising to me when I consider the large amount of time we spend in the redo handler thread sending data to/from the PG redo process (20.17%).
How and where were these results produced? 19% in tracing seems abnormal. Is it Prometheus metrics or just logging?
But my measurements also show that we spend a lot of time copying memory between system and user space, as well as within user space; plus memory allocation is also quite expensive.
For comparison, here's the flamegraph we're collecting as part of the 'benchmarks' run, from get_page_at_lsn() requests. This is a wallclock-time flamegraph, so it includes all the time spent from the moment a get_page_at_lsn() request is received from the client until the response has been sent, including all "non-CPU" time spent waiting on the WAL-redo process and time spent waiting on locks.
You can find this flamegraph from our CI runs, although it's pretty well hidden:
- Go to github actions and find the latest "Test and Deploy" workflow for 'main' branch. For example, https://github.com/neondatabase/neon/actions/runs/2903113533. (Note that this is not the Benchmarking workflow; that's different).
- Scroll down to Summaries, and expand "benchmarks (release, 1.58) summary"
- Download the python-test-performance-Linux-release-1.58-logs.tar.zst file
- Unarchive it; the flamegraph is in ./test_pgbench_flamegraph[45-10]/repo/flamegraph.svg
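From the command line the last two steps look roughly like this (assuming the artifact has already been downloaded, and that your tar has zstd support):

```bash
# unpack the logs artifact and open the flamegraph
tar --zstd -xf python-test-performance-Linux-release-1.58-logs.tar.zst
xdg-open './test_pgbench_flamegraph[45-10]/repo/flamegraph.svg'
```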
This flamegraph also shows "tracing" taking 1.40% of the overall time. That seems roughly similar to what @MMeent observed, when you account for the on-CPU vs wallclock nature.
That flamegraph we collect as part of the CI is for pgbench, including initialization but also running the actual test twice for 45 seconds, in simple-update mode and in select-only mode. So it's not the same workload as just the initialization step. We could easily add a flamegraph of just the initialization step there.
You can find this flamegraph from our CI runs, although it's pretty well hidden:
@bayandin, ideas on how we could make that more prominent? Can we add it to the allure report somehow?
How and where were these results produced?
These are pulled from your SVG file, by selecting handle_get_rel_page_at_lsn and then searching for ^.*ImageLayer.*get_value_reconstruct_data, tracing, ^.*LayerMap.*search and ^.*DeltaLayer.*get_value_reconstruct_data respectively; then noting down the percentage listed in the bottom right corner (which is the relative percentage of your search result within the current selection).
So these %s are %-time-spent-in-handle_get_rel_page_at_lsn, not overall runtime.
The 20.17% for the redo handler process was global, not within that selection.
This flamegraph also shows "tracing" taking 1.40% of the overall time. That seems roughly similar to what @MMeent observed, when you account for the on-CPU vs wallclock nature.
Sorry, but how can 19% of 36% be the same as 1.4%?
@bayandin, ideas on how we could make that more prominent? Can we add it to the allure report somehow?
Sure! I haven't added Allure reports to the performance tests yet. I will work on it this week and will make sure the flamegraphs are attached.
I'm getting 8.4% tracing from the flamegraph that you submitted here
Sorry, but how can 19% of 36% be the same as 1.4%?
That's the difference between perf flamegraphs, which usually count instructions, vs. the benchmark flamegraphs, which count wall-clock time spent (which accounts for memory stalls, IO ops, etc.).
As such, the shapes of the graphs are quite different between @knizhnik's graph and the one provided by @hlinnaka.
Meeting notes:
- use the captest env for tests -- @kelvich to provide instructions
- commit the LSN cache patch -- waits for tests fix -- @knizhnik
- polish the prefetch patch -- waits for a review -- @lubennikovaav
- try to load a few GB of pgbench init WAL into the pageserver and compare numbers -- @kelvich
- play with image layer materialization frequency -- @bojanserafimov
test:
time pgbench -i -s 100 'postgres://still-sunset-471531.cloud.neon.tech:5432/main'
time pgbench -i -s 100 localhost
I played around with this a bit today. Some observations:
pgbench roundtrips
pgbench -i uses COPY to initialize pgbench_accounts, which is the largest table. However, it uses single-row INSERT statements for all the smaller tables. With scale 100, the smaller tables contain 1100 rows in total.
If you run pgbench on your laptop, against a server in the cloud, that means 1100 roundtrips. If you're testing from Europe to a server that's in the US, the latency adds up. I'm seeing about 100ms latency from my home to the staging cluster. That accounts for 110 seconds of total time.
You can apply this patch: reduce-pgbench-init-roundtrips.patch.txt to fix that issue. It replaces the single-row INSERTs with just two statements that use generate_series() to generate the same data.
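For illustration, the two replacement statements presumably look something like this for scale 100 (a sketch based on pgbench's standard schema of one branch per scale unit and ten tellers per branch, not the literal patch contents; CONNSTR is a placeholder):

```bash
# populate the small tables with two statements instead of 1100 single-row INSERTs
psql "$CONNSTR" <<'SQL'
INSERT INTO pgbench_branches (bid, bbalance)
    SELECT bid, 0 FROM generate_series(1, 100) AS bid;
INSERT INTO pgbench_tellers (tid, bid, tbalance)
    SELECT tid, (tid - 1) / 10 + 1, 0 FROM generate_series(1, 1000) AS tid;
SQL
```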
logged vs unlogged tables
After fixing that, I compared initializing with unlogged tables vs logged. Unlogged tables in Neon are stored on compute-local disk, so that eliminates the safekeepers and pageservers from the picture.
$ pgbench --unlogged-tables -i -s100 postgres://[email protected]:5432/main
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 17.56 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 44.48 s (drop tables 0.11 s, create tables 0.47 s, client-side generate 18.41 s, vacuum 9.34 s, primary keys 16.15 s).
$ pgbench -i -s100 postgres://[email protected]:5432/main
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 24.80 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 171.29 s (drop tables 0.18 s, create tables 0.76 s, client-side generate 25.91 s, vacuum 41.54 s, primary keys 102.90 s).
So now there is not that much difference in the "client-side generate" phase, 25.91 s vs 18.41 s. That suggests that the main problem is not with the WAL digestion speed. I also looked at pg_backpressure_lsns()
while the COPY was running, and didn't see much lag.
Vacuum and index creation seem to be the slow-pokes now.
pgbench roundtrips
Sorry, maybe I missed something.
Yes, pgbench issues separate inserts to populate the small tables.
But all of them are performed in one transaction, aren't they?
If so, why do we need a separate roundtrip for each insert?
An insert just adds a record to the WAL; the WAL should be sent in 128KB chunks to the safekeepers and then to the pageserver. It should be almost the same as insert into ... select generate_series(...), which is slower than COPY but still fast enough.
@hlinnaka
You can apply this patch: reduce-pgbench-init-roundtrips.patch.txt to fix that issue.
Looks like this patch turns initGenerateDataClientSide into initGenerateDataServerSide. Am I correct? If so, it can be enabled without the patch by passing --init-steps dtGvp to pgbench directly. Capital G stands for server-side generate; the other letters match the default values. It helped me a lot to reduce total initialization time while running against a compute located in the US.
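For example (reusing the connection string from the test above):

```bash
# 'G' = server-side data generation; the other init steps are the defaults
# (drop tables, create tables, vacuum, primary keys)
time pgbench -i -s 100 --init-steps=dtGvp 'postgres://still-sunset-471531.cloud.neon.tech:5432/main'
```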
@knizhnik
If so, why do we need a separate roundtrip for each insert?
It is all about roundtrips between the client (pgbench) and the compute. Init is indeed performed in one transaction, but each query is sent separately with its own PQexec, which is blocking and waits for the query to complete. So there will be a separate roundtrip for each of the nbranches and ntellers rows. If the compute you're talking to is far away, the latency of these calls will take a significant part of the total initialization time.
Looks like this patch turns initGenerateDataClientSide into initGenerateDataServerSide. Am I correct?
Almost. Yes, it turns the initialization of the small tables to be equal to initGenerateDataServerSide. That's where I copy-pasted it from :-). But there's still a difference in how it initializes the big pgbench_accounts table. In initGenerateDataClientSide, it streams the data to the server with COPY. In initGenerateDataServerSide, it generates the data completely in the server.
I'm not sure if that difference is significant in our case. I think initGenerateDataServerSide might materialize the whole dataset in a temp file first, and load it into the table only after that, but I haven't double-checked that.
If so, it can be enabled without the patch by passing --init-steps dtGvp to pgbench directly. Capital G stands for server-side generate; the other letters match the default values. It helped me a lot to reduce total initialization time while running against a compute located in the US.
Yep, that's certainly easier than patching pgbench.
It is all about roundtrips between client and compute.
Sorry, that was a stupid comment from my side. Certainly, generating the data on the server side is the fastest solution. But it works mostly for synthetic data (like in pgbench). It is not applicable to real data coming from different data sources. In some cases COPY will help. But if the application logic requires issuing a separate SQL command for each row, then we have the same problem as with pgbench without the G init step.
The most obvious solution is to move the client to the same cloud too. But that is not always possible.
Another approach to reduce latency: support multiple regions.
The general solution for a slow network is to use pipeline mode: https://www.cybertec-postgresql.com/en/pipeline-mode-better-performance-on-slow-network/ but it requires changes to the application code.
That can be avoided by setting up a client-side proxy which simulates "pipeline mode" by immediately sending an OK response. I wonder if such proxies are already implemented and used by somebody...
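As a side note, pgbench itself can exercise pipeline mode in custom scripts (PostgreSQL 14+, extended query protocol required), which gives a feel for how much batching saves on a high-latency link. The script below is a hypothetical example, not part of the initialization workload:

```bash
cat > pipeline_demo.sql <<'EOF'
\set tid random(1, 10 * :scale)
\startpipeline
UPDATE pgbench_tellers SET tbalance = tbalance + 1 WHERE tid = :tid;
UPDATE pgbench_tellers SET tbalance = tbalance - 1 WHERE tid = :tid;
\endpipeline
EOF
# Both UPDATEs are sent before waiting for any result, so the batch costs a
# single roundtrip instead of two.
pgbench -n -M extended -f pipeline_demo.sql -t 100 "$CONNSTR"
```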
Vacuum and index creation seem to be the slow-pokes now.
It would be nice to try it with prefetch:
https://github.com/neondatabase/neon/pull/2369
Locally, on my laptop, I do not see any noticeable effect of prefetch on pgbench -i time.
It is a little bit strange that in my case the vacuum time is about two times larger than the index build time, while in your case the ratio is the opposite.
Please note that with the default parameters (wal_log_hints=on) and the last-written LSN cache, prefetch is almost useless, because frequent advances of last_written_lsn render prefetched pages unusable (their LSN is smaller than the requested one).
With wal_log_hints=off, the results of pgbench -i -s 100 on my laptop are the following:
| prefetch | pgbench -i -s 100 |
|---|---|
| off | done in 57.96 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 23.19 s, vacuum 20.95 s, primary keys 13.81 s) |
| on | done in 49.25 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 24.05 s, vacuum 15.09 s, primary keys 10.10 s) |
Another comparison:
Neon on staging environment:
~$ pgbench -i -s 500
dropping old tables...
creating tables...
generating data (client-side)...
50000000 of 50000000 tuples (100%) done (elapsed 193.72 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 598.78 s (drop tables 0.10 s, create tables 0.42 s, client-side generate 196.27 s, vacuum 24.22 s, primary keys 377.76 s).
On a different widely-used cloud provider, with smallest possible instance size:
~$ pgbench -i -s 500
dropping old tables...
creating tables...
generating data (client-side)...
50000000 of 50000000 tuples (100%) done (elapsed 176.07 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 211.84 s (drop tables 0.17 s, create tables 0.70 s, client-side generate 177.64 s, vacuum 0.87 s, primary keys 32.46 s).
Index creation is the elephant in the room here. And vacuum, to a lesser extent. I'd love to repeat this with the prefetching patch. Can we get that into staging with some feature flag, or to the cap test environment?