2021 Unidata Equipment Grant Omnibus
The IEM is :pray: thankful to have received an equipment grant from Unidata. The grant is purchasing a Dell R7525 with NVMe drives, whose capacity should let me rule the world :) The setup of the server will attempt to follow what Let's Encrypt did.
The proposal outlined a number of deliverables, so this issue is an omnibus tracking these items and more.
- [ ] Remove or greatly relax the software throttles on the METAR and SHEF downloads.
- [ ] Create more web services to expose the SHEF archives.
- [ ] Add code support for various web services within siphon.
- [ ] Provide CONUS scale aggregates of climodat/COOP archives.
For my reporting benefit, a timeline of how things have progressed thus far.
- 3 May 2021 - Unidata announced funding of proposal.
- 30 Jun 2021 - UCAR/ISU signed off on grant paperwork.
- 6 Jul 2021 - Workday tag created and ready for spending.
- 6 Jul 2021 - Dell purchase order submitted.
So this issue will also collect up random things to help my eventual delivery of a:
- [ ] blog post to Unidata.
The server was delivered late on 19 July 2021 and set up in its final resting place on the morning of 20 July 2021. RHEL 8.4 was installed on the root 500 GB RAID1 SSD. The first decision point is whether or not to run `kernel-ml` so as to support my legacy InfiniBand network. Let's try it and run a test to see if it is fast enough to matter: moving a 5 GB empty file via scp (a sketch of the test follows the table).
network | rate | time |
---|---|---|
ipoib | 327 MB/s | 15s |
iem1 | 111 MB/s | 46s |
iem0 | 112 MB/s | 46s |
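For reference, a minimal sketch of that copy test, with hypothetical file and host names; the IPoIB and 1GbE paths are reached via different host aliases:

```
# create a 5 GB file of zeros to push over each interface (file name is hypothetical)
truncate -s 5G /tmp/test5g.img

# copy over the IPoIB alias, then a 1GbE alias, and compare the reported rates
scp /tmp/test5g.img otherhost-ib:/tmp/
scp /tmp/test5g.img otherhost:/tmp/
```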
- [x] gonna stick with `kernel-ml` as the InfiniBand network is likely useful to keep around.
Just to establish some crude baselines:

```
mkfs.xfs /dev/nvme0n1
mount /dev/nvme0n1 /mnt/test
# time dd if=/dev/zero of=test.iso count=10M
10485760+0 records in
10485760+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 8.39105 s, 640 MB/s

real    0m8.392s
user    0m1.019s
sys     0m7.345s
```
```
hdparm -Tt /dev/nvme0n1

/dev/nvme0n1:
 Timing cached reads:   21058 MB in 2.00 seconds = 10544.23 MB/sec
 Timing buffered disk reads: 10270 MB in 3.00 seconds = 3423.18 MB/sec
```
- [x] Let's install PostgreSQL 13 to start benchmarking with it.
Discussion with the local ZFS expert SN (a command sketch follows this list):
- [x] You can probably set the logbias to throughput
- [x] primarycache=metadata, since postgres maintains a cache
- [x] xattr=sa (and probably acltype=posixacl), atime=off and then decide if you even want relatime
- [x] then for compression the default lz4 will probably serve you perfectly well
- [ ] blog about perf
- [x] adjust your ashift accordingly
- [x] directio support
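Roughly, those suggestions translate into the following dataset properties; the `tank/pg13data` dataset name is hypothetical, and `ashift` has to be chosen when the pool is created:

```
# dataset properties per the discussion above (dataset name is hypothetical)
zfs set logbias=throughput tank/pg13data
zfs set primarycache=metadata tank/pg13data
zfs set xattr=sa tank/pg13data
zfs set acltype=posixacl tank/pg13data
zfs set atime=off tank/pg13data
zfs set compression=lz4 tank/pg13data
# ashift is fixed at vdev creation time, e.g. "zpool create -o ashift=12 ..." for 4K-sector drives
```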
Yesterday was spent doing lots of ZFS tests. It seems like we can make it work out. Some more systematic tests today.

The pgbench commands; the connections and threads settings are not ideal, but I just want a baseline:
```
/usr/pgsql-13/bin/pgbench -i -s 50 example
/usr/pgsql-13/bin/pgbench -c 10 -j 8 -t 100000 example
```
Note that the baseline tests for the current machines running PostgreSQL in the IEM cluster were not isolated, but under routine load. The `afos q1` and `afos q2` tests were some basic queries that I want the database to perform well for.
zpool | label | pgb load [s] | pgb latency [ms] | pgb [tps] | afos q1 [s] | afos q2 [s] |
---|---|---|---|---|---|---|
N/A | metvm4 (v12) | 9.44 | 2.027 | 4,934 | - | - |
N/A | metvm6 (v12) | 6.81 | 0.810 | 12,349 | 0.003 | 85.707 |
N/A | metvm1 (v12) | 12.37 | 1.657 | 6,036 | - | - |
N/A | metvm7 (v13) | 15.40 | 2.626 | 3,808 | - | - |
N/A | laptop (v13) | 7.53 | 8.211 | 1,127 | - | - |
N/A | IRIS RHEV NVMe (v13) | - | 9.212 | 1,085 | - | - |
N/A | IRIS RHEV SSD (v13) | - | 12.768 | 783 | - | - |
N/A | XFS single | 5.80 | 0.514 | 19,471 | - | - |
N/A | XFS raid10 | 5.55 | 0.551 | 18,140 | - | - |
raidz2 | lz4_128K | 7.75 | 0.702 | 14,246 | 0.010 | 150.197 |
raidz2 | lz4_8K | 9.52 | 0.704 | 14,209 | 0.013 | 352.795 |
raidz2 | off_128K | 7.16 | 0.677 | 14,776 | 0.011 | 96.898 |
raidz2 | off_128K_metadata | 24.71 | 0.772 | 12,948 | 0.012 | (gave up) |
zmirror | off_128K | 6.37 | 0.566 | 17,663 | 0.003 | 79.678 |
zmirror | lz4_128K | 6.31 | 0.595 | 16,815 | 0.005 | 135.83 |
zmirror | lz4_64K | 6.46 | 0.577 | 17,339 | 0.004 | 143.89 |
zmirror | lz4_32K | 6.82 | 0.585 | 17,093 | 0.005 | 181.32 |
zmirror | off_8K | 7.94 | 0.634 | 15,780 | 0.004 | 268.01 |
I probably need to start moving this process along and get back to other work, so we are drawing lines in the sand with new decisions made:
- [x] ZFS is a viable option with performance as good as XFS out of the box. Yes, a tuned XFS + RAID10 may perform better, but compression is a requirement to make this project work.
- [x] zmirror should perform better for the most common workloads vs raidz2 and offer redundancy. I am not concerned with the drop in available storage space with this choice.
- [x] `recordsize=64K` seems to be a decent middle ground between throughput and tps. Will next try to tune PostgreSQL against that; a sketch of the chosen layout follows this list.
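A minimal sketch of that chosen layout, assuming a striped pair of mirrors and hypothetical device and dataset names:

```
# striped mirrors ("zmirror") rather than raidz2; ashift=12 assumes 4K-sector NVMe
zpool create -o ashift=12 tank \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1

# dataset for the PostgreSQL data directory with the settled-on properties
zfs create -o recordsize=64K -o compression=lz4 tank/pg13data
```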
In general, I am not attempting to squeeze 5-10% of performance out of this setup, but just get something that does not go up in flames under load. It would also be good to continue to move the needle to the right as additional choices are made, like cache settings.
We have standardized on `recordsize=64K` and `compression=lz4`. So now we iterate and rerun the tests above. Note that these are one-shot runs, so some of the numbers are noisy due to warm caches, etc. A summary sketch of the settings that stuck follows the table.
change | pgb load [s] | pgb latency [ms] | pgb [tps] | afos q1 [s] | afos q2 [s] | OK? |
---|---|---|---|---|---|---|
baseline | 6.46 | 0.577 | 17,339 | 0.004 | 143.89 | |
add zWAL recordsize=8K,compression=off | 7.02 | 0.607 | 16,470 | 0.010 | 147.91 | :heavy_minus_sign: |
set zWAL recordsize=64K | 7.28 | 0.605 | 16,541 | 0.001 | 142.46 | :zero: |
set zWAL recordsize=8K, set PG full_page_writes=off | 7.40 | 0.541 | 18,469 | 0.009 | 147.57 | :+1: |
set PG shared_buffers=16G | 7.12 | 0.543 | 18,428 | 0.010 | 168.73 | :-1: |
set PG shared_buffers=2G | 7.01 | 0.547 | 18,271 | 0.010 | 150.19 | :zero: |
set PG shared_buffers=4G | 7.10 | 0.548 | 18,236 | 0.009 | 149.24 | moving on |
set PG max_parallel_workers_per_gather=16 et al | 7.05 | 0.544 | 18,374 | 0.009 | 57.85 | :+1: |
set PG fsync=off for funzies | 6.59 | 0.426 | 23,479 | 0.010 | 55.14 | reverting |
set PG random_page_cost=0.4 | 7.03 | 0.553 | 18,090 | 0.010 | 57.64 | reverting for now |
set zfs logbias=throughput | 7.12 | 0.580 | 17,238 | 0.010 | 56.87 | reverting |
set zfs relatime=on | 6.89 | 0.557 | 17,953 | 0.002 | 58.31 | :+1: |
set zWAL primarycache=metadata | 7.33 | 0.548 | 18,239 | 0.002 | 58.88 | :+1: |
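Summarizing the rows above that stuck, the working knobs look roughly like the following; dataset names are hypothetical and the reverted settings stay at their defaults:

```
# ZFS: dedicated WAL dataset plus relatime on the pool
zfs set recordsize=8K tank/pg13wal
zfs set compression=off tank/pg13wal
zfs set primarycache=metadata tank/pg13wal
zfs set relatime=on tank

# PostgreSQL: applied via ALTER SYSTEM; shared_buffers needs a restart, the others a reload
psql -c "ALTER SYSTEM SET full_page_writes = off;"  # leaning on ZFS copy-on-write to avoid torn pages
psql -c "ALTER SYSTEM SET shared_buffers = '4GB';"
psql -c "ALTER SYSTEM SET max_parallel_workers_per_gather = 16;"
```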
So I am not getting much of anywhere at the moment. Perhaps it is good now to move the goalposts and run a more relevant pgbench setup, stepping back for a second.
```
$ /usr/pgsql-13/bin/pgbench -S -M prepared -t 100000 -c 32 -j 32 example
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: prepared
number of clients: 32
number of threads: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000
latency average = 0.066 ms
tps = 481409.762442 (including connections establishing)
tps = 481914.500144 (excluding connections establishing)
```
```
$ /usr/pgsql-13/bin/pgbench -M prepared -t 100000 -c 32 -j 32 example
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: prepared
number of clients: 32
number of threads: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000
latency average = 0.700 ms
tps = 45715.978777 (including connections establishing)
tps = 45720.701627 (excluding connections establishing)
```
```
zfs get all tank/pg13wal
zfs get all tank/pg13data_lz4_64K
```
After a colleague review, we now did:

- `zfs set relatime=off tank`
- gave 200 GB to the ZFS ARC (sketch below).
No change with the most recent benchmark numbers.
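For reference, a sketch of how a 200 GB ARC cap could be applied via the standard OpenZFS module parameter:

```
# persist across reboots: cap the ARC at 200 GiB (200 * 1024^3 = 214748364800 bytes)
echo "options zfs zfs_arc_max=214748364800" > /etc/modprobe.d/zfs.conf

# apply immediately without a reboot
echo 214748364800 > /sys/module/zfs/parameters/zfs_arc_max
```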
:rocket: `coop` production database now running on this host.
:rocket: `hads` production database is now on the new hardware and the compression savings are glorious: 2.1 TB -> 600 GB.
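A quick way to see that savings on ZFS, assuming a hypothetical dataset name for the hads database:

```
# logicalused is the uncompressed size; compressratio is the ratio between the two
zfs get used,logicalused,compressratio tank/hads
```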
Time passes and some depression sets in. I am sort of in no-man's land: awaiting PostgreSQL 14 to drop, wanting to rearrange the database ducks to align performance with the databases mentioned in the proposal, and needing to add the new services. The new server is performing great and without known issues, so that's good. I just don't get the warm fuzzies of being able to conquer the world with this thing.
Since I am conveniently lazy, I am going to drag my feet a bit longer and await the PostgreSQL 14 release due by the end of September.
PostgreSQL 14 is scheduled to be released on Sept 30.