2021 Unidata Equipment Grant Omnibus
The IEM is :pray: thankful to have received an equipment grant from Unidata. The grant is purchasing a Dell R7525 with NVMe drives, whose capacity should let me rule the world :) The setup of the server will attempt to follow what Let's Encrypt did.
The proposal outlined a number of deliverables, so this issue is an omnibus tracking these items and more.
- [ ] Remove or greatly relax the software throttles on the METAR and SHEF downloads.
- [ ] Create more web services to expose the SHEF archives.
- [ ] Add code support for various web services within siphon.
- [ ] Provide CONUS scale aggregates of climodat/COOP archives.
For my reporting benefit, a timeline of how things have progressed thus far.
- 3 May 2021 - Unidata announced funding of proposal.
- 30 Jun 2021 - UCAR/ISU signed off on grant paperwork.
- 6 Jul 2021 - Workday tag created and ready for spending.
- 6 Jul 2021 - Dell purchase order submitted.
So this issue will also collect up random things to help my eventual delivery of a:
- [ ] blog post to Unidata.
The server was delivered late on 19 July 2021 and set up in its final resting place on the morning of 20 July 2021. RHEL 8.4 was installed on the root 500 GB RAID1 SSD. The first decision point is whether or not to run `kernel-ml` so as to support my legacy InfiniBand network. Let's try it and run a test to see if it is fast enough to matter: moving a 5 GB empty file via scp (a sketch of the test follows the table).
network | rate | time |
---|---|---|
ipoib | 327 MB/s | 15s |
iem1 | 111 MB/s | 46s |
iem0 | 112 MB/s | 46s |
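For reference, a minimal sketch of that copy test, with hypothetical file and host names; the IPoIB and 1GbE paths are reached via different host aliases:

```
# create a 5 GB file of zeros to push over each interface (file name is hypothetical)
truncate -s 5G /tmp/test5g.img

# copy over the IPoIB alias, then a 1GbE alias, and compare the reported rates
scp /tmp/test5g.img otherhost-ib:/tmp/
scp /tmp/test5g.img otherhost:/tmp/
```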
- [x] gonna stick with `kernel-ml` as the InfiniBand network is likely useful to keep around.
Just to establish some crude baselines:

```
mkfs.xfs /dev/nvme0n1
mount /dev/nvme0n1 /mnt/test
# time dd if=/dev/zero of=test.iso count=10M
10485760+0 records in
10485760+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 8.39105 s, 640 MB/s

real    0m8.392s
user    0m1.019s
sys     0m7.345s
```
```
hdparm -Tt /dev/nvme0n1

/dev/nvme0n1:
 Timing cached reads:   21058 MB in 2.00 seconds = 10544.23 MB/sec
 Timing buffered disk reads: 10270 MB in 3.00 seconds = 3423.18 MB/sec
```
- [x] Let's install PostgreSQL 13 to start benchmarking with it.
Discussion with the local ZFS expert SN (a command sketch follows this list):
- [x] You can probably set the logbias to throughput
- [x] primarycache=metadata, since postgres maintains a cache
- [x] xattr=sa (and probably acltype=posixacl), atime=off and then decide if you even want relatime
- [x] then for compression the default lz4 will probably serve you perfectly well
- [ ] blog about perf
- [x] adjust your ashift accordingly
- [x] directio support
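Roughly, those suggestions translate into the following dataset properties; the `tank/pg13data` dataset name is hypothetical, and `ashift` has to be chosen when the pool is created:

```
# dataset properties per the discussion above (dataset name is hypothetical)
zfs set logbias=throughput tank/pg13data
zfs set primarycache=metadata tank/pg13data
zfs set xattr=sa tank/pg13data
zfs set acltype=posixacl tank/pg13data
zfs set atime=off tank/pg13data
zfs set compression=lz4 tank/pg13data
# ashift is fixed at vdev creation time, e.g. "zpool create -o ashift=12 ..." for 4K-sector drives
```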
Yesterday was spent doing lots of ZFS tests. It seems like we can make it work out. Some more systematic tests today.

The pgbench commands; the connections and threads settings are not ideal, but I just want a baseline:
```
/usr/pgsql-13/bin/pgbench -i -s 50 example
/usr/pgsql-13/bin/pgbench -c 10 -j 8 -t 100000 example
```
Note that the baseline tests for the current machines running PostgreSQL in the IEM cluster were not isolated, but under routine load. The `afos q1` and `afos q2` tests were some basic queries that I want the database to perform well for.
zpool | label | pgb load [s] | pgb latency [ms] | pgb [tps] | afos q1 [s] | afos q2 [s] |
---|---|---|---|---|---|---|
N/A | metvm4 (v12) | 9.44 | 2.027 | 4,934 | - | - |
N/A | metvm6 (v12) | 6.81 | 0.810 | 12,349 | 0.003 | 85.707 |
N/A | metvm1 (v12) | 12.37 | 1.657 | 6,036 | - | - |
N/A | metvm7 (v13) | 15.40 | 2.626 | 3,808 | - | - |
N/A | laptop (v13) | 7.53 | 8.211 | 1,127 | - | - |
N/A | IRIS RHEV NVMe (v13) | - | 9.212 | 1,085 | - | - |
N/A | IRIS RHEV SSD (v13) | - | 12.768 | 783 | - | - |
N/A | XFS single | 5.80 | 0.514 | 19,471 | - | - |
N/A | XFS raid10 | 5.55 | 0.551 | 18,140 | - | - |
raidz2 | lz4_128K | 7.75 | 0.702 | 14,246 | 0.010 | 150.197 |
raidz2 | lz4_8K | 9.52 | 0.704 | 14,209 | 0.013 | 352.795 |
raidz2 | off_128K | 7.16 | 0.677 | 14,776 | 0.011 | 96.898 |
raidz2 | off_128K_metadata | 24.71 | 0.772 | 12,948 | 0.012 | (gave up) |
zmirror | off_128K | 6.37 | 0.566 | 17,663 | 0.003 | 79.678 |
zmirror | lz4_128K | 6.31 | 0.595 | 16,815 | 0.005 | 135.83 |
zmirror | lz4_64K | 6.46 | 0.577 | 17,339 | 0.004 | 143.89 |
zmirror | lz4_32K | 6.82 | 0.585 | 17,093 | 0.005 | 181.32 |
zmirror | off_8K | 7.94 | 0.634 | 15,780 | 0.004 | 268.01 |
I probably need to start moving this process along and get back to other work, so we are drawing lines in the sand with new decisions made:
- [x] ZFS is a viable option with performance as good as XFS out of the box. Yes, a tuned XFS + RAID10 may perform better, but compression is a requirement to make this project work.
- [x] zmirror should perform better for the most common workloads vs raidz2 and offer redundancy. I am not concerned with the drop in available storage space with this choice.
- [x] `recordsize=64K` seems to be a decent middle ground between throughput and tps. Will next try to tune PostgreSQL against that; a sketch of the chosen layout follows this list.
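A minimal sketch of that chosen layout, assuming a striped pair of mirrors and hypothetical device and dataset names:

```
# striped mirrors ("zmirror") rather than raidz2; ashift=12 assumes 4K-sector NVMe
zpool create -o ashift=12 tank \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1

# dataset for the PostgreSQL data directory with the settled-on properties
zfs create -o recordsize=64K -o compression=lz4 tank/pg13data
```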
In general, I am not attempting to squeeze 5-10% of performance out of this setup, but just get something that does not go up in flames under load. It would also be good to continue to move the needle to the right as additional choices are made, like cache settings.
We have standardized on `recordsize=64K` and `compression=lz4`. So now we iterate and rerun the tests above. Note that these are one-shot runs, so some of the numbers are noisy due to warm caches, etc. A summary sketch of the settings that stuck follows the table.
change | pgb load [s] | pgb latency [ms] | pgb [tps] | afos q1 [s] | afos q2 [s] | OK? |
---|---|---|---|---|---|---|
baseline | 6.46 | 0.577 | 17,339 | 0.004 | 143.89 | |
add zWAL recordsize=8K,compression=off | 7.02 | 0.607 | 16,470 | 0.010 | 147.91 | :heavy_minus_sign: |
set zWAL recordsize=64K | 7.28 | 0.605 | 16,541 | 0.001 | 142.46 | :zero: |
set zWAL recordsize=8K, set PG full_page_writes=off | 7.40 | 0.541 | 18,469 | 0.009 | 147.57 | :+1: |
set PG shared_buffers=16G | 7.12 | 0.543 | 18,428 | 0.010 | 168.73 | :-1: |
set PG shared_buffers=2G | 7.01 | 0.547 | 18,271 | 0.010 | 150.19 | :zero: |
set PG shared_buffers=4G | 7.10 | 0.548 | 18,236 | 0.009 | 149.24 | moving on |
set PG max_parallel_workers_per_gather=16 et al | 7.05 | 0.544 | 18,374 | 0.009 | 57.85 | :+1: |
set PG fsync=off for funzies | 6.59 | 0.426 | 23,479 | 0.010 | 55.14 | reverting |
set PG random_page_cost=0.4 | 7.03 | 0.553 | 18,090 | 0.010 | 57.64 | reverting for now |
set zfs logbias=throughput | 7.12 | 0.580 | 17,238 | 0.010 | 56.87 | reverting |
set zfs relatime=on | 6.89 | 0.557 | 17,953 | 0.002 | 58.31 | :+1: |
set zWAL primarycache=metadata | 7.33 | 0.548 | 18,239 | 0.002 | 58.88 | :+1: |
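Summarizing the rows above that stuck, the working knobs look roughly like the following; dataset names are hypothetical and the reverted settings stay at their defaults:

```
# ZFS: dedicated WAL dataset plus relatime on the pool
zfs set recordsize=8K tank/pg13wal
zfs set compression=off tank/pg13wal
zfs set primarycache=metadata tank/pg13wal
zfs set relatime=on tank

# PostgreSQL: applied via ALTER SYSTEM; shared_buffers needs a restart, the others a reload
psql -c "ALTER SYSTEM SET full_page_writes = off;"  # leaning on ZFS copy-on-write to avoid torn pages
psql -c "ALTER SYSTEM SET shared_buffers = '4GB';"
psql -c "ALTER SYSTEM SET max_parallel_workers_per_gather = 16;"
```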
So I am not getting much of anywhere at the moment. Perhaps it is good now to move the goalposts and run a more relevant pgbench setup, stepping back for a second.
```
$ /usr/pgsql-13/bin/pgbench -S -M prepared -t 100000 -c 32 -j 32 example
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: prepared
number of clients: 32
number of threads: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000
latency average = 0.066 ms
tps = 481409.762442 (including connections establishing)
tps = 481914.500144 (excluding connections establishing)
```
```
$ /usr/pgsql-13/bin/pgbench -M prepared -t 100000 -c 32 -j 32 example
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: prepared
number of clients: 32
number of threads: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000
latency average = 0.700 ms
tps = 45715.978777 (including connections establishing)
tps = 45720.701627 (excluding connections establishing)
```
```
zfs get all tank/pg13wal
zfs get all tank/pg13data_lz4_64K
```
After a colleague review, we now did:

- `zfs set relatime=off tank`
- gave 200 GB to the ZFS ARC (sketch below).
No change with the most recent benchmark numbers.
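For reference, a sketch of how a 200 GB ARC cap could be applied via the standard OpenZFS module parameter:

```
# persist across reboots: cap the ARC at 200 GiB (200 * 1024^3 = 214748364800 bytes)
echo "options zfs zfs_arc_max=214748364800" > /etc/modprobe.d/zfs.conf

# apply immediately without a reboot
echo 214748364800 > /sys/module/zfs/parameters/zfs_arc_max
```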
:rocket: `coop` production database now running on this host.
:rocket: `hads` production database is now on the new hardware and the compression savings are glorious: 2.1 TB -> 600 GB.
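A quick way to see that savings on ZFS, assuming a hypothetical dataset name for the hads database:

```
# logicalused is the uncompressed size; compressratio is the ratio between the two
zfs get used,logicalused,compressratio tank/hads
```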
Time passes and some depression sets in. I am sort of in no-man's land: awaiting PostgreSQL 14 to drop, wanting to rearrange the database ducks to align performance with the databases mentioned in the proposal, and needing to add the new services. The new server is performing great and without known issues, so that's good. I just don't get the warm fuzzies of being able to conquer the world with this thing.
Since I am conveniently lazy, I am going to drag my feet a bit longer and await the PostgreSQL 14 release due by the end of September.
PostgreSQL 14 is scheduled to be released on Sept 30.