Hetzner Cloud CCX33 instance machine details
@gunnarmorling could you perhaps post details on this machine? Hetzner's info does not include the specific CPU or memory configuration (incl. bandwidth etc.). It would be interesting for determining utilization etc.
Happy to, if I can. Any specific commands I should run whose output you'd like to see?
Not sure what is best on Linux; lscpu as a minimum, perhaps, and hwinfo or similar if possible.
If possible in a VM, I'd like to know the specific CPU core/arch (Zen 3?), cache configuration (L1/L2 etc.), frequency, and memory configuration (channels, clock, bandwidth). Also the supported ISA: AVX, AVX-512?
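For completeness, a hedged sketch of commands that should cover most of this on Linux. Tool availability varies by distro, and dmidecode (which needs root) typically reports little inside a VM:

```shell
# Hedged inspection starter kit; numactl and dmidecode may not be installed.
lscpu                                          # model, core/thread counts, caches, ISA flags
grep -m1 'model name' /proc/cpuinfo || true    # quick CPU identification (x86 naming)
command -v numactl >/dev/null && numactl --hardware || true             # NUMA layout, per-node memory
command -v dmidecode >/dev/null && sudo -n dmidecode -t memory || true  # DIMM channels/clock (bare metal only)
```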
It would be great if we could see the output of lscpu, it seems Hetzner uses a mix of Milan and Genoa processors for their "dedicated vCPUs" instances
Here it is. I.e. EPYC-Milan:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC-Milan Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 4792.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke rdpid fsrm
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Vulnerable: Safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
it seems Hetzner uses a mix of Milan and Genoa processors for their "dedicated vCPUs" instances
Ah, that's interesting, where did you get that info from? Might explain the much better numbers which @ebarlas reported after setting up his own CCX33 instance (in another, newer Hetzner DC).
@gunnarmorling https://www.hetzner.com/cloud in the dedicated vCPU tab:
Optimize your workload with AMD Milan EPYC™ 7003 and AMD Genoa EPYC™ 9654 processors.
Ah, I see. Seems there's no way to find out which one you'd get? Unless it's unique per DC. Kinda bizarre, definitely an interesting learning for me from this challenge :)
Which DC did you rent from?
It seems that Milan has no AVX-512 while Genoa does. Wikipedia claims that AVX-512 on EPYC is available on Zen 4 and later. Milan is Zen 3 and Genoa is Zen 4.
AVX-512 is officially supported by Zen 4, look e.g. here: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf (look for 'avx-512' mentions). Zen 3 doesn't support AVX-512, but only AVX2 and below.
Presence of AVX-512 will probably affect performance of all vectorized code (autovectorized and/or manually vectorized using Vector API from Project Panama).
If you have an AVX-512-capable CPU, you can measure the difference by running the JVM with -XX:UseAVX=2 to limit the AVX level used by the JVM (IIRC 1=AVX, 2=AVX2, 3=AVX-512).
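To make that concrete, here is a small hedged sketch; the flag-to-level mapping follows HotSpot's UseAVX convention, but verify it against your JDK version:

```shell
# Print the highest AVX level advertised in a CPU flags string, read from stdin,
# using HotSpot's -XX:UseAVX numbering: 1=AVX, 2=AVX2, 3=AVX-512.
max_avx_level() {
  case "$(cat)" in
    *avx512f*) echo 3 ;;
    *avx2*)    echo 2 ;;
    *avx*)     echo 1 ;;
    *)         echo 0 ;;
  esac
}

# On a live Linux box:
#   grep -m1 '^flags' /proc/cpuinfo | max_avx_level
# Then cap the JVM below the hardware level, e.g. force AVX2 on an AVX-512 machine:
#   java -XX:UseAVX=2 ...
```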
Ah, that's great insight, thx for sharing!
I'm a bit blocked right now with evaluations: it seems my instance got moved to a different host, as I'm observing substantially different (read: better) numbers as of today, making any new measurements incomparable with previous runs. I've opened a ticket with Hetzner to see what's going on, but I might have to look for a more reliable alternative.
So I am considering getting an AMD EPYC 7401P from the Hetzner Server Auction. That's Zen 1, i.e. I reckon slower per core, but then it has 24 cores :) Like Zen 3, it has AVX2. Numbers wouldn't be comparable of course, but once we've set up hyperfine, it shouldn't be a problem to run all entries again and update the leaderboard accordingly (apart from the overall absolute shift, there might be relative changes in case different contenders handle the increased core count differently).
My biggest question there is around administering the thing (e.g. how to disable turbo-boost and SMT, which would be a good idea), as I'm not super-savvy when it comes to that.
The alternative would be to re-run everything on the existing instance (which is much faster as of today, no idea why). But I don't feel very confident about it, not being sure whether there might not be a change in performance again. I've also asked the community for help, let's see what comes out of it.
Open for any help and suggestions of course :)
Zen 1 has very high penalties for inter-chiplet communication (actually even inter-CCX communication). Zen 2 brought the central IOD (IO die) and made the inter-chiplet communication much more robust and faster. If you're going for Zen, I would recommend at least Zen 2.
There's Zen 4 available in the form of the AX52 server with a Ryzen 7 7700: https://www.hetzner.com/dedicated-rootserver/matrix-ax . It has a single CCX, so it should be easy to tune multithreading for that chip. The server finder https://www.hetzner.com/dedicated-rootserver shows it as "available in few minutes", but the direct link to the search results (in that server finder) somehow doesn't work.
A dedicated machine is by far the most important thing here, with at least AVX2 support. I am not too worried about the cache hierarchy, given the highly parallelizable problem and all solutions processing chunks per processor. However, Zen 1 has some issues with certain SIMD/AVX2 instructions (high latencies etc.). Not sure any such instructions are or will be used here, though, given the simple usage: load, cmp, movemask, lzcnt, etc.
Disks are not important given the entire file is cached in memory. The more cores, the less difference in efficiency, probably; it's more limited by memory bandwidth/cache.
I don't know how good Java's AVX-512 support is, but I would not see it as a requirement. It's also harder for most people to test locally, since many don't have dev machines with it; I don't, for example.
There's Zen 4 available in the form of AX52 server with Ryzen 7 7700
@shipilev recommended to use EPYC rather than Ryzen; the reasoning is a bit above my pay grade, though :) There's the AX161 with the EPYC 7502P (Zen 2), though it's a bit towards the pricier end. Oh, the options...
I would rather take a Zen 3 consumer dedicated machine than a server, simply because it better matches what developers have at their disposal; otherwise this can quickly become a race for who has access to certain machines. Of course, there already is a diverse set of CPUs out there.
Thermal throttling due to boosting is an issue on most modern CPUs anyway.
Better to invest in more rigorous and statistically sound benchmarking. In .NET we would always use BenchmarkDotNet, and forget about process start/stop.
My biggest question there is around administering the thing (e.g. how to disable turbo-boost and SMT, which would be a good idea), as I'm not super-savvy when it comes to that.
I have a script that sets my CPU to max non-turbo frequency:
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -u 3400MHz -d 3400MHz
Not sure if it would work on others' machines.
As for SMT: why disable it? If someone doesn't want it, then setting a CPU affinity mask would amount to the same. Details: https://linux.die.net/man/1/taskset
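To illustrate the affinity approach, a hedged sketch of restricting a run to one hardware thread per physical core via sysfs topology (the paths are the common Linux ones, but may differ on unusual kernels):

```shell
# Build a CPU list containing only the first hardware thread of each physical core,
# by reading each core's sibling list ("0,4" or "0-1" style) and keeping the first entry.
one_thread_per_core() {
  for f in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
    cut -d, -f1 "$f" | cut -d- -f1
  done | sort -un | paste -s -d, -
}

# e.g.: taskset -c "$(one_thread_per_core)" java CalculateAverage
```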
I don't know how good Java's AVX-512 support is, but I would not see it as a requirement. It's also harder for most people to test locally, since many don't have dev machines with it; I don't, for example.
As I said before, you can run the JVM with -XX:UseAVX=2 and avoid AVX-512-related surprises.
AVX-512 adds support for lane masking, and that could potentially allow more interesting programs. But, OTOH, Zen 4 has some bad AVX-512 instruction implementations. https://www.hwcooling.net/en/how-good-is-amds-avx-512-does-it-improve-zen-4-performance/ says:
On the other hand there are some instructions that perform much worse than what is necessitated by the use of 256-bit units and 256bit load/store. That is a case with Compress (vcompressd) operations and Scatter/Gather performance is also poor. Scatter/Gather is also not great on Intel, but to a lesser extent.
As for Ryzen vs Epyc choice: the Ryzen servers that Hetzner provides can have ECC enabled for RAM. The configuration page https://www.hetzner.com/dedicated-rootserver/ax52/configurator says:
Upgrade to ECC RAM €4.76 monthly
That should make the server reliable enough for lots of heavy workload hammering :) I'm not sure what improvements Epyc would bring on top of that.
My biggest question there is around administering the thing (e.g. how to disable turbo-boost and SMT, which would be a good idea), as I'm not super-savvy when it comes to that.
I think we shouldn't disable SMT. Some solutions scale perfectly well with hyperthreading (i.e. nearly 2x performance when run with 16 threads on an 8c/16t PC).
I think at the end of the contest, maybe the top 10 solutions could be selected and run again on a dedicated physical machine (no remote server), with turbo boost disabled and as few background processes running as possible.
I agree on not disabling SMT. SMT was used before and the workload scales fine on it.
The AX41-NVMe (Zen 2 3600) or AX52 (Zen 4 7700) both seem like good enough options.
If it's Zen 4, I'd then exclude AVX-512 usage, mainly due to how few devs have access to it.
Ok, so if we want to stick to Hetzner (which I'd prefer, so as to limit the search space somewhat), it seems the AX52 (AMD Ryzen™ 7 7700, Zen 4) would be the best fit. I'm just not sure whether turbo boost can be disabled in that setting? And I'm also not sure how much it would skew the results. CC @rschwietzke
I think at the end of the contest, maybe top 10 solutions are selected and run again on a dedicated physical machine (no remote server), with turbo boost disabled, and as few as possible background processes running.
Ha yeah, would love that. Just would need to get my hands on one :)
I think turbo boost should be left on. Thermal throttling is part of the game, and rigorous benchmarking will show it, e.g. as high variation in results. It's a dedicated DC server; it should have consistent and reliable cooling, with little fluctuation. Handle it with better benchmarking. Make it part of the challenge.
If someone can make a single-threaded solution at the highest single-core boost clock that is faster than a multi-threaded solution running all hardware threads at a lower boost clock, then that's fair game.
I would say off, because: it is not that much of a boost for "desktop" CPUs, and it might turn the execution order into a factor, as well as the CPU usage. Turbo often works only when just a few cores are active.
Cloud machines (which are the deployment norm) don't have turbo modes at all.
And yes, we can turn that off for AMD (do that for my notebook sometimes).
Yeah, the motivation for turning it off would be better comparability between different contenders; in particular, you'd want to avoid a subsequent run suffering from throttling caused by a previous run. I suppose one could kinda get on top of it by pausing in between runs, but that's more voodoo than anything else.
And yes, we can turn that off for AMD (do that for my notebook sometimes).
Do you do that in the BIOS or at OS level? Because I reckon the former isn't available with a Hetzner dedicated host (if only one could try it out before committing to it...).
OS level. SYSCTL settings.
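For the record, the knob I'm aware of on Linux with the acpi-cpufreq driver (common on AMD) is a sysfs file rather than sysctl. A hedged sketch, with the path parameterized so it can be adapted (intel_pstate systems use a different file, no_turbo):

```shell
# Write 0/1 to the cpufreq "boost" knob, if present and writable (requires root).
set_boost() {  # usage: set_boost <0|1> [knob-path]
  knob="${2:-/sys/devices/system/cpu/cpufreq/boost}"
  if [ -w "$knob" ]; then
    echo "$1" > "$knob"
  else
    echo "boost knob $knob not writable (need root, or a different cpufreq driver)" >&2
    return 1
  fi
}
```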
Alrighty, after conferring some more with @rschwietzke and @shipilev, I'm gonna set up an AX161 (AMD EPYC™ 7502P, 32-core Rome (Zen 2)). It's the same as the original one in terms of ISA (i.e. no AVX-512), which is also nice. I'll run on 8 cores, as before.
I'm gonna close this one. We've moved to the aforementioned AX161 instance, and the leaderboard has been updated to reflect this move.
(..) I'll run on 8 cores, as before.
That means 1 thread on each of the 8 cores? Zen has 2 threads per core. Is SMT disabled now?
The original 8-vCPU cloud machine probably had just 4 cores with 2 vCPUs per core = 8 vCPUs total. The lscpu output for the previous cloud machine said:
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC-Milan Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 4792.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke rdpid fsrm
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
It says 4 cores per socket and 1 socket, so 4 physical cores total, but 2 threads per core = 8 threads total. Also, the 4 instances of L1 and L2 cache confirm that there were just 4 cores.
Yes, SMT is disabled, and we run on eight cores out of the 32 via numactl. This is the lscpu output of the new machine:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-31
Off-line CPU(s) list: 32-63
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7502P 32-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
Stepping: 0
Frequency boost: disabled
CPU max MHz: 2500.0000
CPU min MHz: 0.0000
BogoMIPS: 4990.70
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 1 MiB (32 instances)
L1i: 1 MiB (32 instances)
L2: 16 MiB (32 instances)
L3: 128 MiB (8 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; untrained return thunk; SMT disabled
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
I am planning to run the top 5 or so on all 32 cores (64 threads with SMT) towards the end of the challenge, so as to see how far we can push it below 1 sec :)
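For reference, a hedged example of how the eight-core runs might be launched on such a box ('./calculate_average.sh' is a placeholder name for the entry being timed):

```shell
# Pin the benchmark to eight physical cores and their node-local memory;
# falls through gracefully where numactl isn't installed.
command -v numactl >/dev/null \
  && numactl --physcpubind=0-7 --membind=0 ./calculate_average.sh \
  || echo "numactl not available here" >&2
```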