
New Server Setup

Open NateBrady23 opened this issue 1 year ago • 56 comments

Good morning, friends!

We are working through some issues with the new servers. Nothing serious, but it's required ordering some extra parts/cables and the delay will be a bit longer. I appreciate everyone's patience while we work through this. We're attempting to get the 40-gigabit fiber setup working, some power issues, and the SFP connectors don't fit in our current enclosure.

NateBrady23 avatar Feb 09 '24 17:02 NateBrady23

Hi!

Could you @NateBrady23 please share the specs of the new servers? My framework requires some manual tuning of its configuration for the best performance, and I'd like to do that upfront, if possible.

itrofimow avatar Feb 27 '24 15:02 itrofimow

Hi, it would be good to highlight the frameworks that perform well without any changes!! And that should count as an enhancement for any framework!!

@NateBrady23 please do the first run on the new servers with the commit of the last full run: [0ec8ed488ec87718eaee9ed05c0ffd51ca48113b](https://github.com/TechEmpower/FrameworkBenchmarks/tree/0ec8ed488ec87718eaee9ed05c0ffd51ca48113b)

And later we should show the latest run IDs from both sets of servers.

joanhey avatar Feb 27 '24 16:02 joanhey

:confused:
Please, we need more info.

We understand that you are busy, but please share some news!!

joanhey avatar Feb 27 '24 16:02 joanhey

And that should count as an enhancement for any framework!!

In general I agree, but I prefer to tune things for the extreme use-cases, and benchmarking is definitely one of such cases. Users of my framework (myself included) are fine with tuning it for their specific production workloads, and if what you maintain hits its best numbers for any workload possible without even a slight manual tuning -- that's a thing to be really proud of, I think.

please run the first run with the new servers, with the last full run commit

I second this.

itrofimow avatar Feb 27 '24 16:02 itrofimow

All machines are identical with these specs:

Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz, 56 logical cores, 1 socket, 1 NUMA node
64 GB RAM
40 Gbit/s network
960 GB SSD

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  56
  On-line CPU(s) list:   0-55
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  28
    Socket(s):           1
    Stepping:            6
    CPU max MHz:         3100.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fx
                         sr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts re
                         p_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx
                         est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_t
                         imer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single
                         ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase ts
                         c_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma
                          clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_
                         llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pt
                         s hwp hwp_act_window hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq av
                         x512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_ca
                         pabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   1.3 MiB (28 instances)
  L1i:                   896 KiB (28 instances)
  L2:                    35 MiB (28 instances)
  L3:                    42 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-55
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

SSD - 960GB

Network

       description: Ethernet interface
       product: MT28908 Family [ConnectX-6]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:10:00.0
       logical name: ens1f0np0
       version: 00
       capacity: 40Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical fibre 1000bt-fd 10000bt-fd 25000bt-fd 40000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.15.0-73-generic duplex=full firmware=20.33.1048 (MT_0000000594) ip=10.0.0.121 latency=0 link=yes multicast=yes port=fibre
       resources: irq:18 memory:b0000000-b1ffffff memory:b2000000-b20fffff
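For anyone who wants to capture the same snapshot of their own hardware, a report like the one above can be gathered with standard Linux tools (a sketch; `lshw` may need to be installed, and some commands require root):

```shell
# CPU topology, caches, flags, and vulnerability mitigations
lscpu

# Total memory
free -h

# Network interface details (vendor, driver, link capacity)
sudo lshw -class network

# Disk names, sizes, and models
lsblk -d -o NAME,SIZE,ROTA,MODEL
```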

sebastienros avatar Mar 01 '24 20:03 sebastienros

Mellanox!? Juicy!

franz1981 avatar Mar 01 '24 20:03 franz1981

Sounds great! The faster network won't help with the majority of the tests; only the cached queries and plaintext tests should see an improvement (and maybe fortunes, which was doing around 5 Gb/s of network traffic, if I am not mistaken). However, the doubling of the cores and the jump from the Skylake to the Ice Lake microarchitecture should help, since the latter should not require Spectre mitigations that are as harsh, I believe.

56 physical cores

It is actually 28 cores and 56 threads, visible from the lscpu output.

volyrique avatar Mar 04 '24 17:03 volyrique

It is actually 28 cores and 56 threads, visible from the lscpu output.

Right, my comment is wrong.

sebastienros avatar Mar 04 '24 17:03 sebastienros

Even for a corporation, this is a pretty big and unusual setup, especially the network part.

Only the SSD is a weird choice: a SATA drive for the database server? In 2024? Really?

synopse avatar Mar 04 '24 20:03 synopse

Thanks for providing the update @sebastienros! Sorry this setup is taking so long. It's been a matter of ordering things and people in the office at the right time to work on it. @msmith-techempower is doing some work with this today and I'm in on Thursday.

NateBrady23 avatar Mar 05 '24 16:03 NateBrady23

Just as a general update - I am really trying to get these up and working, but the going is slow given that I am not an IT professional by trade 😅. I know everyone, myself included, is anxious to get the continuous runs back up as soon as possible, and I don't want anyone thinking we are sitting on our hands.

msmith-techempower avatar Mar 13 '24 20:03 msmith-techempower

Another update - we have gotten the machines mostly spun up and verified (using iperf as a baseline) the 40Gbps connections over fiber. We are still trying to get each machine able to connect to the internet (which has been a slog, but I think the hardware for it should be arriving today), but once that is done we will start in on the software side of setup.
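For context, a baseline link check like the one described could look like this with iperf3 (a sketch, assuming iperf3 is installed on both hosts; the database server's address 10.0.0.121 is taken from the network listing above):

```shell
# On the receiving machine (e.g. the database server):
iperf3 -s

# On the sending machine (e.g. the app server), run a 30-second test
# with several parallel streams to try to saturate the 40 Gbit/s link:
iperf3 -c 10.0.0.121 -t 30 -P 8
```

A single TCP stream often can't fill a 40 Gbit/s pipe, which is why the parallel-stream flag matters here.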

Thank you to everyone for being so patient, but I am seeing light at the end of this tunnel and hope to have runs started back up soon.

msmith-techempower avatar Mar 22 '24 16:03 msmith-techempower

@NateBrady23 please run the first run with the new servers, with the last full run commit

I second this as I updated my benchmarks in the meantime and would love to see the impact independent from the hardware changes.

Looking forward to the new environment, keep up the good work!

Kaliumhexacyanoferrat avatar Mar 23 '24 17:03 Kaliumhexacyanoferrat

I get that you guys are just about across the finish line, but I recommend updating the announcement banner at the top of https://tfb-status.techempower.com/ anyway. It's a one-liner in your website's HTML (aside from publishing the change). It would reassure the thousands of people following the site, and regardless, "better late than never".

mkvalor avatar Mar 24 '24 13:03 mkvalor

@joanhey @Kaliumhexacyanoferrat Yes, the first real run from the new servers will be with the last full run's commit. Great idea.

Pinging @msmith-techempower ^

We got the "final" parts in on Friday evening at the office. Mike, give us hope for Monday or Tuesday! 🙏

NateBrady23 avatar Mar 24 '24 16:03 NateBrady23

Hardware install complete and "flash point" tested. Everything appears to be working correctly, and one of our major concerns appears to be okay (issue with power draw). Tomorrow, I'll be getting the software environments up and running and HOPEFULLY (not promising anything - yes, you Nate) get the parity commit run started. I am sure there will be more to fix/hone/etc. in the coming week or two, but we are slowly getting the new environment on its feet.

Again, thank you all for your continued patience!

msmith-techempower avatar Mar 26 '24 19:03 msmith-techempower

What version of Ubuntu are you using? 24.04 is almost there...

February 29, 2024 – Feature Freeze
March 21, 2024 – User Interface Freeze
April 4, 2024 – Ubuntu 24.04 Beta
April 11, 2024 – Kernel Freeze
April 25, 2024 – Ubuntu 24.04 LTS Released

sebastienros avatar Mar 26 '24 20:03 sebastienros

We're on 22.04 at the moment, but it may end up being prudent to move to 24.04 when it's released, since it's an LTS.

msmith-techempower avatar Mar 26 '24 20:03 msmith-techempower

Are you using the regular kernel or the Hardware Enablement (HWE) one, as I suggested here? Using the HWE kernel essentially eliminates the need to move to Ubuntu 24.04 (when it is out) until possibly early 2025 because it would be updated to the same release as the one that 24.04 is based on, and IMHO the differences due to other software components amount to a rounding error. The switch to the HWE is done with a simple command and a reboot.
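For reference, the switch to the HWE kernel stack on Ubuntu 22.04 looks like this (a sketch; verify the package name against Ubuntu's kernel lifecycle documentation for your release):

```shell
# Install the Hardware Enablement (HWE) kernel stack on Ubuntu 22.04
sudo apt update
sudo apt install --install-recommends linux-generic-hwe-22.04
sudo reboot

# After the reboot, confirm the newer kernel is running
uname -r
```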

volyrique avatar Mar 26 '24 20:03 volyrique

HWE

msmith-techempower avatar Mar 26 '24 20:03 msmith-techempower

HOWDY! Okay, I believe that we have a run started. So far, nothing seems out of the ordinary, so we will see how it plays out over the next few days.

In the meantime, please be aware that this is a first attempt, and there are sure to be issues that creep up. Please report those issues here, and we will trudge on!

Again, thank you for your continued patience!

msmith-techempower avatar Mar 27 '24 20:03 msmith-techempower

Same run with commit https://github.com/TechEmpower/FrameworkBenchmarks/tree/625684fcc442767af013de2dfd1fc90dd73f1744 That is the code and data in Round 22.

Old servers https://tfb-status.techempower.com/results/66d86090-b6d0-46b3-9752-5aa4913b2e33

New servers ~https://tfb-status.techempower.com/results/1aefa081-5641-4e7a-a712-e85c4bf3a4e1~ https://tfb-status.techempower.com/results/cdec9eaf-19ea-48d2-bfa4-df15afbe3236

joanhey avatar Mar 28 '24 01:03 joanhey

About the kernels: the latest Ubuntu 22.04.4 (February 2024) moved to kernel 6.5 (from 5.15): https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle We didn't see this change!!

The new Ubuntu 24.04 comes with kernel 6.8, and the next point release, Ubuntu 22.04.5, will also come with 6.8 (after 24.04).

Network-related: Linux 6.8 includes networking improvements that provide better cache efficiency. This is said to improve "TCP performances with many concurrent connections up to 40%" – a sizeable uplift, though to what degree most users will benefit is unclear.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3e7aeb78ab01

We want it, but we will verify it!!

joanhey avatar Mar 28 '24 02:03 joanhey

The current run is stuck!!

joanhey avatar Mar 28 '24 14:03 joanhey

Yes, the page has not been refreshed since yesterday: last updated 2024-03-27 at 4:02 PM https://tfb-status.techempower.com/

synopse avatar Mar 28 '24 17:03 synopse

Confirmed - I am looking into it now. Appears to have been a thermal issue on the primary machine. About 4 hours (I think) into the run the machine shut itself down.

msmith-techempower avatar Mar 28 '24 17:03 msmith-techempower

Ok things are back up and running and we're still monitoring.

Just so you guys know, all of us at TechEmpower get an email when the Citrine environment stops getting updates. You don't have to add to the thread or open issues when it crashes; it may happen a few more times. But we appreciate everyone's enthusiasm!

NateBrady23 avatar Mar 28 '24 18:03 NateBrady23

OKAY.

Little update. TechEmpower is located in a small office and we no longer have a dedicated server rack, so we bought a small rack with sound insulation (the hardware is very loud). But that resulted in the switch being too close to the app server... and the switch produces a TON of heat, which in turn tripped the heat sensor on the intake of the machine and fired off a safety shutdown.

I fiddled with a bunch of setups, but what seems to be working at the moment is powering down the switch and plugging in the fiber directly. So, App is connected to Database on 10.0.0.x, and App is connected to Client on 10.0.1.x. I tested this setup with iperf as I did with the switch and saw no appreciable difference in throughput, so I am hoping this is a fair way to test. VERY OPEN TO COMMENTS HERE!

Anyway, the current run has benchmarked a couple, I am monitoring temperature (among other stats) while it is running, and hopefully we will be okay moving forward.

msmith-techempower avatar Mar 28 '24 19:03 msmith-techempower

Have no fear, the continuous run is still going on and everything looks healthy! Just an issue with tfb-status receiving updates. Should be fixed shortly.

FYI: The parity run we're doing is with Round 22 https://tfb-status.techempower.com/results/66d86090-b6d0-46b3-9752-5aa4913b2e33

I'll be out early next week; when this run completes, it will automatically start a new run from the current state of the repo.

NateBrady23 avatar Mar 29 '24 17:03 NateBrady23

Impressive numbers!! We'll need some time to analyze them.

I think it would be good to create a Round 22N so regular visitors can see the difference. It will also make it easier to compare with Round 23.

joanhey avatar Mar 30 '24 12:03 joanhey