cpu_features

Frequency scaling can hurt, even in same family

Open twirrim opened this issue 7 years ago • 8 comments

This looks like a fantastically useful library.

Just a point for consideration, given that one of the targets is to make it easier to write fast code: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

Note that same generation, same feature set can have wildly different performance depending on if it's "Silver", "Gold" or "Platinum", none of which is exposed by flags. Silver, for example, is de-tuned for AVX-512 (as someone puts it in the comments there).

You can parse that out of /proc/cpuinfo, if you're solely looking at Linux use cases, but this seems like something that would be ideal to find out via this library.

twirrim avatar Feb 07 '18 22:02 twirrim
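
For reference, the /proc/cpuinfo approach mentioned above could look roughly like the following. This is a minimal sketch, assuming a Linux x86 system where each logical CPU block carries a `model name` key; the helper names `model_name_from_cpuinfo` and `read_model_name` are made up for illustration and are not part of cpu_features.

```python
from typing import Optional


def model_name_from_cpuinfo(text: str) -> Optional[str]:
    """Return the first 'model name' value in /proc/cpuinfo-style text."""
    for line in text.splitlines():
        # Lines look like "model name\t: Intel(R) Xeon(R) ...".
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None  # key absent (e.g. some non-x86 kernels)


def read_model_name(path: str = "/proc/cpuinfo") -> Optional[str]:
    """Read the model name from the given file, or None if unavailable."""
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            return model_name_from_cpuinfo(f.read())
    except OSError:
        return None  # not Linux, or the file is not readable
```

The parsing is kept separate from the file read so it can be exercised on captured text from another machine.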

Regarding AVX-512 support in Intel processors using the Skylake Xeon microarchitecture (Intel Xeon Scalable, Xeon W, and Core X-Series processors - see https://github.com/jeffhammond/vpu-count for details), below are the three methods of querying this that I have identified.

I work for Intel, but please don't take that to mean this information is official or more authoritative than what somebody else could come up with. My implementation (the second method below) merely implements what is in public documentation. I have not evaluated the third method, because I work in userspace.

| Method | Pros | Cons | Code |
| --- | --- | --- | --- |
| Empirical measurement | Supports pre-production and off-roadmap SKUs | Relatively slow (1); sensitive to noise and oversubscription (2) | here (3) |
| Processor name from CPUID | Fast (4); ring 3; insensitive to noise and oversubscription | Only supports parts listed in Intel ARK | here |
| PIROM/SMBus | Fast (unverified) | Requires ring 0 | N/A |

Notes

  1. Approximately 350 microseconds, based upon my measurements on an Intel Xeon Scalable 8180 processor.
  2. If called from every hardware thread, the performance measurements used to determine how many VPUs are present are noisy and can give incorrect results some of the time. I have not tested in a virtualized environment, but it is possible that something similar would happen there.
  3. This is a transcription of the code provided in the Intel SDM.
  4. Approximately 2.2 microseconds, based upon my measurements on a Xeon Scalable 8180 processor. The time spent is dominated by acquiring the processor name string from CPUID, not the processing thereof.

jeffhammond avatar Feb 10 '18 16:02 jeffhammond

@twirrim thank you for reporting. Yes, it would indeed be super useful to have more fine-grained information about the performance of an implementation. We'd need to come up with a stable scheme to assess that performance. Maybe denormalized bits: avx_silver, avx_gold, avx_platinum? What do you guys think?

@jeffhammond Thank you very much for sharing your insights. I'll take some time to read everything carefully and I'll get back to you.

gchatelet avatar Feb 12 '18 08:02 gchatelet

@gchatelet Unfortunately, platinum/gold/silver/bronze isn't the right attribute for determining AVX-512 throughput on Skylake Xeon. The number of VPUs is fixed for platinum (2) and silver/bronze (1), but it varies within gold: the gold 6xxx SKUs have 2 and the gold 5xxx SKUs have 1, except for the 5122, which has 2. (Don't ask me why it is this way or if I like it.)

The attribute you'll want to support is "# of AVX-512 FMA units", which is a field in the ARK listing and correlates directly with AVX-512 performance characteristics.

jeffhammond avatar Feb 12 '18 13:02 jeffhammond
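
The metal-to-VPU rule described in the comment above can be written down as a small lookup on the processor name. This is only a sketch of that rule as stated in this thread: it assumes a brand string of the form `Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz`, covers only first-generation (Skylake) Xeon Scalable names, does not check the generation, and returns `None` rather than guessing for anything else (Xeon W, Core X-series). The function name `avx512_fma_units` is hypothetical.

```python
import re
from typing import Optional

# Matches e.g. "Xeon(R) Gold 6148"; a trailing letter suffix is ignored.
_BRAND_RE = re.compile(r"Xeon\(R\)\s+(Platinum|Gold|Silver|Bronze)\s+(\d{4})")


def avx512_fma_units(brand: str) -> Optional[int]:
    """Number of AVX-512 FMA units per core, per the rule in this thread."""
    m = _BRAND_RE.search(brand)
    if m is None:
        return None  # unrecognized name: do not guess
    tier, number = m.group(1), m.group(2)
    if tier == "Platinum":
        return 2
    if tier in ("Silver", "Bronze"):
        return 1
    # Gold: 6xxx has 2, 5xxx has 1, except the 5122, which has 2.
    if number.startswith("6") or number == "5122":
        return 2
    if number.startswith("5"):
        return 1
    return None
```

A name-based lookup like this has the limitation listed in the table earlier in the thread: it only works for parts whose names follow the published scheme.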

The number of FMA units also sounds very useful, but the original report seems to focus on the AVX-512 turbo clock.

Even non-complex 512-bit instructions (e.g. XOR) apparently cause heavy throttling on silver, especially if several cores are active.

I guess the ideal would be to publish those turbo curves?

jan-wassenberg avatar Feb 12 '18 13:02 jan-wassenberg

@jan-wassenberg Indeed, I got stuck on the VPU count aspect of e.g. Silver. Sorry about that. Giving access to the documented base and turbo frequencies for non-AVX, AVX, AVX-512 would be useful.

The tricky part is that this is a function of how many cores are active, so the implementation requires a 3D array of [SKU, SIMD-width, #active cores]. I have that information for every Skylake Xeon part, usually in a text-based format.

https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf contains the frequencies for Xeon Scalable but not in text format. I can't find the public documentation for Xeon W or Core X-series. I'll work on getting the text-based format docs to you for everything.

jeffhammond avatar Feb 12 '18 14:02 jeffhammond
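
To make the shape of that [SKU, SIMD-width, #active cores] table concrete, here is one possible layout. It is a sketch only: the SKU name `EXAMPLE-0000` and every frequency in it are placeholders invented for illustration, not values from any Intel document, and `max_turbo_mhz` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# TURBO_MHZ[sku][simd_level][n - 1] = max turbo (MHz) with n active cores.
# PLACEHOLDER data for a made-up 4-core part; NOT real frequencies.
TURBO_MHZ: Dict[str, Dict[str, List[int]]] = {
    "EXAMPLE-0000": {
        "non_avx": [3000, 3000, 2800, 2800],
        "avx2":    [2800, 2800, 2600, 2600],
        "avx512":  [2600, 2600, 2400, 2400],
    },
}


def max_turbo_mhz(
    sku: str,
    simd_level: str,
    active_cores: int,
    table: Dict[str, Dict[str, List[int]]] = TURBO_MHZ,
) -> Optional[int]:
    """Look up the turbo frequency, or None if any index is unknown."""
    per_level = table.get(sku)
    if per_level is None:
        return None
    per_core = per_level.get(simd_level)
    if per_core is None or not 1 <= active_cores <= len(per_core):
        return None
    return per_core[active_cores - 1]
```

Returning `None` for unknown SKUs keeps callers from mistaking a missing entry for a real frequency.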

@jeffhammond - were you able to dig up the AVX2 and AVX-512 frequencies for the W series?

travisdowns avatar Aug 16 '18 03:08 travisdowns

For anyone who stumbles across this thread looking for the AVX turbo frequencies for the Skylake-W chips (as I did), this AnandTech article is the only source I know of at the moment. Most of the W chips are in there, although a few are missing, such as the W-2104, which doesn't have TurboBoost (but does have lower AVX and AVX-512 speeds).

travisdowns avatar Oct 18 '18 05:10 travisdowns

Actually, you can detect the 2nd FMA unit in PIROM: avx512_2ndFMA (5.4.11.2 PFF: Processor Feature Flag), 70h, bit 0. See https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-scalable-datasheet-vol-1.pdf

aregm avatar Apr 25 '19 19:04 aregm
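
Given a PIROM dump obtained some other way (reading it needs ring 0 / SMBus access, as noted earlier in the thread), the check described above reduces to testing one bit. This sketch rests on two unverified assumptions: that "70h" is a byte offset into the dump, and that a set bit 0 means the second FMA unit is present. Consult the datasheet before relying on either. The function name is hypothetical.

```python
from typing import Optional

PFF_OFFSET = 0x70      # ASSUMPTION: "70h" is a byte offset into the PIROM dump
SECOND_FMA_BIT = 0     # bit 0, per the comment above


def has_second_avx512_fma(pirom: bytes) -> Optional[bool]:
    """Test the avx512_2ndFMA flag in a raw PIROM dump.

    Returns None if the dump is too short to contain the flag byte.
    ASSUMPTION: a set bit means the second FMA unit is present.
    """
    if len(pirom) <= PFF_OFFSET:
        return None
    return bool((pirom[PFF_OFFSET] >> SECOND_FMA_BIT) & 1)
```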