
Weak and strong scaling tests

Open · ignace-computing opened this issue 4 years ago • 1 comment

Hello.

Could you please comment on whether weak and strong scaling tests exist within CheapThreads.jl? Would it be useful to implement them? Of course, tests of this type depend strongly on the problem being simulated and on its implementation details. I think, however, that they could be a nice way for potential users to discover the quality and usefulness of CheapThreads.

Best,

PS: Sorry if this is not the right place for such questions.

PPS: For more info on weak and strong scaling, see this link for instance.

ignace-computing · May 12 '21 20:05

No one's written any tests, but you could try an example like the one from that link, using @batch to parallelize. Note that the results will be heavily problem dependent. E.g., if the operation is primarily memory bound, then scaling will be bad.

Depending on the CPU, as few as 1 core can utilize all the memory bandwidth, meaning memory accesses could sometimes be modeled as completely serial.

julia> memory_bandwidth(verbose=true, multithreading=false)
╔══╡ Single-threaded:
╠══╡ (4 threads)
╟─ COPY:  144299.4 MB/s
╟─ SCALE: 144522.2 MB/s
╟─ ADD:   128922.2 MB/s
╟─ TRIAD: 128925.4 MB/s
╟─────────────────────
║ Median: 136612.4 MB/s
╚═════════════════════
(median = 136612.4, minimum = 128922.2, maximum = 144522.2)

julia> memory_bandwidth(verbose=true, multithreading=true)
╔══╡ Multi-threaded:
╠══╡ (4 threads)
╟─ COPY:  144299.4 MB/s
╟─ SCALE: 144522.2 MB/s
╟─ ADD:   127744.3 MB/s
╟─ TRIAD: 128530.3 MB/s
╟─────────────────────
║ Median: 136414.9 MB/s
╚═════════════════════
(median = 136414.9, minimum = 127744.3, maximum = 144522.2)

julia> versioninfo()
Julia Version 1.7.0-DEV.1088
Commit 6cebd28e66* (2021-05-11 14:04 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

The M1 Mac shows no improvement from multithreading in this benchmark. But many other programs aren't constrained by memory bandwidth, and these will benefit from more cores.
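One way to read this result is through the usual scaling definitions and Amdahl's law. The symbols below are introduced here for illustration only and do not come from the thread: T(p) is the run time on p threads, and s is the fraction of the work that behaves serially (for a memory-bound kernel on this machine, s is close to 1).

```latex
% Strong scaling: problem size fixed, thread count p varies.
S(p) = \frac{T(1)}{T(p)}, \qquad E_{\text{strong}}(p) = \frac{S(p)}{p}

% Weak scaling: work per thread fixed, so total problem size grows with p.
% T_p(p) is the time for the p-times-larger problem on p threads.
E_{\text{weak}}(p) = \frac{T_1(1)}{T_p(p)}

% Amdahl's law: upper bound on strong-scaling speedup with serial fraction s.
S(p) \le \frac{1}{s + \dfrac{1 - s}{p}}, \qquad \lim_{s \to 1} S(p) = 1
```

With s near 1, the bound collapses to a speedup of about 1 regardless of p, which matches the nearly identical single-threaded and multi-threaded medians above.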

So, I'd suggest picking a problem of interest and trying CheapThreads.@batch and/or Threads.@threads, and observing how they scale.
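A minimal sketch of such an experiment, assuming a made-up compute-bound kernel; `kernel`, `serial!`, `threaded!`, and `best_time` are hypothetical names chosen for illustration, not part of any package. It uses only Base's `Threads.@threads` and compares one serial run against one threaded run at whatever thread count Julia was started with.

```julia
using Base.Threads: @threads, nthreads

# Hypothetical compute-bound kernel: many floating-point ops per element loaded.
function kernel(x)
    s = x
    for _ in 1:200
        s = sin(s) + 1.0
    end
    return s
end

# Serial reference implementation.
function serial!(y, x)
    for i in eachindex(x)
        y[i] = kernel(x[i])
    end
    return y
end

# Same loop, split across the available Julia threads.
function threaded!(y, x)
    @threads for i in eachindex(x)
        y[i] = kernel(x[i])
    end
    return y
end

# Best-of-`reps` wall time, after one warm-up call to exclude compilation.
function best_time(f!, y, x; reps = 5)
    f!(y, x)
    best = Inf
    for _ in 1:reps
        best = min(best, @elapsed f!(y, x))
    end
    return best
end

n = 200_000
x = rand(n)
y = similar(x)

t1 = best_time(serial!, y, x)    # serial time, T(1)
tp = best_time(threaded!, y, x)  # threaded time, T(p)
p = nthreads()
speedup = t1 / tp

println("threads = ", p,
        ", speedup = ", round(speedup, digits = 2),
        ", efficiency = ", round(speedup / p, digits = 2))
```

Because the thread count is fixed when Julia starts, a full strong-scaling curve would need the script rerun under different settings (for example `julia -t 1`, `-t 2`, `-t 4`). For a weak-scaling variant, one could time `threaded!` on an input of length `n * p` and compare it with `serial!` on length `n`. To compare the two macros mentioned above, the `@threads` loop could be swapped for `@batch` from CheapThreads (Polyester under the package's current name).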

chriselrod · May 12 '21 21:05