rays
Confusion about (un)optimization for all langs
(This issue is also related to @tkalbitz's PR #13)
In bc029c5, @kid0m4n changed the lazy computation for bounce. I think this project is not a serious optimization contest, so this change itself is okay.
But should we write stricter code? I'm confused about the following things:
- This unoptimization seems to contradict the "Why optimize the base algorithm?" section in readme.md
- This principle of (not) ''making it easy for "not so good" compilers to play catch up'' leads to further unoptimizations
- For example: a series of multiplications (`p33`) to `Math.Pow(p, 99)`
Example: Go's case

| | Old (lazy) | New (bc029c5) |
|---|---|---|
| Go | 17.93 s (100.0%) | 18.78 s (104.7%) |
| Go (Math.Pow) | 19.83 s (110.6%) | 21.00 s (117.1%) |
Thanks for bringing this up. I debated with myself yesterday about this particular commit. Let me try to explain the thought process:
- The project started off initially as a good way to learn idiomatic Go
- I shifted my focus to the performance aspect when I saw that there was a large delta between "unoptimized" Go and C++ (that is when I created the post in golang-nuts)
- A lot of optimizations were done (and documented in the first blog post) as a way to see where Go lacked and what measures could be taken to close the gap as much as possible
The direction of the project "rays" has definitely shifted now. In my mind, "it's about seeing how well a program can perform in a given language/compiler/design (l/c/d) combination whilst still keeping the code as close to real-world as possible."
So, there are two kinds of optimizations in my mind:
- The first version (C++) of the code scanned through the entire ART and incurred a huge cost in computation time whilst accomplishing nothing; this looked like a broken algorithm design, hence I fixed it by creating an objects array
- In Go, replacing math.Pow(x, 99) with a hand-optimized multiplication tree to get 5% extra performance
I still believe in retaining the first one, but just as I reversed the micro-opt with https://github.com/kid0m4n/rays/commit/bc029c5 yesterday, I want to bring all the implementations up to a stage where we do not avoid things like math.Pow(). Instead, we give the compiler scope to do the right thing for us.
Then "rays" essentially becomes a good test bed for seeing how much we can extract from an l/c/d combination without doing benchmark-specific optimization, by letting the compiler do its thing as much as possible.
That being said, SSE in C++ is not something we need to avoid; in fact, it's a USP of the language itself that it allows us to go from 12.7 s to 9.4 s while still writing C++. SSE is not the same as replacing math.Pow() in my mind.
I want to know what you think about this, though.
I would like to see 2 versions of the code for every language:
- "mainline" version
  - Standard, platform independent
  - Only algorithm/calculation-level optimization is allowed
- "hacked" version
  - Non-standard, deeply platform/language dependent
  - Any kind of optimization is allowed
  - But every single line must be written in the target language
    - e.g. for C++, intrinsics are allowed, but inline assembly is prohibited
"mainline" shows the idiomatic way; good for language tourists. "hacked" shows the back streets of the language: tourists should not walk there, but locals enjoy the secret side of the language.
Some reasons:
I think there are 4 ranks of goodness:
1. Standard, platform-independent, straightforward code
   - Math.Pow(), Math.rand
2. Algorithm/calculation-level optimization
   - Pseudo lazy evaluation (algorithm)
   - Replacing division with multiplication by the reciprocal (calculation)
3. Non-standard, platform-dependent, deeply language-dependent code
   - p33, rnd() (non-standard)
   - SSE vectors (platform/runtime-environment dependent)
   - PR #13 (deeply language dependent)
   - Commonly used external libraries (e.g. PCRE)
4. Out of the target
   - Special-purpose external libraries
   - Another language (inline asm)
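The reciprocal trick in rank 2 can be sketched as follows in Go. The function names and the vector type are illustrative, not the project's actual code; the point is trading three divisions for one division plus three multiplications, since floating-point division is typically much slower than multiplication.

```go
package main

import "fmt"

// scaleDiv divides each component of a 3-vector by d directly:
// three floating-point divisions.
func scaleDiv(v [3]float64, d float64) [3]float64 {
	return [3]float64{v[0] / d, v[1] / d, v[2] / d}
}

// scaleRecip computes the reciprocal once, then multiplies:
// one division plus three cheap multiplications.
func scaleRecip(v [3]float64, d float64) [3]float64 {
	r := 1.0 / d
	return [3]float64{v[0] * r, v[1] * r, v[2] * r}
}

func main() {
	v := [3]float64{1, 2, 3}
	fmt.Println(scaleDiv(v, 4))   // exact here, since 1/4 is representable
	fmt.Println(scaleRecip(v, 4)) // same values in this case
}
```

For divisors whose reciprocal is not exactly representable, the two forms can differ in the last bit, which is why this lands in rank 2 (an accepted calculation-level rewrite) rather than rank 1.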
I would like to see 1. and 2. in the mainline code. But I also want to see an 'insanely optimized' version using 3. The 'insane' version should not be allowed to merge into mainline, but as you have seen, these optimizations clearly show some of the remaining headroom and weaknesses.
More random thoughts:
- If we had an ideal compiler, auto-vectorization (e.g. SSE optimization) would be done by the compiler.
  - The same goes for clamping and the 2D RNG.
- Usually, imperative programming languages allow side effects, so a compiler/interpreter could not (and should not) perform lazy evaluation without special annotations.
  - e.g. some kind of "pure" function attribute.
- Process-wide GC is seriously bad.
- The RNG is not so good. LFSR variants are widely used for this purpose.
  - e.g. Xorshift, MT (Mersenne Twister)
  - Or use the standard library.
- Converting division into multiplication by the reciprocal should be allowed.
  - This conversion does not give identical results (e.g. on x87), but it is widely used.
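As a sketch of the kind of lightweight RNG mentioned above, here is a minimal xorshift generator in Go. The type name is illustrative; the shift constants (13, 7, 17) are the well-known xorshift64 triple from Marsaglia's paper, not something taken from this project's code.

```go
package main

import "fmt"

// xorshift64 is a tiny LFSR-style PRNG: three shift-and-xor steps
// per output. It is fast and stateless apart from one uint64,
// which makes it attractive for benchmark inner loops,
// but it is not cryptographically secure.
type xorshift64 struct {
	state uint64 // must be seeded to a nonzero value
}

func (x *xorshift64) next() uint64 {
	x.state ^= x.state << 13
	x.state ^= x.state >> 7
	x.state ^= x.state << 17
	return x.state
}

func main() {
	rng := xorshift64{state: 88172645463325252} // Marsaglia's example seed
	for i := 0; i < 3; i++ {
		fmt.Println(rng.next())
	}
}
```

The trade-off debated in this thread applies here too: hand-rolling an RNG like this belongs in a "hacked" version, while the mainline version would stick to the standard library's generator.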
+1 for @t-mat
There should be a clean vanilla version as the basis for a "dirty" optimized version.
+1 for a clean reference version, and a crazy all-out optimized version.