SDL
SDL copied to clipboard
improve DUFFS_LOOP_124
- use sar (shr?) instead of dec/sub
- use sar (shr?) instead of (optimized) idiv
- check against zero only at one place
- use pixel_copy_increment4 only once
What testing have you done to verify these changes and test the speedup?
checked the generated code on godbolt.org
How did you verify correctness, and what was your benchmark setup and results?
what correctness? what benchmark setup? what results? which point are you questioning?
All of them. Typically when we make changes to the blitting code for optimization, we run tests to make sure it yields the same results (testautomation is useful here, and there may already be a test that covers this), and we benchmark on a variety of hardware to make sure it really is a performance improvement across low and high end systems.
So I'm asking, have you done that, and if so, what were the results?
You know what? Forget it.
@icculus : I would suggest that you reconsider who handles which PR. I'm a contributor, not your slave. I have much better things to do than to explain obvious things to folks. It should not be necessary to 'benchmark' that a signed division (even if it is a power of 2) is slower than a shift, or that less branching is more optimal, or that copying 50+ lines of code in a Duff's device is suboptimal. This is not the first time. Good luck.
Let's all take a breath here. I haven't looked at this yet, but I'm going to reopen it so I remember to.
The blitting code is very old, and sometimes very delicate, so we ask a lot of questions whenever changes are made to it.
It doesn't seem so obvious. the signed by 8 division is probably optimized by the compiler. and using twice pixel_copy_increment4 would allow more prefetch
in fact, I just quickly tested with:
Uint64 T1;
Uint64 T2;
Uint64 T;
SDL_Surface *s1 = SDL_CreateSurface(20000, 20000, SDL_PIXELFORMAT_RGB565);
SDL_Surface *s2 = SDL_CreateSurface(20000, 20000, SDL_PIXELFORMAT_RGB565);
SDL_SetSurfaceBlendMode(s1, SDL_BLENDMODE_BLEND);
SDL_SetSurfaceAlphaMod(s1, 30);
SDL_Log("start");
T1 = SDL_GetPerformanceCounter();
SDL_BlitSurface(s1, NULL, s2, NULL);
T2 = SDL_GetPerformanceCounter();
T = (T2 - T1) / 1000000;
SDL_Log("T: %7lu", T);
SDL_Log("end");
the new version would do something between 145 and 150 on my machine. whereas the current version is between 138/ 144. not a big difference, but it doesn't run faster
(please @icculus don't fire @slouken he's doing a good work :))
It doesn't seem so obvious.
which point?
the signed by 8 division is probably optimized by the compiler.
It definitely is not optimized more than a standard idiv 8. Have you even looked at the code? Why are you making unsubstantiated claims?
and using twice pixel_copy_increment4 would allow more prefetch
Well, on my system it is 314190 (new version) vs 303713 (current version). I think it actually proves my point that it is pointless to triplicate the code, because as soon as the application is doing something other than blitting, it could use the resources in a better way.
@slouken he's doing a good work :))
let's just agree to disagree here. He did not even read my replies, did not even answered my questions. So yes, just pat each other's shoulder, this is the way to go. But what to expect from someone who makes unsubstantiated claims?
which point?
That the overall pr is an improvement
It definitely is not optimized more than a standard idiv 8. Have you even looked at the code? Why are you making unsubstantiated claims?
Didn't checked. In fact I don't care: the compiler can do it, the processor unit can do it. the pipeline can do it. Just check the final result gives to have real impact.
and using twice pixel_copy_increment4 would allow more prefetch
Well, on my system it is 314190 (new version) vs 303713 (current version). I think it actually proves my point that it is pointless to triplicate the code, because as soon as the application is doing something other than blitting, it could use the resources in a better way.
but the result is 200x slower! Is this debug mode ?
let's just agree to disagree here. He did not even read my replies, did not even answered my questions. So yes, just pat each other's shoulder, this is the way to go. But what to expect from someone who makes unsubstantiated claims?
There is no unsubstantiated claims here. There is an hypothesis and a check. I wrote the 5 lines you didn't care to write.
which point?
That the overall pr is an improvement
Vague claim again.
Didn't checked. In fact I don't care: the compiler can do it, the processor unit can do it. the pipeline can do it. Just check the final result gives to have real impact.
Hahahahahahah. How could it do it? Unsubstantiated claim again.
but the result is 200x slower! Is this debug mode ?
Nope. I changed a code a bit (surface is 10000x10000), because I'm still using SDL2 and did not divide the result.
There is no unsubstantiated claims here. There is an hypothesis and a check.
Bullshit, vague claim again.
I wrote the 5 lines you didn't care to write.
Because figuring out those 5 lines of code would have taken half of my day. Sorry, but there are people who are not as familiar with SDL code....
Nope. I changed a code a bit (surface is 10000x10000), because I'm still using SDL2 and did not divide the result.
Then this can give imprecise result. because the blit takes too long and your system may be interfering too much underneath. I think it is better to have a ~500 ms test you can just a ten a of time, than a long bench you run only once.
run 10 * (10 * (10000;10000)). The average runtimes: new version:
- round 3138502
- round 3108472
current version:
- round 3084685
- round 3101392
no difference imho....
not a big difference, but might be measurable. If you reduce the size, you reduce the noise.
It looks like this didn't actually improve performance, so I went ahead and closed this.