SDL icon indicating copy to clipboard operation
SDL copied to clipboard

improve DUFFS_LOOP_124

Open pionere opened this issue 2 years ago • 15 comments

  • use sar (shr?) instead of dec/sub
  • use sar (shr?) instead of (optimized) idiv
  • check against zero only at one place
  • use pixel_copy_increment4 only once

pionere avatar Dec 11 '22 10:12 pionere

What testing have you done to verify these changes and test the speedup?

slouken avatar Dec 11 '22 15:12 slouken

checked the generated code on godbolt.org

pionere avatar Dec 11 '22 16:12 pionere

How did you verify correctness, and what was your benchmark setup and results?

slouken avatar Dec 11 '22 16:12 slouken

what correctness? what benchmark setup? what results? which point are you questioning?

pionere avatar Dec 11 '22 18:12 pionere

All of them. Typically when we make changes to the blitting code for optimization, we run tests to make sure it yields the same results (testautomation is useful here, and there may already be a test that covers this), and we benchmark on a variety of hardware to make sure it really is a performance improvement across low and high end systems.

So I'm asking, have you done that, and if so, what were the results?

slouken avatar Dec 11 '22 19:12 slouken

You know what? Forget it.

@icculus : I would suggest that you reconsider who handles which PR. I'm a contributor, not your slave. I have much better things to do than to explain obvious things to folks. It should not be necessary to 'benchmark' that a signed division (even if it is a power of 2) is slower than a shift, or that less branching is more optimal, or that copying 50+ lines of code in a Duff's device is suboptimal. This is not the first time. Good luck.

pionere avatar Dec 11 '22 19:12 pionere

Let's all take a breath here. I haven't looked at this yet, but I'm going to reopen it so I remember to.

The blitting code is very old, and sometimes very delicate, so we ask a lot of questions whenever changes are made to it.

icculus avatar Dec 11 '22 21:12 icculus

it is.

pionere avatar Dec 11 '22 21:12 pionere

It doesn't seem so obvious. the signed by 8 division is probably optimized by the compiler. and using twice pixel_copy_increment4 would allow more prefetch

in fact, I just quickly tested with:

    Uint64    T1;
    Uint64    T2;
    Uint64    T;
    SDL_Surface *s1 = SDL_CreateSurface(20000, 20000, SDL_PIXELFORMAT_RGB565);
    SDL_Surface *s2 = SDL_CreateSurface(20000, 20000, SDL_PIXELFORMAT_RGB565);
    SDL_SetSurfaceBlendMode(s1, SDL_BLENDMODE_BLEND);
    SDL_SetSurfaceAlphaMod(s1, 30);
    SDL_Log("start");
    T1 = SDL_GetPerformanceCounter();                                                   
    SDL_BlitSurface(s1, NULL, s2, NULL);
    T2 = SDL_GetPerformanceCounter();                                                   
    T = (T2 - T1) / 1000000;
    SDL_Log("T: %7lu", T);
    SDL_Log("end");

the new version would do something between 145 and 150 on my machine. whereas the current version is between 138/ 144. not a big difference, but it doesn't run faster

(please @icculus don't fire @slouken he's doing a good work :))

1bsyl avatar Dec 12 '22 08:12 1bsyl

It doesn't seem so obvious.

which point?

the signed by 8 division is probably optimized by the compiler.

It definitely is not optimized more than a standard idiv 8. Have you even looked at the code? Why are you making unsubstantiated claims?

and using twice pixel_copy_increment4 would allow more prefetch

Well, on my system it is 314190 (new version) vs 303713 (current version). I think it actually proves my point that it is pointless to triplicate the code, because as soon as the application is doing something other than blitting, it could use the resources in a better way.

@slouken he's doing a good work :))

let's just agree to disagree here. He did not even read my replies, did not even answered my questions. So yes, just pat each other's shoulder, this is the way to go. But what to expect from someone who makes unsubstantiated claims?

pionere avatar Dec 12 '22 11:12 pionere

which point?

That the overall pr is an improvement

It definitely is not optimized more than a standard idiv 8. Have you even looked at the code? Why are you making unsubstantiated claims?

Didn't checked. In fact I don't care: the compiler can do it, the processor unit can do it. the pipeline can do it. Just check the final result gives to have real impact.

and using twice pixel_copy_increment4 would allow more prefetch

Well, on my system it is 314190 (new version) vs 303713 (current version). I think it actually proves my point that it is pointless to triplicate the code, because as soon as the application is doing something other than blitting, it could use the resources in a better way.

but the result is 200x slower! Is this debug mode ?

let's just agree to disagree here. He did not even read my replies, did not even answered my questions. So yes, just pat each other's shoulder, this is the way to go. But what to expect from someone who makes unsubstantiated claims?

There is no unsubstantiated claims here. There is an hypothesis and a check. I wrote the 5 lines you didn't care to write.

1bsyl avatar Dec 12 '22 11:12 1bsyl

which point?

That the overall pr is an improvement

Vague claim again.

Didn't checked. In fact I don't care: the compiler can do it, the processor unit can do it. the pipeline can do it. Just check the final result gives to have real impact.

Hahahahahahah. How could it do it? Unsubstantiated claim again.

but the result is 200x slower! Is this debug mode ?

Nope. I changed a code a bit (surface is 10000x10000), because I'm still using SDL2 and did not divide the result.

There is no unsubstantiated claims here. There is an hypothesis and a check.

Bullshit, vague claim again.

I wrote the 5 lines you didn't care to write.

Because figuring out those 5 lines of code would have taken half of my day. Sorry, but there are people who are not as familiar with SDL code....

pionere avatar Dec 12 '22 12:12 pionere

Nope. I changed a code a bit (surface is 10000x10000), because I'm still using SDL2 and did not divide the result.

Then this can give imprecise result. because the blit takes too long and your system may be interfering too much underneath. I think it is better to have a ~500 ms test you can just a ten a of time, than a long bench you run only once.

1bsyl avatar Dec 12 '22 12:12 1bsyl

run 10 * (10 * (10000;10000)). The average runtimes: new version:

  1. round 3138502
  2. round 3108472

current version:

  1. round 3084685
  2. round 3101392

no difference imho....

pionere avatar Dec 12 '22 14:12 pionere

not a big difference, but might be measurable. If you reduce the size, you reduce the noise.

1bsyl avatar Dec 12 '22 14:12 1bsyl

It looks like this didn't actually improve performance, so I went ahead and closed this.

slouken avatar Mar 10 '24 16:03 slouken