ecs_benchmark icon indicating copy to clipboard operation
ecs_benchmark copied to clipboard

Detailed and CSV Output +

Open ZeroErrors opened this issue 1 year ago • 5 comments
trafficstars

I ended up getting a little distracted and spent a bit more time on some other improvements 😄

Change Log

  • Replaced PRETTY_TIME_FMT with bench_result_format_t
    • Added --csv command line argument
    • Added --detailed command line argument
  • Improved output formatting
    • Removed (int)(36 - strlen(label)) to instead use padding with printf (eg. %36s)
    • Removed spacing(float v) and instead pad using printf (eg. %10.3f)
  • Replaced WARMUP_INTERVALS with WARMUP_INTERVAL by using it as the initial value for bench_t::interval, this reduces the number of branches in bench_next(bench_t*).
  • Removed double dt from bench_t and instead store ecs_time_t start, stop, total and changed bench_next(bench_t*) to start and stop for every sample, and accumulate the total for each sample to remove measurement overhead.
  • Renamed bench_t::intervals to bench_t::samples
  • Added do_not_optimize(uint32_t) This replaces usages of:
    printf("result = %u\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b", 
          (uint32_t)result);
    
    Also changed it to output to stderr instead of stdout so it doesn't interfere when you do ./ecs_benchmark --csv --detailed > results.csv
  • Use the warmup time to calculate an appropriate interval to help reduce the overhead on fast benchmarks

Output Format

Output now has 4 different options.

Human Readable (Default)

image

| Benchmark                             | Measurement  |
|---------------------------------------|--------------|
| baseline                              |      1.866ns |
| world_mini_fini                       |    699.110us |
| world_init_fini                       |      1.982ms |
| has_empty_entity                      |      2.218ns |
| has_id_not_found                      |      5.668ns |
| has_id                                |      6.543ns |
| has_16_ids                            |      6.787ns |
| get_empty_entity                      |      0.002ns |
| get_id_not_found                      |      0.002ns |
| get_id                                |      0.002ns |
| get_16_ids                            |      0.000ns |
| get_pair                              |      0.002ns |
| get_pair_16_targets                   |      0.000ns |
...
Human Readable Detailed

image

| Benchmark                             | Measurement  | Total (ns)   | Count          | Samples      |
|---------------------------------------|--------------|--------------|----------------|--------------|
| baseline                              |      1.844ns |   1000000100 |      542402100 |     10848042 |
| world_mini_fini                       |    717.889us |   1005044900 |           1400 |           28 |
| world_init_fini                       |      1.986ms |   1092249400 |            550 |           11 |
| has_empty_entity                      |      2.129ns |   1000031200 |      469650000 |         9393 |
| has_id_not_found                      |      5.622ns |   1000217700 |      177900000 |         3558 |
| has_id                                |      6.414ns |   1000257500 |      155950000 |         3119 |
| has_16_ids                            |      6.584ns |   1000778800 |      152000000 |          190 |
| get_empty_entity                      |      0.002ns |   1000000100 |   517687300000 |     10353746 |
| get_id_not_found                      |      0.002ns |   1000000100 |   572414600000 |     11448292 |
| get_id                                |      0.002ns |   1000000100 |   566892550000 |     11337851 |
| get_16_ids                            |      0.000ns |   1000000100 |  9136676000000 |     11420845 |
| get_pair                              |      0.002ns |   1000000100 |   526970050000 |     10539401 |
| get_pair_16_targets                   |      0.000ns |   1000000100 |  8464068800000 |     10580086 |
| get_inherited_depth_1                 |      0.002ns |   1000000100 |   572209650000 |     11444193 |
| get_inherited_depth_2                 |      0.002ns |   1000000100 |   571049300000 |     11420986 |
| get_inherited_depth_16                |      0.002ns |   1000000100 |   529902650000 |     10598053 |
...
CSV
Benchmark,Measurement (ns)
baseline,1.896159
world_mini_fini,716582.357143
world_init_fini,1856217.454545
has_empty_entity,2.176629
has_id_not_found,5.602006
has_id,6.436829
has_16_ids,6.596657
get_empty_entity,0.001925
get_id_not_found,0.001751
get_id,0.001756
...
CSV Detailed
Benchmark,Measurement (ns),Total (ns),Count,Samples
baseline,1.865273,1000000100,536114500,10722290
world_mini_fini,753041.111111,1016605500,1350,27
world_init_fini,1973357.090909,1085346400,550,11
has_empty_entity,2.130968,1000063100,469300000,9386
has_id_not_found,5.588571,1000074700,178950000,3579
has_id,6.418141,1000267300,155850000,3117
has_16_ids,6.551260,1001032600,152800000,191
get_empty_entity,0.001891,1000000100,528785450000,10575709
get_id_not_found,0.001781,1000000100,561582750000,11231655
get_id,0.001764,1000000100,566837900000,11336758
get_16_ids,0.000109,1000000100,9145068000000,11431335
get_pair,0.001926,1000001400,519112600000,10382252
...

ZeroErrors avatar Jan 07 '24 00:01 ZeroErrors

Hm, the example outputs don't look right:

| get_empty_entity                      |      0.002ns |
| get_id_not_found                      |      0.002ns |
| get_id                                |      0.002ns |
| get_16_ids                            |      0.000ns |
| get_pair                              |      0.002ns |
| get_pair_16_targets                   |      0.000ns |

These operations definitely take longer than 0.002 nanoseconds 😅

I'm a bit wary of changing the way things are measured. Do you have indications that this increases the stability of the measurements? On my hardware measurements were already pretty stable between runs.

SanderMertens avatar Jan 07 '24 20:01 SanderMertens

hmm, I'm getting this data pretty consistently but I'll do some more digging.

Edit: I found the cause of the large disparity, but will continue doing more testing.

Master: Running on master I get values like this.

| Benchmark                             | Measurement  |
|---------------------------------------|--------------|
| get_empty_entity                      | 0.12ns       |
| get_id_not_found                      | 0.14ns       |
| get_id                                | 7.27ns       |
| get_16_ids                            | 0.13ns       |
| get_pair                              | 0.32ns       |
| get_pair_16_targets                   | 1.71ns       |

Tweaked it to include wall time which is a timer measured from bench_begin to bench_end to get a bit more info. (Edit: this includes the warmup time so its not very accurate for benchmarks that take longer to run)

| Benchmark                             | Measurement  | dt (ns)        | Count          | Intervals      | Bench Count    | Wall Time      |
|---------------------------------------|--------------|----------------|----------------|----------------|----------------|----------------|
| get_empty_entity                      | 0.51ns       | 500000100.0000 |      971840192 |       12732540 |           1000 |      500013100 
| get_id_not_found                      | 0.41ns       | 500000100.0000 |     1224271008 |       13081186 |           1000 |      500000200 
| get_id                                | 0.12ns       | 500000100.0000 |     4190255600 |       12968707 |           1000 |      500000500 
| get_16_ids                            | 0.80ns       | 500000100.0000 |      625412864 |       12971583 |          16000 |      500000400 
| get_pair                              | 0.14ns       | 500000100.0000 |     3471257488 |       12696629 |           1000 |      500000300 
| get_pair_16_targets                   | 0.16ns       | 500000100.0000 |     3213355008 |       12899656 |          16000 |      500000400 

This PR: Running with the changes in this PR I get something like the following. This also is modified to add an additional column, wall time.

| Benchmark                             | Measurement  | Total (ns)   | Count          | Samples      | Interval     | Wall Time (ns) |
|---------------------------------------|--------------|--------------|----------------|--------------|--------------|----------------|
| get_empty_entity                      |      0.001ns |    506450800 |   450000000000 |            9 |     50000000 |      506452500 |
| get_id_not_found                      |      0.001ns |    504620300 |   450000000000 |            9 |     50000000 |      504622200 |
| get_id                                |      0.002ns |    550030300 |   350000000000 |            7 |     50000000 |      550031700 |
| get_16_ids                            |      0.000ns |    551502200 |  5600000000000 |            7 |     50000000 |      551503600 |
| get_pair                              |      0.001ns |    557665700 |   450000000000 |            9 |     50000000 |      557667600 |
| get_pair_16_targets                   |      0.000ns |    560586900 |  7200000000000 |            9 |     50000000 |      560588600 |



So looking at get_id since it has the largest variance. Master:

bench_count = 1000
measurement = 0.12ns
total = 500000100.0000ns
count = 4190255600
intervals = 12968707
wall_time = 500000500ns

count = intervals * MEASURE_INTERVAL * bench_count
count = 12968707 * 50 * 1000
count = 648435350000 != 4190255600 // WRONG: !!!!! >32-bit

measurement_wall_time = wall_time  / count
measurement_wall_time_fixed = 500000500ns / 648435350000 // Using the above calculated value
measurement_wall_time_fixed = 0.0007710876651ns
measurement_wall_time_wrong = 500000500ns / 4190255600   // Using the wrong value
measurement_wall_time_wrong = 0.119324582491ns == 0.12ns

measurement = total / count
measurement_fixed = 500000100.0000ns / 648435350000 // Using the above calculated value
measurement_fixed = 0.000771087048231ns 
measurement_wrong = 500000100.0000ns / 4190255600  // Using the wrong value
measurement_wrong = 0.119324487031ns == 0.12ns

This code seems to be the cause.

= b->dt / ((b->intervals - WARMUP_INTERVALS) * MEASURE_INTERVAL * b->count)

= double / ((uint32_t - int) * int * uint32_t)
= double / (uint32_t * int * uint32_t)
= double / uint32_t
= double

If I change it to cast the intervals to uint64_t eg. b->dt / (((uint64_t)b->intervals - WARMUP_INTERVALS) * MEASURE_INTERVAL * b->count) I get a measurement around 0.00076563ns instead, which is actually lower than what my changes show.. hmm...

This PR:

bench_count = 1000
measurement = 0.002ns
total = 546608600ns
count = 350000000000
wall_time = 546610100ns

count = samples * sample_interval * bench_count
count = 7 * 50000000 * 1000
count = 350000000000 // Expected

measurement_wall_time = wall_time  / count
measurement_wall_time = 546610100ns / 350000000000
measurement_wall_time = 0.00156174314286ns == 0.002ns // Expected

measurement = total / count
measurement = 546608600ns / 350000000000
measurement = 0.00156173885714ns == 0.002ns // Expected

ZeroErrors avatar Jan 08 '24 01:01 ZeroErrors

Any measurement that's below one nanosecond is very suspect. Try to cross-reference the results with the README, they should be roughly in the same ballpark.

SanderMertens avatar Jan 08 '24 02:01 SanderMertens

Just as a note, the machine I've been running these tests on is Windows and the timers only have a resolution of 100ns which is also visible in the results. Only way to reduce this impact is with a larger measure time with less samples. (I've started testing locally with a MEASURE_TIME of 2.0 to try and help with that).

Also the reason why the number is so low for me looks like compiler optimizations.

The benchmark function for get_id("get_id", 1); gets compiled to the following.

Disassembly
void get_id(const char *label, int32_t id_count) {
00007FF67FEC3450  mov         qword ptr [rsp+10h],rbx  
00007FF67FEC3455  mov         qword ptr [rsp+8],rcx  
00007FF67FEC345A  push        rbp  
00007FF67FEC345B  push        rsi  
00007FF67FEC345C  push        rdi  
00007FF67FEC345D  push        r12  
00007FF67FEC345F  push        r13  
00007FF67FEC3461  push        r14  
00007FF67FEC3463  push        r15  
00007FF67FEC3465  sub         rsp,0A0h  
00007FF67FEC346C  movsxd      r12,edx  
    ecs_world_t *world = ecs_mini();
00007FF67FEC346F  call        ecs_mini (07FF67FF6F4E0h)  
    ecs_entity_t *entities = create_ids(world, ENTITY_COUNT, 0, false);
00007FF67FEC3474  mov         ecx,1F40h  
    ecs_world_t *world = ecs_mini();
00007FF67FEC3479  mov         rsi,rax  
    ecs_entity_t *entities = create_ids(world, ENTITY_COUNT, 0, false);
00007FF67FEC347C  call        qword ptr [ecs_os_api+20h (07FF67FF98AB0h)]  
00007FF67FEC3482  xor         ebx,ebx  
00007FF67FEC3484  mov         r15d,3E8h  
00007FF67FEC348A  mov         r13,rax  
00007FF67FEC348D  nop         dword ptr [rax]  
00007FF67FEC3490  mov         rcx,rsi  
00007FF67FEC3493  call        ecs_new_id (07FF67FF6DAB0h)  
00007FF67FEC3498  mov         qword ptr [r13+rbx*8],rax  
00007FF67FEC349D  inc         rbx  
00007FF67FEC34A0  cmp         rbx,r15  
00007FF67FEC34A3  jl          get_id+40h (07FF67FEC3490h)  
    ecs_entity_t *ids = create_ids(world, id_count, 4, true);
00007FF67FEC34A5  test        r12d,r12d  
00007FF67FEC34A8  je          get_id+0C5h (07FF67FEC3515h)  
00007FF67FEC34AA  lea         ecx,[r12*8]  
00007FF67FEC34B2  call        qword ptr [ecs_os_api+20h (07FF67FF98AB0h)]  
00007FF67FEC34B8  mov         r14,rax  
00007FF67FEC34BB  test        r12d,r12d  
00007FF67FEC34BE  jle         get_id+0C8h (07FF67FEC3518h)  
00007FF67FEC34C0  mov         rbx,rax  
00007FF67FEC34C3  mov         edi,r12d  
00007FF67FEC34C6  mov         rcx,rsi  
00007FF67FEC34C9  call        ecs_new_low_id (07FF67FF6FAF0h)  
00007FF67FEC34CE  mov         qword ptr [rbx],rax  
00007FF67FEC34D1  lea         rcx,[rsp+0F0h]  
00007FF67FEC34D9  mov         r9d,8  
00007FF67FEC34DF  mov         qword ptr [rsp+20h],rcx  
00007FF67FEC34E4  mov         rdx,rax  
00007FF67FEC34E7  mov         dword ptr [rsp+0F0h],4  
00007FF67FEC34F2  mov         rcx,rsi  
00007FF67FEC34F5  mov         dword ptr [rsp+0F4h],4  
00007FF67FEC3500  lea         r8d,[r9-7]  
00007FF67FEC3504  call        ecs_set_id (07FF67FF71070h)  
00007FF67FEC3509  lea         rbx,[rbx+8]  
00007FF67FEC350D  sub         rdi,1  
00007FF67FEC3511  jne         get_id+76h (07FF67FEC34C6h)  
00007FF67FEC3513  jmp         get_id+0C8h (07FF67FEC3518h)  
00007FF67FEC3515  xor         r14d,r14d  
        for (int i = 0; i < id_count; i ++) {
00007FF67FEC3518  mov         rdi,r13  
00007FF67FEC351B  nop         dword ptr [rax+rax]  
00007FF67FEC3520  xor         ebx,ebx  
00007FF67FEC3522  test        r12d,r12d  
00007FF67FEC3525  jle         get_id+0F7h (07FF67FEC3547h)  
00007FF67FEC3527  nop         word ptr [rax+rax]  
            ecs_add_id(world, entities[e], ids[i]);
00007FF67FEC3530  mov         r8,qword ptr [r14+rbx*8]  
00007FF67FEC3534  mov         rcx,rsi  
00007FF67FEC3537  mov         rdx,qword ptr [rdi]  
00007FF67FEC353A  call        flecs_add_id (07FF67FF64930h)  
        for (int i = 0; i < id_count; i ++) {
00007FF67FEC353F  inc         rbx  
00007FF67FEC3542  cmp         rbx,r12  
00007FF67FEC3545  jl          get_id+0E0h (07FF67FEC3530h)  

    for (int e = 0; e < ENTITY_COUNT; e ++) {
00007FF67FEC3547  add         rdi,8  
00007FF67FEC354B  sub         r15,1  
00007FF67FEC354F  jne         get_id+0D0h (07FF67FEC3520h)  
        }
    }

    bench_t b = bench_begin(label, ENTITY_COUNT * id_count);
00007FF67FEC3551  mov         rax,qword ptr [label]  
00007FF67FEC3559  lea         rcx,[rsp+40h]  
00007FF67FEC355E  mov         qword ptr [rsp+58h],rax  
00007FF67FEC3563  xorps       xmm0,xmm0  
00007FF67FEC3566  imul        eax,r12d,3E8h  
00007FF67FEC356D  xorps       xmm1,xmm1  
00007FF67FEC3570  movdqu      xmmword ptr [rsp+38h],xmm0  
00007FF67FEC3576  mov         dword ptr [rsp+64h],r15d  
00007FF67FEC357B  movdqu      xmmword ptr [rsp+48h],xmm1  
00007FF67FEC3581  mov         qword ptr [rsp+30h],1F4h  
00007FF67FEC358A  mov         dword ptr [rsp+60h],eax  
00007FF67FEC358E  call        qword ptr [ecs_os_api+0D8h (07FF67FF98B68h)]  
00007FF67FEC3594  movups      xmm0,xmmword ptr [rsp+30h]  
00007FF67FEC3599  movups      xmm1,xmmword ptr [rsp+40h]  
00007FF67FEC359E  movups      xmmword ptr [b],xmm0  
00007FF67FEC35A3  movups      xmm0,xmmword ptr [rsp+50h]  
00007FF67FEC35A8  movups      xmmword ptr [rsp+78h],xmm1  
00007FF67FEC35AD  movsd       xmm1,mmword ptr [rsp+60h]  
00007FF67FEC35B3  movups      xmmword ptr [rsp+88h],xmm0  
00007FF67FEC35BB  movsd       mmword ptr [rsp+98h],xmm1  
    do {
        for (int e = 0; e < ENTITY_COUNT; e ++) {
            for (int i = 0; i < id_count; i ++) {
                ecs_get_id(world, entities[e], ids[i]);
            }
        }
    } while (bench_next(&b));
00007FF67FEC35C4  lea         rcx,[b]  
00007FF67FEC35C9  call        bench_next (07FF67FEC52F0h)  
00007FF67FEC35CE  test        al,al  
00007FF67FEC35D0  jne         get_id+174h (07FF67FEC35C4h)  
    bench_end(&b);
00007FF67FEC35D2  lea         rcx,[b]  
00007FF67FEC35D7  call        bench_end (07FF67FEC4FF0h)  

    ecs_fini(world);
00007FF67FEC35DC  mov         rcx,rsi  
00007FF67FEC35DF  call        ecs_fini (07FF67FF6FC10h)  
    ecs_os_free(entities);
00007FF67FEC35E4  mov         rcx,r13  
00007FF67FEC35E7  call        qword ptr [ecs_os_api+28h (07FF67FF98AB8h)]  
    ecs_os_free(ids);
00007FF67FEC35ED  mov         rcx,r14  
00007FF67FEC35F0  call        qword ptr [ecs_os_api+28h (07FF67FF98AB8h)]  
}

Ignoring all the init and fini code and focusing on the main loop we get.

    bench_t b = bench_begin(label, ENTITY_COUNT * id_count);
// Construct `bench_t`
00007FF67FEC3551  mov         rax,qword ptr [label]  
00007FF67FEC3559  lea         rcx,[rsp+40h]  
00007FF67FEC355E  mov         qword ptr [rsp+58h],rax  
00007FF67FEC3563  xorps       xmm0,xmm0  
00007FF67FEC3566  imul        eax,r12d,3E8h  
00007FF67FEC356D  xorps       xmm1,xmm1  
00007FF67FEC3570  movdqu      xmmword ptr [rsp+38h],xmm0  
00007FF67FEC3576  mov         dword ptr [rsp+64h],r15d  
00007FF67FEC357B  movdqu      xmmword ptr [rsp+48h],xmm1  
00007FF67FEC3581  mov         qword ptr [rsp+30h],1F4h  
00007FF67FEC358A  mov         dword ptr [rsp+60h],eax  
// start the timer
00007FF67FEC358E  call        qword ptr [ecs_os_api+0D8h (07FF67FF98B68h)]  

00007FF67FEC3594  movups      xmm0,xmmword ptr [rsp+30h]  
00007FF67FEC3599  movups      xmm1,xmmword ptr [rsp+40h]  
00007FF67FEC359E  movups      xmmword ptr [b],xmm0  
00007FF67FEC35A3  movups      xmm0,xmmword ptr [rsp+50h]  
00007FF67FEC35A8  movups      xmmword ptr [rsp+78h],xmm1  
00007FF67FEC35AD  movsd       xmm1,mmword ptr [rsp+60h]  
00007FF67FEC35B3  movups      xmmword ptr [rsp+88h],xmm0  
00007FF67FEC35BB  movsd       mmword ptr [rsp+98h],xmm1  
    do {
        for (int e = 0; e < ENTITY_COUNT; e ++) {
            for (int i = 0; i < id_count; i ++) {
                ecs_get_id(world, entities[e], ids[i]);
            }
        }
    } while (bench_next(&b));
// execute do {} while(bench_next())
00007FF67FEC35C4  lea         rcx,[b]  
00007FF67FEC35C9  call        bench_next (07FF67FEC52F0h)  
00007FF67FEC35CE  test        al,al  
00007FF67FEC35D0  jne         get_id+174h (07FF67FEC35C4h) // Jump to 00007FF67FEC35C4  

This means its effectively just benchmarking the call to bench_next.

If I modify the loop to store the result of ecs_get_id, the compiler no longer optimizes away the loop.

const void *results[ENTITY_COUNT];

bench_t b = bench_begin(label, ENTITY_COUNT * id_count);
do {
    for (int e = 0; e < ENTITY_COUNT; e ++) {
        for (int i = 0; i < id_count; i ++) {
            results[e] = ecs_get_id(world, entities[e], ids[i]);
        }
    }
} while (bench_next(&b));
bench_end(&b);

do_not_optimize((uint32_t)(uintptr_t)results[0]);

I end with results that seem reasonable.

| Benchmark                             | Measurement  | Total (ns)   | Count          | Samples      | Interval     |
|---------------------------------------|--------------|--------------|----------------|--------------|--------------|
| get_id                                |      6.723ns |   2359510100 |      350976000 |            6 |        58496 |     2359510900 |            800 |
| get_16_ids                            |      6.787ns |   2380920200 |      350784000 |            6 |         3654 |     2380921200 |           1000 |

ZeroErrors avatar Jan 08 '24 03:01 ZeroErrors

I pulled some of the fixes out into a separate PR which doesn't change how the results are measured. https://github.com/SanderMertens/ecs_benchmark/pull/3

ZeroErrors avatar Jan 08 '24 14:01 ZeroErrors