ecs_benchmark
ecs_benchmark copied to clipboard
Detailed and CSV Output +
I ended up getting a little distracted and spent a bit more time on some other improvements 😄
Change Log
- Replaced
PRETTY_TIME_FMTwithbench_result_format_t- Added
--csvcommand line argument - Added
--detailedcommand line argument
- Added
- Improved output formatting
- Removed
(int)(36 - strlen(label))to instead use padding with printf (eg.%36s) - Removed
spacing(float v)and instead pad using printf (eg.%10.3f)
- Removed
- Replaced
WARMUP_INTERVALSwithWARMUP_INTERVALby using it as the initial value forbench_t::interval, this reduces the number of branches inbench_next(bench_t*). - Removed
double dtfrombench_tand instead storeecs_time_t start, stop, totaland changedbench_next(bench_t*)to start and stop for every sample, and accumulate the total for each sample to remove measurement overhead. - Renamed
bench_t::intervalstobench_t::samples - Added
do_not_optimize(uint32_t)This replaces usages of:
Also changed it to output toprintf("result = %u\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b", (uint32_t)result);stderrinstead ofstdoutso it doesn't interfere when you do./ecs_benchmark --csv --detailed > results.csv - Use the warmup time to calculate an appropriate interval to help reduce the overhead on fast benchmarks
Output Format
Output now has 4 different options.
Human Readable (Default)
| Benchmark | Measurement |
|---------------------------------------|--------------|
| baseline | 1.866ns |
| world_mini_fini | 699.110us |
| world_init_fini | 1.982ms |
| has_empty_entity | 2.218ns |
| has_id_not_found | 5.668ns |
| has_id | 6.543ns |
| has_16_ids | 6.787ns |
| get_empty_entity | 0.002ns |
| get_id_not_found | 0.002ns |
| get_id | 0.002ns |
| get_16_ids | 0.000ns |
| get_pair | 0.002ns |
| get_pair_16_targets | 0.000ns |
...
Human Readable Detailed
| Benchmark | Measurement | Total (ns) | Count | Samples |
|---------------------------------------|--------------|--------------|----------------|--------------|
| baseline | 1.844ns | 1000000100 | 542402100 | 10848042 |
| world_mini_fini | 717.889us | 1005044900 | 1400 | 28 |
| world_init_fini | 1.986ms | 1092249400 | 550 | 11 |
| has_empty_entity | 2.129ns | 1000031200 | 469650000 | 9393 |
| has_id_not_found | 5.622ns | 1000217700 | 177900000 | 3558 |
| has_id | 6.414ns | 1000257500 | 155950000 | 3119 |
| has_16_ids | 6.584ns | 1000778800 | 152000000 | 190 |
| get_empty_entity | 0.002ns | 1000000100 | 517687300000 | 10353746 |
| get_id_not_found | 0.002ns | 1000000100 | 572414600000 | 11448292 |
| get_id | 0.002ns | 1000000100 | 566892550000 | 11337851 |
| get_16_ids | 0.000ns | 1000000100 | 9136676000000 | 11420845 |
| get_pair | 0.002ns | 1000000100 | 526970050000 | 10539401 |
| get_pair_16_targets | 0.000ns | 1000000100 | 8464068800000 | 10580086 |
| get_inherited_depth_1 | 0.002ns | 1000000100 | 572209650000 | 11444193 |
| get_inherited_depth_2 | 0.002ns | 1000000100 | 571049300000 | 11420986 |
| get_inherited_depth_16 | 0.002ns | 1000000100 | 529902650000 | 10598053 |
...
CSV
Benchmark,Measurement (ns)
baseline,1.896159
world_mini_fini,716582.357143
world_init_fini,1856217.454545
has_empty_entity,2.176629
has_id_not_found,5.602006
has_id,6.436829
has_16_ids,6.596657
get_empty_entity,0.001925
get_id_not_found,0.001751
get_id,0.001756
...
CSV Detailed
Benchmark,Measurement (ns),Total (ns),Count,Samples
baseline,1.865273,1000000100,536114500,10722290
world_mini_fini,753041.111111,1016605500,1350,27
world_init_fini,1973357.090909,1085346400,550,11
has_empty_entity,2.130968,1000063100,469300000,9386
has_id_not_found,5.588571,1000074700,178950000,3579
has_id,6.418141,1000267300,155850000,3117
has_16_ids,6.551260,1001032600,152800000,191
get_empty_entity,0.001891,1000000100,528785450000,10575709
get_id_not_found,0.001781,1000000100,561582750000,11231655
get_id,0.001764,1000000100,566837900000,11336758
get_16_ids,0.000109,1000000100,9145068000000,11431335
get_pair,0.001926,1000001400,519112600000,10382252
...
Hm, the example outputs don't look right:
| get_empty_entity | 0.002ns |
| get_id_not_found | 0.002ns |
| get_id | 0.002ns |
| get_16_ids | 0.000ns |
| get_pair | 0.002ns |
| get_pair_16_targets | 0.000ns |
These operations definitely take longer than 0.002 nanoseconds 😅
I'm a bit wary of changing the way things are measured. Do you have indications that this increases the stability of the measurements? On my hardware measurements were already pretty stable between runs.
hmm, I'm getting this data pretty consistently but I'll do some more digging.
Edit: I found the cause of the large disparity, but will continue doing more testing.
Master: Running on master I get values like this.
| Benchmark | Measurement |
|---------------------------------------|--------------|
| get_empty_entity | 0.12ns |
| get_id_not_found | 0.14ns |
| get_id | 7.27ns |
| get_16_ids | 0.13ns |
| get_pair | 0.32ns |
| get_pair_16_targets | 1.71ns |
Tweaked it to include wall time which is a timer measured from bench_begin to bench_end to get a bit more info. (Edit: this includes the warmup time so its not very accurate for benchmarks that take longer to run)
| Benchmark | Measurement | dt (ns) | Count | Intervals | Bench Count | Wall Time |
|---------------------------------------|--------------|----------------|----------------|----------------|----------------|----------------|
| get_empty_entity | 0.51ns | 500000100.0000 | 971840192 | 12732540 | 1000 | 500013100
| get_id_not_found | 0.41ns | 500000100.0000 | 1224271008 | 13081186 | 1000 | 500000200
| get_id | 0.12ns | 500000100.0000 | 4190255600 | 12968707 | 1000 | 500000500
| get_16_ids | 0.80ns | 500000100.0000 | 625412864 | 12971583 | 16000 | 500000400
| get_pair | 0.14ns | 500000100.0000 | 3471257488 | 12696629 | 1000 | 500000300
| get_pair_16_targets | 0.16ns | 500000100.0000 | 3213355008 | 12899656 | 16000 | 500000400
This PR: Running with the changes in this PR I get something like the following. This also is modified to add an additional column, wall time.
| Benchmark | Measurement | Total (ns) | Count | Samples | Interval | Wall Time (ns) |
|---------------------------------------|--------------|--------------|----------------|--------------|--------------|----------------|
| get_empty_entity | 0.001ns | 506450800 | 450000000000 | 9 | 50000000 | 506452500 |
| get_id_not_found | 0.001ns | 504620300 | 450000000000 | 9 | 50000000 | 504622200 |
| get_id | 0.002ns | 550030300 | 350000000000 | 7 | 50000000 | 550031700 |
| get_16_ids | 0.000ns | 551502200 | 5600000000000 | 7 | 50000000 | 551503600 |
| get_pair | 0.001ns | 557665700 | 450000000000 | 9 | 50000000 | 557667600 |
| get_pair_16_targets | 0.000ns | 560586900 | 7200000000000 | 9 | 50000000 | 560588600 |
So looking at get_id since it has the largest variance.
Master:
bench_count = 1000
measurement = 0.12ns
total = 500000100.0000ns
count = 4190255600
intervals = 12968707
wall_time = 500000500ns
count = intervals * MEASURE_INTERVAL * bench_count
count = 12968707 * 50 * 1000
count = 648435350000 != 4190255600 // WRONG: !!!!! >32-bit
measurement_wall_time = wall_time / count
measurement_wall_time_fixed = 500000500ns / 648435350000 // Using the above calculated value
measurement_wall_time_fixed = 0.0007710876651ns
measurement_wall_time_wrong = 500000500ns / 4190255600 // Using the wrong value
measurement_wall_time_wrong = 0.119324582491ns == 0.12ns
measurement = total / count
measurement_fixed = 500000100.0000ns / 648435350000 // Using the above calculated value
measurement_fixed = 0.000771087048231ns
measurement_wrong = 500000100.0000ns / 4190255600 // Using the wrong value
measurement_wrong = 0.119324487031ns == 0.12ns
This code seems to be the cause.
= b->dt / ((b->intervals - WARMUP_INTERVALS) * MEASURE_INTERVAL * b->count)
= double / ((uint32_t - int) * int * uint32_t)
= double / (uint32_t * int * uint32_t)
= double / uint32_t
= double
If I change it to cast the intervals to uint64_t eg. b->dt / (((uint64_t)b->intervals - WARMUP_INTERVALS) * MEASURE_INTERVAL * b->count) I get a measurement around 0.00076563ns instead, which is actually lower than what my changes show.. hmm...
This PR:
bench_count = 1000
measurement = 0.002ns
total = 546608600ns
count = 350000000000
wall_time = 546610100ns
count = samples * sample_interval * bench_count
count = 7 * 50000000 * 1000
count = 350000000000 // Expected
measurement_wall_time = wall_time / count
measurement_wall_time = 546610100ns / 350000000000
measurement_wall_time = 0.00156174314286ns == 0.002ns // Expected
measurement = total / count
measurement = 546608600ns / 350000000000
measurement = 0.00156173885714ns == 0.002ns // Expected
Any measurement that's below one nanosecond is very suspect. Try to cross-reference the results with the README, they should be roughly in the same ballpark.
Just as a note, the machine I've been running these tests on is Windows and the timers only have a resolution of 100ns which is also visible in the results. Only way to reduce this impact is with a larger measure time with less samples. (I've started testing locally with a MEASURE_TIME of 2.0 to try and help with that).
Also the reason why the number is so low for me looks like compiler optimizations.
The benchmark function for get_id("get_id", 1); gets compiled to the following.
Disassembly
void get_id(const char *label, int32_t id_count) {
00007FF67FEC3450 mov qword ptr [rsp+10h],rbx
00007FF67FEC3455 mov qword ptr [rsp+8],rcx
00007FF67FEC345A push rbp
00007FF67FEC345B push rsi
00007FF67FEC345C push rdi
00007FF67FEC345D push r12
00007FF67FEC345F push r13
00007FF67FEC3461 push r14
00007FF67FEC3463 push r15
00007FF67FEC3465 sub rsp,0A0h
00007FF67FEC346C movsxd r12,edx
ecs_world_t *world = ecs_mini();
00007FF67FEC346F call ecs_mini (07FF67FF6F4E0h)
ecs_entity_t *entities = create_ids(world, ENTITY_COUNT, 0, false);
00007FF67FEC3474 mov ecx,1F40h
ecs_world_t *world = ecs_mini();
00007FF67FEC3479 mov rsi,rax
ecs_entity_t *entities = create_ids(world, ENTITY_COUNT, 0, false);
00007FF67FEC347C call qword ptr [ecs_os_api+20h (07FF67FF98AB0h)]
00007FF67FEC3482 xor ebx,ebx
00007FF67FEC3484 mov r15d,3E8h
00007FF67FEC348A mov r13,rax
00007FF67FEC348D nop dword ptr [rax]
00007FF67FEC3490 mov rcx,rsi
00007FF67FEC3493 call ecs_new_id (07FF67FF6DAB0h)
00007FF67FEC3498 mov qword ptr [r13+rbx*8],rax
00007FF67FEC349D inc rbx
00007FF67FEC34A0 cmp rbx,r15
00007FF67FEC34A3 jl get_id+40h (07FF67FEC3490h)
ecs_entity_t *ids = create_ids(world, id_count, 4, true);
00007FF67FEC34A5 test r12d,r12d
00007FF67FEC34A8 je get_id+0C5h (07FF67FEC3515h)
00007FF67FEC34AA lea ecx,[r12*8]
00007FF67FEC34B2 call qword ptr [ecs_os_api+20h (07FF67FF98AB0h)]
00007FF67FEC34B8 mov r14,rax
00007FF67FEC34BB test r12d,r12d
00007FF67FEC34BE jle get_id+0C8h (07FF67FEC3518h)
00007FF67FEC34C0 mov rbx,rax
00007FF67FEC34C3 mov edi,r12d
00007FF67FEC34C6 mov rcx,rsi
00007FF67FEC34C9 call ecs_new_low_id (07FF67FF6FAF0h)
00007FF67FEC34CE mov qword ptr [rbx],rax
00007FF67FEC34D1 lea rcx,[rsp+0F0h]
00007FF67FEC34D9 mov r9d,8
00007FF67FEC34DF mov qword ptr [rsp+20h],rcx
00007FF67FEC34E4 mov rdx,rax
00007FF67FEC34E7 mov dword ptr [rsp+0F0h],4
00007FF67FEC34F2 mov rcx,rsi
00007FF67FEC34F5 mov dword ptr [rsp+0F4h],4
00007FF67FEC3500 lea r8d,[r9-7]
00007FF67FEC3504 call ecs_set_id (07FF67FF71070h)
00007FF67FEC3509 lea rbx,[rbx+8]
00007FF67FEC350D sub rdi,1
00007FF67FEC3511 jne get_id+76h (07FF67FEC34C6h)
00007FF67FEC3513 jmp get_id+0C8h (07FF67FEC3518h)
00007FF67FEC3515 xor r14d,r14d
for (int i = 0; i < id_count; i ++) {
00007FF67FEC3518 mov rdi,r13
00007FF67FEC351B nop dword ptr [rax+rax]
00007FF67FEC3520 xor ebx,ebx
00007FF67FEC3522 test r12d,r12d
00007FF67FEC3525 jle get_id+0F7h (07FF67FEC3547h)
00007FF67FEC3527 nop word ptr [rax+rax]
ecs_add_id(world, entities[e], ids[i]);
00007FF67FEC3530 mov r8,qword ptr [r14+rbx*8]
00007FF67FEC3534 mov rcx,rsi
00007FF67FEC3537 mov rdx,qword ptr [rdi]
00007FF67FEC353A call flecs_add_id (07FF67FF64930h)
for (int i = 0; i < id_count; i ++) {
00007FF67FEC353F inc rbx
00007FF67FEC3542 cmp rbx,r12
00007FF67FEC3545 jl get_id+0E0h (07FF67FEC3530h)
for (int e = 0; e < ENTITY_COUNT; e ++) {
00007FF67FEC3547 add rdi,8
00007FF67FEC354B sub r15,1
00007FF67FEC354F jne get_id+0D0h (07FF67FEC3520h)
}
}
bench_t b = bench_begin(label, ENTITY_COUNT * id_count);
00007FF67FEC3551 mov rax,qword ptr [label]
00007FF67FEC3559 lea rcx,[rsp+40h]
00007FF67FEC355E mov qword ptr [rsp+58h],rax
00007FF67FEC3563 xorps xmm0,xmm0
00007FF67FEC3566 imul eax,r12d,3E8h
00007FF67FEC356D xorps xmm1,xmm1
00007FF67FEC3570 movdqu xmmword ptr [rsp+38h],xmm0
00007FF67FEC3576 mov dword ptr [rsp+64h],r15d
00007FF67FEC357B movdqu xmmword ptr [rsp+48h],xmm1
00007FF67FEC3581 mov qword ptr [rsp+30h],1F4h
00007FF67FEC358A mov dword ptr [rsp+60h],eax
00007FF67FEC358E call qword ptr [ecs_os_api+0D8h (07FF67FF98B68h)]
00007FF67FEC3594 movups xmm0,xmmword ptr [rsp+30h]
00007FF67FEC3599 movups xmm1,xmmword ptr [rsp+40h]
00007FF67FEC359E movups xmmword ptr [b],xmm0
00007FF67FEC35A3 movups xmm0,xmmword ptr [rsp+50h]
00007FF67FEC35A8 movups xmmword ptr [rsp+78h],xmm1
00007FF67FEC35AD movsd xmm1,mmword ptr [rsp+60h]
00007FF67FEC35B3 movups xmmword ptr [rsp+88h],xmm0
00007FF67FEC35BB movsd mmword ptr [rsp+98h],xmm1
do {
for (int e = 0; e < ENTITY_COUNT; e ++) {
for (int i = 0; i < id_count; i ++) {
ecs_get_id(world, entities[e], ids[i]);
}
}
} while (bench_next(&b));
00007FF67FEC35C4 lea rcx,[b]
00007FF67FEC35C9 call bench_next (07FF67FEC52F0h)
00007FF67FEC35CE test al,al
00007FF67FEC35D0 jne get_id+174h (07FF67FEC35C4h)
bench_end(&b);
00007FF67FEC35D2 lea rcx,[b]
00007FF67FEC35D7 call bench_end (07FF67FEC4FF0h)
ecs_fini(world);
00007FF67FEC35DC mov rcx,rsi
00007FF67FEC35DF call ecs_fini (07FF67FF6FC10h)
ecs_os_free(entities);
00007FF67FEC35E4 mov rcx,r13
00007FF67FEC35E7 call qword ptr [ecs_os_api+28h (07FF67FF98AB8h)]
ecs_os_free(ids);
00007FF67FEC35ED mov rcx,r14
00007FF67FEC35F0 call qword ptr [ecs_os_api+28h (07FF67FF98AB8h)]
}
Ignoring all the init and fini code and focusing on the main loop we get.
bench_t b = bench_begin(label, ENTITY_COUNT * id_count);
// Construct `bench_t`
00007FF67FEC3551 mov rax,qword ptr [label]
00007FF67FEC3559 lea rcx,[rsp+40h]
00007FF67FEC355E mov qword ptr [rsp+58h],rax
00007FF67FEC3563 xorps xmm0,xmm0
00007FF67FEC3566 imul eax,r12d,3E8h
00007FF67FEC356D xorps xmm1,xmm1
00007FF67FEC3570 movdqu xmmword ptr [rsp+38h],xmm0
00007FF67FEC3576 mov dword ptr [rsp+64h],r15d
00007FF67FEC357B movdqu xmmword ptr [rsp+48h],xmm1
00007FF67FEC3581 mov qword ptr [rsp+30h],1F4h
00007FF67FEC358A mov dword ptr [rsp+60h],eax
// start the timer
00007FF67FEC358E call qword ptr [ecs_os_api+0D8h (07FF67FF98B68h)]
00007FF67FEC3594 movups xmm0,xmmword ptr [rsp+30h]
00007FF67FEC3599 movups xmm1,xmmword ptr [rsp+40h]
00007FF67FEC359E movups xmmword ptr [b],xmm0
00007FF67FEC35A3 movups xmm0,xmmword ptr [rsp+50h]
00007FF67FEC35A8 movups xmmword ptr [rsp+78h],xmm1
00007FF67FEC35AD movsd xmm1,mmword ptr [rsp+60h]
00007FF67FEC35B3 movups xmmword ptr [rsp+88h],xmm0
00007FF67FEC35BB movsd mmword ptr [rsp+98h],xmm1
do {
for (int e = 0; e < ENTITY_COUNT; e ++) {
for (int i = 0; i < id_count; i ++) {
ecs_get_id(world, entities[e], ids[i]);
}
}
} while (bench_next(&b));
// execute do {} while(bench_next())
00007FF67FEC35C4 lea rcx,[b]
00007FF67FEC35C9 call bench_next (07FF67FEC52F0h)
00007FF67FEC35CE test al,al
00007FF67FEC35D0 jne get_id+174h (07FF67FEC35C4h) // Jump to 00007FF67FEC35C4
This means its effectively just benchmarking the call to bench_next.
If I modify the loop to store the result of ecs_get_id, the compiler no longer optimizes away the loop.
const void *results[ENTITY_COUNT];
bench_t b = bench_begin(label, ENTITY_COUNT * id_count);
do {
for (int e = 0; e < ENTITY_COUNT; e ++) {
for (int i = 0; i < id_count; i ++) {
results[e] = ecs_get_id(world, entities[e], ids[i]);
}
}
} while (bench_next(&b));
bench_end(&b);
do_not_optimize((uint32_t)(uintptr_t)results[0]);
I end with results that seem reasonable.
| Benchmark | Measurement | Total (ns) | Count | Samples | Interval |
|---------------------------------------|--------------|--------------|----------------|--------------|--------------|
| get_id | 6.723ns | 2359510100 | 350976000 | 6 | 58496 | 2359510900 | 800 |
| get_16_ids | 6.787ns | 2380920200 | 350784000 | 6 | 3654 | 2380921200 | 1000 |
I pulled some of the fixes out into a separate PR which doesn't change how the results are measured. https://github.com/SanderMertens/ecs_benchmark/pull/3