stb_sprintf: support for "format" being a non-zero terminated string view?
Note: this is NOT a complete patch, more like opening a discussion.
To improve support for languages like Rust (and a growing tendency for C++ users to support string view/spans) we've been working toward supporting non-zero terminated strings in Dear ImGui. I found it has many extra advantages since many high-level languages and string types already have a length field calculated, and for literals the strlen() call can be done at compile-time.
We have a working branch but one niggle is that the sprintf functions we rely on (stdlib one or stb_sprintf) need a zero terminator for the format string parameter, which forces us to copy and zero-terminate them all. (I'm specifically taking about the format string itself needing to be zero-terminated, I am aware we can use for %.*s for other string arguments). Some of our functions tend to lead users to pass long decorated format strings (e.g. ImGui::Text("Hello world %d %s", ...) being a prime candidate for people sending large amount of text in a format string).
So I investigated the possibility of completely switching to embedding stb_sprintf and patching it to support this. Note that I'm not already using stb_printf, but I know some Dear ImGui are using it.
It's not a difficult change to implement BUT as I think stb_sprintf was designed for performances I mostly worry that the change of logic may have meaningful side effects.
If I was wondering if you or Jeff had unpublished perf tests you could run?
What this patch does:
- At the top of the
vsprintfcbcode it addsconst char* f_end = f + strlen(fmt);and then for the entire function makes small changes to use that pointer as an end marker and not rely on zero-termination, aka in theory it should never read fromf_end[0]. - This is a WIP patch solely designed to measure the potential side-effects (perf wise) of changing the main code.
- Hastily coded and not thoroughly tested, but pass test_sprintf (which btw has a failing test on 32-bit arch, commented that one out) but there aren't many things tested.
Performance consideration:
- Adding various (f < f_end) tests in the main function may add up overhead but I don't imagine so much. Might be beneficial to tweak that (e.g. use a decrementing counter?). people like jeff/ryg/sean may have better instinct what to do as I don't really know much about cpu/compilers.
- The strlen() is going to touch all of the memory early on, may not be so much of an issue as cache would be primed for the main work? Pathological case if format string is very large (e.g. larger than cache)?
If there a decision that perfs were ok, and that we could move onward: The remaining work would be, ihmo:
- Add a new function e.g.
_vsprintfcbn()withconst char* fmt, const char* fmt_endparameter (orconst char* fmt, size_t fmt_len, either way would work). - All other functions instead of calling
_vsprintfcb()would call_vsprintfcbn()with a strlen(). - Can we come up with some tests to validate that we are not reading past the limit? (allocate near page boundaries, with end-marker just at the frontier?)
- Make sure I didn't break anything.
(tagging @simonvanbernem who seems interested too)
I ran some indicative general Dear ImGui perf tests.
- There aren't designed as sprintf tests, more like testing average use of dear imgui. "perf_demo_all" is opening all of dear imgui demo/examples windows, which would lead to thousands of calls to sprintf and also a million other things happening. The other tests are spamming thousands of widgets (stuff like "perf_stress_button" is not expected to call sprintf at all, so it's a neutral test, the "coloredit" and "slider" ones would call sprintf)
- Our perf test framework may not be super precise not great. I manually removed a few outliers from the samples.
green = Using msvc2019 sprintf blue = using stb_sprintf orange = using stb_printf with my patch
That's merely there to give a rough confirmation that: using stb_sprintf in dear imgui makes a small but measurable improvement, my patch seems to slow things down a little (but maybe some RAD magic can improve that).
x64 release

x64 "debug" (everything is slower, printf impact gets drowned)

There are unpublished tests, but I don't know if they're performance tests or just correctness. I'll ask Jeff.
Ok, clearly "rrsprintfspeed.c" is a speed test (rrsprintf is the RAD-internal name for stb_sprintf). I'll see if I can get it building this weekend, although I also can just send it to you if you want to try it sooner. It does stuff like exhautisvely tests all non-NAN floats, don't know how long it takes to run.
Thanks for looking! If at a glance it looks like something you think I'll be able to build/run I don't mind trying.
(omar AT dearimgui dot org)
sent
I ran the tests sent by Sean (they more or less run through all possible values for many types).
Here's one of the run: (stbsp_sprintf = vanilla, stbsp_sprintf2 my changes)
Due to how I ran tests, sprintf was tested twice, left that in the samples.
(you may skim this part)
sprintf time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 152775
sprintf time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 152395
stbsp_sprintf time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 2575
stbsp_sprintf2 time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 6508
sprintf time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 152651
sprintf time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 152328
stbsp_sprintf time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 2558
stbsp_sprintf2 time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 6391
sprintf time for '%d': 68
sprintf time for '%d': 69
stbsp_sprintf time for '%d': 26
stbsp_sprintf2 time for '%d': 27
sprintf time for '%24d': 171346
sprintf time for '%24d': 167891
stbsp_sprintf time for '%24d': 41285
stbsp_sprintf2 time for '%24d': 39217
sprintf time for '%x': 82907
sprintf time for '%x': 82572
stbsp_sprintf time for '%x': 27779
stbsp_sprintf2 time for '%x': 28397
sprintf time for '%08x': 137956
sprintf time for '%08x': 133784
stbsp_sprintf time for '%08x': 30877
stbsp_sprintf2 time for '%08x': 29870
sprintf time for '%f': 76285
sprintf time for '%f': 76139
stbsp_printf time for '%f': 33368
stbsp_printf2 time for '%f': 33988
sprintf time for '%e': 77539
sprintf time for '%e': 78304
stbsp_printf time for '%e': 34007
stbsp_printf2 time for '%e': 34195
sprintf time for '%g': 78729
sprintf time for '%g': 80720
stbsp_printf time for '%g': 33007
stbsp_printf2 time for '%g': 33386
sprintf time for '%f l': 125701
sprintf time for '%f l': 123365
stbsp_sprintf time for '%f l': 31882
stbsp_sprintf2 time for '%f l': 33326
sprintf time for '%f b': 507478
sprintf time for '%f b': 483318
stbsp_sprintf time for '%f b': 25873
stbsp_sprintf2 time for '%f b': 27983
sprintf time for '%e l': 129690
sprintf time for '%e l': 128700
stbsp_sprintf time for '%e l': 34656
stbsp_sprintf2 time for '%e l': 34839
sprintf time for '%e b': 111196
sprintf time for '%e b': 111247
stbsp_sprintf time for '%e b': 34471
stbsp_sprintf2 time for '%e b': 33943
sprintf time for '%.320f': 495866
sprintf time for '%.320f': 468650
stbsp_sprintf time for '%.320f': 22945
stbsp_sprintf2 time for '%.320f': 22968
sprintf time for '%a': 51489
sprintf time for '%a': 54392
stbsp_sprintf time for '%a': 17358
stbsp_sprintf2 time for '%a': 17036
sprintf time for '%I64d': 80013
sprintf time for '%I64d': 79855
stbsp_sprintf time for '%I64d': 25714
stbsp_sprintf2 time for '%I64d': 28589
sprintf time for '%I64d': 83257
sprintf time for '%I64d': 83446
stbsp_sprintf time for '%I64d': 16415
stbsp_sprintf2 time for '%I64d': 17243
sprintf time for '%s%s%s': 10314
sprintf time for '%s%s%s': 9861
stbsp_sprintf time for '%s%s%s': 5652
stbsp_sprintf2 time for '%s%s%s': 4904
sprintf time for '0123456789012345678901234567890123456789012345678901234567890123%d': 234932
sprintf time for '0123456789012345678901234567890123456789012345678901234567890123%d': 235027
stbsp_sprintf time for '0123456789012345678901234567890123456789012345678901234567890123%d': 10493
stbsp_sprintf2 time for '0123456789012345678901234567890123456789012345678901234567890123%d': 14550
I did other runs to narrow down the noise. The meaningful differences are on those three tests:
stbsp_sprintf time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 2540
stbsp_sprintf time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 2547
stbsp_sprintf time for '0123456789012345678901234567890123456789012345678901234567890123%d': 10275
stbsp_sprintf2 time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 6681
stbsp_sprintf2 time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 6596
stbsp_sprintf2 time for '0123456789012345678901234567890123456789012345678901234567890123%d': 14625
Difference basically amount to the code that copies the unformatted chunks of text until finding a %.
Technically speaking the tests don't spend too much time measuring parsing of many options/flags. I believe my changes would make that part also slower but in the grand scheme of this that time should be not important compared to A) the dumb-copy throughput, and B) general formatted output (for which the code is unchanged)
So basically it comes down to optimize that one loop:
Original
// fast copy everything up to the next % (or end of string)
for (;;) {
while (((stbsp__uintptr)f) & 3) {
schk1:
if (f[0] == '%')
goto scandd;
schk2:
if (f[0] == 0)
goto endfmt;
stbsp__chk_cb_buf(1);
*bf++ = f[0];
++f;
}
for (;;) {
// Check if the next 4 bytes contain %(0x25) or end of string.
// Using the 'hasless' trick:
// https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
stbsp__uint32 v, c;
v = *(stbsp__uint32 *)f;
c = (~v) & 0x80808080;
if (((v ^ 0x25252525) - 0x01010101) & c)
goto schk1;
if ((v - 0x01010101) & c)
goto schk2;
if (callback)
if ((STB_SPRINTF_MIN - (int)(bf - buf)) < 4)
goto schk1;
#ifdef STB_SPRINTF_NOUNALIGNED
if(((stbsp__uintptr)bf) & 3) {
bf[0] = f[0];
bf[1] = f[1];
bf[2] = f[2];
bf[3] = f[3];
} else
#endif
{
*(stbsp__uint32 *)bf = v;
}
bf += 4;
f += 4;
}
}
scandd:
My initial version (slower)
// fast copy everything up to the next % (or end of string)
for (;;) {
while (((stbsp__uintptr)f) & 3) {
schk1:
if (f == f_end)
goto endfmt;
schk2:
if (f[0] == '%')
goto scandd;
stbsp__chk_cb_buf(1);
*bf++ = f[0];
++f;
}
for (;;) {
// Check if the next 4 bytes contain %(0x25) or end of string.
// Using the 'hasless' trick:
// https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
stbsp__uint32 v, c;
v = *(stbsp__uint32 *)f;
c = (~v) & 0x80808080;
if (((v ^ 0x25252525) - 0x01010101) & c)
goto schk2;
if (f_end - f < 4)
goto schk1;
if (callback)
if ((STB_SPRINTF_MIN - (int)(bf - buf)) < 4)
goto schk1;
#ifdef STB_SPRINTF_NOUNALIGNED
if(((stbsp__uintptr)bf) & 3) {
bf[0] = f[0];
bf[1] = f[1];
bf[2] = f[2];
bf[3] = f[3];
} else
#endif
{
*(stbsp__uint32 *)bf = v;
}
bf += 4;
f += 4;
}
}
scandd:
I'll see if I can come up with an optimization...
The way you swapped the goto-Labels and the two checks in the first loop, we won't check for end-of-string anymore, if the "if (((v ^ 0x25252525) - 0x01010101) & c)" branch is taken. Is this intended?
There are two cases of performance change to look at:
- How much slower is the non-NUL-terminated new version vs. the old version of the library.
- How much slower is the NUL-terminated new version vs. the old version of the library.
I imagine for your personal use, you kind of care about case 1, but not THAT much since it's more a question of performance vs. your alternatives; NUL-terminated isn't an option for the new version. And you can always embed it into ImGui regardless so you don't need to worry about regressing stb_sprintf.h performance.
For the possibility of integrating this into stb_sprintf itself, case 2 matters, because we don't want to cause a significant performance loss for people using this library from C/C++, which is the primary use case.
I'm not 100% sure which case you're testing here, especially as your initial description seemed addressed to the possibility of embedding it rather than making a shippable PR? But presumably the test harness would be easier to adapt using the NUL interface.
So anyway, if you are testing case 2, another possibility is you're at least partly looking at the overhead of scanning the string twice; once for strlen() and once for non-NUL-sprintf(). I'm not 100% sure how realistic any of these tests are since the string is going to be fully in cache after the first iteration, but for the hot-case I would guess it's plausible that iterating the string twice could be a significant cost. If that IS the case (I guess I wouldn't expect it to triple the time though), one possible workaround would be to have the outer layer of stb_sprintf, which contains the strlen, to scan & copy up to the first % before doing the strlen and then hand off to sprintf. (Actually, should handle %% as well.) This wouldn't cover cases where there's a lot of unformatted text AFTER the first formatted item, but I don't see a good way to address that case and this would be pretty easy to do.
BTW, it looks like the tests don't have any cases like "%6.2f", which is the sort of thing I tend to use in profilers so might be an important case for you. Probably not significantly affected by the NUL stuff though so not worth worrying about here, but maybe worth testing against regular sprintf.
The way you swapped the goto-Labels and the two checks in the first loop, we won't check for end-of-string anymore, if the "if (((v ^ 0x25252525) - 0x01010101) & c)" branch is taken. Is this intended?
Right, if within the same 4 bytes there are both a % and end of string, the first iteration won't check for end-of-string, so I guess the goto schk2 can be changed to goto schk1, my bad.
I'm not 100% sure which case you're testing here, especially as your initial description seemed addressed to the possibility of embedding it rather than making a shippable PR? But presumably the test harness would be easier to adapt using the NUL interface.
In the current test I'm adding a strlen() so I'm essentially testing case 2, as the assumption is that we'd only have one version of the inner-loop which would function with an end pointer. Of course that's only a viable assumption is we can make keep it fast enough.
(Case 1 will always be a little faster than case 2 for the stb_printf call tree since there's no strlen)
Thanks for the suggestion, that's a good idea. First I'm going to see if I can make the existing loop as fast as previous version.
BTW, it looks like the tests don't have any cases like "%6.2f", which is the sort of thing I tend to use in profilers so might be an important case for you
Right. Adding to my test cases. As an indication:
sprintf time for '%6.2f': 54103
stbsp_printf time for '%6.2f': 16920
stbsp_printf2 time for '%6.2f': 17275
EDIT: Using stbsp__strlen() (simplified version of stbsp__strlen_limited() instead of strlen yield measurable wins.
original zero-terminated
stbsp_sprintf time 2540 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2547 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 10275 for '0123456789012345678901234567890123456789012345678901234567890123%d'
vs new zero-terminated using strlen + non-zero terminated code
stbsp_sprintf2 time 3956 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf2 time 3968 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf2 time 11679 for '0123456789012345678901234567890123456789012345678901234567890123%d'
EDIT: Using
stbsp__strlen()(simplified version ofstbsp__strlen_limited()instead of strlen yield measurable wins. original zero-terminated
Kind of depressing since it means strlen() on whichever platform you're using isn't already doing four-at-a-time (or even SIMD).
Kind of depressing since it means strlen() on whichever platform you're using isn't already doing four-at-a-time (or even SIMD).
VS2015 and VS2019 "Release" build are inlining a dumb version when intrinsic are enabled....
int fmt_len = strlen(fmt);// stbsp__strlen(fmt);
00007FF75C8445EA 49 C7 C2 FF FF FF FF mov r10,0FFFFFFFFFFFFFFFFh
00007FF75C8445F1 49 FF C2 inc r10
00007FF75C8445F4 43 80 3C 10 00 cmp byte ptr [r8+r10],0
00007FF75C8445F9 75 F6 jne stbsp_vsnprintf+21h (07FF75C8445F1h)
Confirmed.... https://developercommunity.visualstudio.com/t/it-seems-that-visual-studios-cpp-compiler-creates/737594 https://developercommunity.visualstudio.com/t/msconnect-1218177-inlined-intrinsic-strlen-is-not/204272
EDIT: Amusingly it means that in "Debug" build even with the additional jumps hoops we end up in a faster version within ucrtbased.dll.... (it does it 8 at a times on my config)
I duplicated the copying loop (~42 lines) to provide both a zero-terminated and a pointer-terminated version.
fmt_end gets set to ~0 when using a zero-terminated function. The flag parsing code always test for f < fmt_end.
Technically speaking we could provide two versions of the whole function but this seems like a good compromise.
// v0 (vanilla stb_printf, zero-terminated input)
stbsp_sprintf time 2553 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2550 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf time 6596 for '%6.2f'
stbsp_sprintf time 10379 for '0123456789012345678901234567890123456789012345678901234567890123%d'
// v5 (dual path, zero-terminated input)
stbsp_sprintf time 2610 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2604 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf time 6606 for '%6.2f'
stbsp_sprintf time 9681 for '0123456789012345678901234567890123456789012345678901234567890123%d'
// v5 (dual path, not zero-terminated input, fmt_end provided by caller)
stbsp_sprintf time 2493 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2493 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf time 6634 for '%6.2f'
stbsp_sprintf time 10558 for '0123456789012345678901234567890123456789012345678901234567890123%d'
For reference, those two earlier versions are giving an indication of the cost of the extra stbsp_strlen() done everytime:
// v3 (zero-terminated input + code supporting non-zero-terminated + function always calling our strlen)
stbsp_sprintf time 3970 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 3970 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf time 6658 for '%6.2f'
stbsp_sprintf time 11809 for '0123456789012345678901234567890123456789012345678901234567890123%d'
// v4 (non-zero terminated input: with fmt_end provided by caller)
stbsp_sprintf time 2470 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2467 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf time 6642 for '%6.2f'
stbsp_sprintf time 9461 for '0123456789012345678901234567890123456789012345678901234567890123%d'
Difference between v3 and v4 is only that v4 is providing fmt_end at the call site (doing strlen once).
Will run entire test suite again but rough idea is that the typical zero-terminated path may generally be 3-5% slower with that change, while providing non-zero terminated support. Considering that stb_printf claims x35 speed compared to VS on long strings, and x5 to x12 on various data types, the +5% loss may be ok? (other solution is to compile-time define all of it)
Looking pretty close. I notice v5 non-nul terminated is now slower than the nul-teriminated for the unformatted strings case, is this still using your "initial version" of the copying loop for the fmt_end case, or have you made further optimizations to it?
the typical zero-terminated path may generally be 3-5% slower with that change, while providing non-zero terminated support. Considering that stb_printf claims x35 speed compared to VS on long strings, and x5 to x12 on various data types, the +5% loss may be ok?
I think I'd be generally ok with the perf lost for the feature gain, but some thoughts:
- my initial reaction was that this is a pretty frivolous feature and not worth a perf loss ever (given you could just embed your version yourself), but on reflection it's possibly interesting to have access to C-style printf from other languages AT ALL. In the early days of stb_image, one of the first hits for "stb_image" on google was a Haskell wrapper.
- my one caveat to the perf loss being acceptable is it only makes sense if there's not a lot of new code (i'm ok with changes). so two versions of a 42-line copy loop is a bit awkward, as is having our own strlen implementation. OTOH if we can get the slowdown closer to 0% I'd be willing to accept a bigger increase in size.
- but maybe if I give it some time to reflect on it, I'll be ok with it as is anyway
VS2015 and VS2019 "Release" build are inlining a dumb version when intrinsic are enabled....
This got me curious so I did a little digging. Short version: Turns out Microsoft only optimized strnlen for some reason, might be worth trying it in the context of this PR.
For some reason Microsoft is still using frumpy old hand-written assembly for their strlen implementation. (You can read the x64 version in C:\Program Files (x86)\Windows Kits\10\Source\10.0.19041.0\ucrt\string\amd64\strlen.asm)
Interestingly, strnlen is written in C++ and has alternate paths for AVX2 and SSE2. (C:\Program Files (x86)\Windows Kits\10\Source\10.0.19041.0\ucrt\string\strnlen.cpp)
I did a quick and dirty microbenchmark with strnlen imitating strlen (IE: strnlen(str, ((size_t)-1) - ((size_t)str));), and indeed strnlen is around 10 times faster in x64 release builds:
Testing with 55316 character string with 1000000 iterations...
strlen: 55316000000, 55316, time: 149116311 ticks total, 14911.631100 ns per iteration
strnlen: 55316000000, 55316, time: 14182812 ticks total, 1418.281200 ns per iteration
strnlen is 10.513875 faster than strlen
(Funnily enough, `strlen` still wins in debug builds.)
Testing with 55316 character string with 1000000 iterations...
strlen: 55316000000, 55316, time: 37479447 ticks total, 3747.944700 ns per iteration
strnlen: 55316000000, 55316, time: 142507789 ticks total, 14250.778900 ns per iteration
strnlen is 0.262999 faster than strlen
as is having our own strlen implementation.
That part got actually removed in v5, since the zero-terminating case doesn't need it anymore.
(fyi stb_printf already had stbsp__strlen_limited(), the one I added in v2 was a trimmed down version)
Quick stats
Current version goes from 1477 to 1544 lines (+67 lines) and introduce two new symbols vsprintfcbnz and sprintfnz. The later sprintfnz I only added to make it easier for benchmark (adds +11 lines because of the va_list cruft). We could either decide to only provide the new parameter in the "most-complete" function vsprintfcbnz() or add **nz versions to all of them.
Your question on the loop:
The non-zerminated loop has changed from initial version
// Check if the next 4 bytes contain %(0x25) or end of string.
// Using the 'hasless' trick:
// https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
stbsp__uint32 v, c;
v = *(stbsp__uint32 *)f;
c = (~v) & 0x80808080;
if (((v ^ 0x25252525) - 0x01010101) & c)
goto schk2;
if (f_end - f < 4)
goto schk1;
if (callback)
if ((STB_SPRINTF_MIN - (int)(bf - buf)) < 4)
goto schk2;
to
// Check if the next 4 bytes contain %(0x25) or end of string.
// Using the 'hasless' trick:
// https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
if (f_end - f < 4)
goto schk1;
stbsp__uint32 v, c;
v = *(stbsp__uint32 *)f;
c = (~v) & 0x80808080;
if (((v ^ 0x25252525) - 0x01010101) & c)
goto schk1;
if (callback)
if ((STB_SPRINTF_MIN - (int)(bf - buf)) < 4)
goto schk1;
As I could reliable measure a difference.
I can't well explain the result there (notice the last one has a %d at the end), but the measurements comes back generally the same every time I run them.
// v0 (vanilla)
stbsp_sprintf time 2556 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2551 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
// v5 (dual path, zero-terminated input)
stbsp_sprintf time 2610 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2604 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 9681 for '0123456789012345678901234567890123456789012345678901234567890123%d'
// v5 (dual path, not zero-terminated input, fmt_end provided by caller)
stbsp_sprintf time 2493 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2493 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 10558 for '0123456789012345678901234567890123456789012345678901234567890123%d'
Adding a %d in two first long strings:
// v0 (vanilla)
stbsp_sprintf time 2974 for 'stringstringstringstringstring%dstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2970 for 'strin stringstring tringstring tringstring%d tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
// v5 (dual path, zero-terminated input)
stbsp_sprintf time 2999 for 'stringstringstringstringstring%dstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2971 for 'strin stringstring tringstring tringstring%d tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
// v5 (dual path, not zero-terminated input, fmt_end provided by caller)
stbsp_sprintf time 2992 for 'stringstringstringstringstring%dstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2962 for 'strin stringstring tringstring tringstring%d tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
Current version goes from 1477 to 1544 lines (+67 lines) and introduce two new symbols
vsprintfcbnzandsprintfnz.
Sounds reasonable. (I wish we could get by with just the strlen() in front and all the code shared, but clearly that isn't happening.)
The later
sprintfnzI only added to make it easier for benchmark (adds +11 lines because of the va_list cruft). We could either decide to only provide the new parameter in the "most-complete" functionvsprintfcbnz()or add**nzversions to all of them.
I'm ok with having nz versions for all three functions. But there are five I think, so that brings us to another issue. I'd recommend only implementing versions of nz functions that take an explicit output length, but moreover, they probably should write nz output. So we probably need a way to return that output length?
I'd recommend only implementing versions of nz functions that take an explicit output length, but moreover, they probably should write nz output. So we probably need a way to return that output length?
Don't we already always return output length? (not sure I understand all the callback stuff and specs when not fitting). EDIT
char buf[256];
int n;
n = stbsp_snprintf(buf, 10, "hellohellohello");
printf("%d\n", n);
printf("'%s'\n", buf);
15
'hellohell'
Right, so I think it'd be great to add for completeness. But now I am getting paranoid about adding parameters (read below).
And this is what a compile-time option version would roughly look like https://github.com/nothings/stb/pull/1181/commits/60a999d14a80d34aa6d14c4bd6a5843d4afc5473
(
Except currently I am running into a hard to understand thing were the compiler starts optimizing differently based on something I don't understand. Spent the last hour combing through it, but basically if I revert to vanilla version and add int fmt_len parameter (UNUSED) + one replacement wrapping function:
STBSP__PUBLICDEC int STB_SPRINTF_DECORATE(vsprintfcbnz)(STBSP_SPRINTFCB* callback, void* user, char* buf, char const* fmt, int /*fmt_len*/, va_list va); + vsprintfcb() now calling vsprintfcbnz()
Then my numbers for some tests are instantly slower by 20%...
If I move fmt_len to first parameter I am back at the original speed. I assume there's some code reuse or tail-call thing happening with matching stack layout, but I'm surprised by the large difference and I don't really understand it. [Note that I am currently technically compiling in C++]
And so it happens on vanilla codebase but also adding that #ifdef stuff over v5 made it happen and I'm super puzzled.
EDIT: I was wondering why callback, user parameters were the first and not last ones in vsprintfcb() and I wonder if it is due to the same thing.
)
I'd recommend only implementing versions of nz functions that take an explicit output length, but moreover, they probably should write nz output. So we probably need a way to return that output length?
Don't we already always return output length? (not sure I understand all the callback stuff and specs when not fitting).
Oh yes, of course we do always return output length.
However, it probably includes the length of the terminating NUL. We should presumably engineer the nz API so it (a) returns the number of non-NUL characters, (b) doesn't write the NUL, and (c) if you give it a buffer of length 10, it will write 10 non-NUL characters to it. (We can presumably do that by lying about the buffer length to ourselves and then suppressing the terminating NUL.)
Right, so I think it'd be great to add for completeness. But now I am getting paranoid about adding parameters (read below).
I think we changed the behavior of this, I think it used to return the length written, but for compatibility with other sprintfs it was changed to always return the full required length? Hmm, no, that seems the opposite of what other sprintfs do. Pretty sure it changed at some point, or at least that people have complained about this.
But the nz versions can do their own thing regardless. Although they could return the full required length and let the caller do a min() on the passed-in length, we can also just change the semantics for nz and have it only give the full length on passing in NUL.
I think the argument for the current behavior is the scenario where you have a small buffer on the stack, and you'll malloc a large buffer if it's not enough. You want to call sprintf once with the stack, and if that's enough you're done. If that's not enough, you don't want to make ANOTHER sprintf call to find out the required length, malloc, and then make a THIRD sprintf call if you can do it all with two calls. (This sounds like a Jeff argument to me.)
Current behavior seems fine, and I recall the thread arguing for matching standard (I probably voted for that).
IMHO the only value/purpose of adding nz output support would to save 1 character and allow users to retrieve a 10-long string when providing a 10-long buffer.
It's possible but seems like it would add unnecessary complexity/confusion to the whole mix?
When wrapped to other languages I don't assume many (if any) languages would care about user providing precisely size buffers, and expecting the last char to be used. In those languages people are more likely to wrap things. Actually even in C++ I'd expect people to wrap stb_printf if they use it with fancy non-zero terminated string types, so the user-specified size is not going to ever be a hard requirement ("I need ten digits to fit in my char[10]") ? Also we'd not be breaking anything by not adding that.
I'm thinking of systems languages like Rust or whatever, where maybe people super do pay attention to the sizes of things. I'm concerned about somebody going "I'm going to print this 16-bit integer into a 5-byte buffer". You can't fix this by wrapping it, you have to specify "you must pass in a buffer one larger than you actually want".
Can we make the core function not append a NUL, and have the NUL-supporting "wrapper" (that used to do a strlen) add the NUL? And then one or the other of the wrappers can add or subtract one from the buffer length as needed. (It would be semantically clear if the lib was purely non-NUL-outputting with a no-NUL buffer size, and the NUL-supporting wrapper subtracts one from the buffer size.)
Will look into that (may do a separate PR to be able to get that one out of the way soon).
FYI results from all tests (VS2019, x64, /Ox)
// std sprintf
sprintf time 152775 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
sprintf time 152651 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
sprintf time 68 for '%d'
sprintf time 171346 for '%24d'
sprintf time 82907 for '%x'
sprintf time 137956 for '%08x'
sprintf time 76285 for '%f'
sprintf time 77539 for '%e'
sprintf time 78729 for '%g'
sprintf time 125701 for '%f l'
sprintf time 507478 for '%f b'
sprintf time 129690 for '%e l'
sprintf time 111196 for '%e b'
sprintf time 495866 for '%.320f'
sprintf time 51489 for '%a'
sprintf time 80013 for '%I64d'
sprintf time 83257 for '%I64d'
sprintf time 10314 for '%s%s%s'
sprintf time 234932 for '0123456789012345678901234567890123456789012345678901234567890123%d'
// v0
stbsp_sprintf time 2554 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2555 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 33 for '%d'
stbsp_sprintf time 40001 for '%24d'
stbsp_sprintf time 27725 for '%x'
stbsp_sprintf time 30809 for '%08x'
stbsp_printf time 33505 for '%f'
stbsp_printf time 34493 for '%e'
stbsp_printf time 33246 for '%g'
stbsp_printf time 6588 for '%6.2f'
stbsp_sprintf time 31861 for '%f l'
stbsp_sprintf time 25672 for '%f b'
stbsp_sprintf time 34911 for '%e l'
stbsp_sprintf time 34951 for '%e b'
stbsp_sprintf time 22265 for '%.320f'
stbsp_sprintf time 17445 for '%a'
stbsp_sprintf time 24924 for '%I64d'
stbsp_sprintf time 15995 for '%I64d'
stbsp_sprintf time 5756 for '%s%s%s'
stbsp_sprintf time 10420 for '0123456789012345678901234567890123456789012345678901234567890123%d'
// v5 (zero-terminated)
stbsp_sprintf time 2597 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2596 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 27 for '%d'
stbsp_sprintf time 40363 for '%24d'
stbsp_sprintf time 28935 for '%x'
stbsp_sprintf time 32287 for '%08x'
stbsp_printf time 33717 for '%f'
stbsp_printf time 34429 for '%e'
stbsp_printf time 33507 for '%g'
stbsp_printf time 6610 for '%6.2f'
stbsp_sprintf time 33054 for '%f l'
stbsp_sprintf time 29306 for '%f b'
stbsp_sprintf time 35913 for '%e l'
stbsp_sprintf time 34340 for '%e b'
stbsp_sprintf time 25474 for '%.320f'
stbsp_sprintf time 17332 for '%a'
stbsp_sprintf time 25826 for '%I64d'
stbsp_sprintf time 16131 for '%I64d'
stbsp_sprintf time 5826 for '%s%s%s'
stbsp_sprintf time 9720 for '0123456789012345678901234567890123456789012345678901234567890123%d'
// v5 (non zero-terminated, length of "format" calculated once at call-site)
stbsp_sprintf time 2496 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2491 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 26 for '%d'
stbsp_sprintf time 42418 for '%24d'
stbsp_sprintf time 29202 for '%x'
stbsp_sprintf time 32067 for '%08x'
stbsp_printf time 33865 for '%f'
stbsp_printf time 34555 for '%e'
stbsp_printf time 33557 for '%g'
stbsp_printf time 6659 for '%6.2f'
stbsp_sprintf time 32878 for '%f l'
stbsp_sprintf time 29036 for '%f b'
stbsp_sprintf time 35627 for '%e l'
stbsp_sprintf time 33958 for '%e b'
stbsp_sprintf time 25323 for '%.320f'
stbsp_sprintf time 17255 for '%a'
stbsp_sprintf time 26509 for '%I64d'
stbsp_sprintf time 16606 for '%I64d'
stbsp_sprintf time 5871 for '%s%s%s'
stbsp_sprintf time 10502 for '0123456789012345678901234567890123456789012345678901234567890123%d'
I put it in a spreadsheet. Should look at those 0.87/0.88 x slowdowns and see if they're repeatable, because it shouldn't be spending much time parsing the format string. ("speedup" columns are the speedup going from the column immediately on the left to the column immediately on the right.)
| sprintf | v0 speedup | stbsp v0 | v5 speedup | stbsp v5 | fmt |
|---|---|---|---|---|---|
| 152775 | 59.82 | 2554 | 0.98 | 2597 | for 'stringstringstringstringstring..'. |
| 152651 | 59.75 | 2555 | 0.98 | 2596 | for 'strin stringstring tringstring...' |
| 68 | 2.06 | 33 | 1.22 | 27 | for '%d' |
| 171346 | 4.28 | 40001 | 0.99 | 40363 | for '%24d' |
| 82907 | 2.99 | 27725 | 0.96 | 28935 | for '%x' |
| 137956 | 4.48 | 30809 | 0.95 | 32287 | for '%08x' |
| 76285 | 2.28 | 33505 | 0.99 | 33717 | for '%f' |
| 77539 | 2.25 | 34493 | 1.00 | 34429 | for '%e' |
| 78729 | 2.37 | 33246 | 0.99 | 33507 | for '%g' |
| 6588 | 1.00 | 6610 | for '%6.2f' | ||
| 125701 | 3.95 | 31861 | 0.96 | 33054 | for '%f l' |
| 507478 | 19.77 | 25672 | 0.88 | 29306 | for '%f b' |
| 129690 | 3.71 | 34911 | 0.97 | 35913 | for '%e l' |
| 111196 | 3.18 | 34951 | 1.02 | 34340 | for '%e b' |
| 495866 | 22.27 | 22265 | 0.87 | 25474 | for '%.320f' |
| 51489 | 2.95 | 17445 | 1.01 | 17332 | for '%a' |
| 80013 | 3.21 | 24924 | 0.97 | 25826 | for '%I64d' |
| 83257 | 5.21 | 15995 | 0.99 | 16131 | for '%I64d' |
| 10314 | 1.79 | 5756 | 0.99 | 5826 | for '%s%s%s' |
| 234932 | 22.55 | 10420 | 1.07 | 9720 | for '012345678901234567890123456...%d' |
Jeff suggests that the 0.87 "speedup" for '%.320f' can't possibly have anything to do with the time spent scanning the format string, so it must be due to some unrelated change like alignment or optimizations.
(Sorry been in holidays and put that aside, will get back to it but I’d rather be on same computer to follow up on that)
Ok, clearly "rrsprintfspeed.c" is a speed test (rrsprintf is the RAD-internal name for stb_sprintf)
Any chance that this file could be made public?