stb icon indicating copy to clipboard operation
stb copied to clipboard

stb_sprintf: support for "format" being a non-zero terminated string view?

Open ocornut opened this issue 4 years ago • 29 comments

Note: this is NOT a complete patch, more like opening a discussion.

To improve support for languages like Rust (and a growing tendency for C++ users to support string view/spans) we've been working toward supporting non-zero terminated strings in Dear ImGui. I found it has many extra advantages since many high-level languages and string types already have a length field calculated, and for literals the strlen() call can be done at compile-time.

We have a working branch but one niggle is that the sprintf functions we rely on (stdlib one or stb_sprintf) need a zero terminator for the format string parameter, which forces us to copy and zero-terminate them all. (I'm specifically taking about the format string itself needing to be zero-terminated, I am aware we can use for %.*s for other string arguments). Some of our functions tend to lead users to pass long decorated format strings (e.g. ImGui::Text("Hello world %d %s", ...) being a prime candidate for people sending large amount of text in a format string).

So I investigated the possibility of completely switching to embedding stb_sprintf and patching it to support this. Note that I'm not already using stb_printf, but I know some Dear ImGui are using it.

It's not a difficult change to implement BUT as I think stb_sprintf was designed for performances I mostly worry that the change of logic may have meaningful side effects.

If I was wondering if you or Jeff had unpublished perf tests you could run?

What this patch does:

  • At the top of the vsprintfcb code it adds const char* f_end = f + strlen(fmt); and then for the entire function makes small changes to use that pointer as an end marker and not rely on zero-termination, aka in theory it should never read from f_end[0].
  • This is a WIP patch solely designed to measure the potential side-effects (perf wise) of changing the main code.
  • Hastily coded and not thoroughly tested, but pass test_sprintf (which btw has a failing test on 32-bit arch, commented that one out) but there aren't many things tested.

Performance consideration:

  • Adding various (f < f_end) tests in the main function may add up overhead but I don't imagine so much. Might be beneficial to tweak that (e.g. use a decrementing counter?). people like jeff/ryg/sean may have better instinct what to do as I don't really know much about cpu/compilers.
  • The strlen() is going to touch all of the memory early on, may not be so much of an issue as cache would be primed for the main work? Pathological case if format string is very large (e.g. larger than cache)?

If there a decision that perfs were ok, and that we could move onward: The remaining work would be, ihmo:

  • Add a new function e.g. _vsprintfcbn() with const char* fmt, const char* fmt_end parameter (or const char* fmt, size_t fmt_len, either way would work).
  • All other functions instead of calling _vsprintfcb() would call _vsprintfcbn() with a strlen().
  • Can we come up with some tests to validate that we are not reading past the limit? (allocate near page boundaries, with end-marker just at the frontier?)
  • Make sure I didn't break anything.

(tagging @simonvanbernem who seems interested too)

ocornut avatar Jul 27 '21 15:07 ocornut

I ran some indicative general Dear ImGui perf tests.

  • There aren't designed as sprintf tests, more like testing average use of dear imgui. "perf_demo_all" is opening all of dear imgui demo/examples windows, which would lead to thousands of calls to sprintf and also a million other things happening. The other tests are spamming thousands of widgets (stuff like "perf_stress_button" is not expected to call sprintf at all, so it's a neutral test, the "coloredit" and "slider" ones would call sprintf)
  • Our perf test framework may not be super precise not great. I manually removed a few outliers from the samples.

green = Using msvc2019 sprintf blue = using stb_sprintf orange = using stb_printf with my patch

That's merely there to give a rough confirmation that: using stb_sprintf in dear imgui makes a small but measurable improvement, my patch seems to slow things down a little (but maybe some RAD magic can improve that).

x64 release image

x64 "debug" (everything is slower, printf impact gets drowned) image

ocornut avatar Jul 27 '21 15:07 ocornut

There are unpublished tests, but I don't know if they're performance tests or just correctness. I'll ask Jeff.

nothings avatar Jul 27 '21 16:07 nothings

Ok, clearly "rrsprintfspeed.c" is a speed test (rrsprintf is the RAD-internal name for stb_sprintf). I'll see if I can get it building this weekend, although I also can just send it to you if you want to try it sooner. It does stuff like exhautisvely tests all non-NAN floats, don't know how long it takes to run.

nothings avatar Jul 27 '21 16:07 nothings

Thanks for looking! If at a glance it looks like something you think I'll be able to build/run I don't mind trying. (omar AT dearimgui dot org)

ocornut avatar Jul 27 '21 16:07 ocornut

sent

nothings avatar Jul 27 '21 16:07 nothings

I ran the tests sent by Sean (they more or less run through all possible values for many types).

Here's one of the run: (stbsp_sprintf = vanilla, stbsp_sprintf2 my changes) Due to how I ran tests, sprintf was tested twice, left that in the samples.

(you may skim this part)

sprintf time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 152775
sprintf time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 152395
stbsp_sprintf  time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 2575
stbsp_sprintf2 time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 6508

sprintf time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 152651
sprintf time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 152328
stbsp_sprintf time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 2558
stbsp_sprintf2 time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 6391

sprintf time for '%d': 68
sprintf time for '%d': 69
stbsp_sprintf time for '%d': 26
stbsp_sprintf2 time for '%d': 27

sprintf time for '%24d': 171346
sprintf time for '%24d': 167891
stbsp_sprintf time for '%24d': 41285
stbsp_sprintf2 time for '%24d': 39217

sprintf time for '%x': 82907
sprintf time for '%x': 82572
stbsp_sprintf time for '%x': 27779
stbsp_sprintf2 time for '%x': 28397

sprintf time for '%08x': 137956
sprintf time for '%08x': 133784
stbsp_sprintf time for '%08x': 30877
stbsp_sprintf2 time for '%08x': 29870

sprintf time for '%f': 76285
sprintf time for '%f': 76139
stbsp_printf time for '%f': 33368
stbsp_printf2 time for '%f': 33988

sprintf time for '%e': 77539
sprintf time for '%e': 78304
stbsp_printf time for '%e': 34007
stbsp_printf2 time for '%e': 34195

sprintf time for '%g': 78729
sprintf time for '%g': 80720
stbsp_printf time for '%g': 33007
stbsp_printf2 time for '%g': 33386

sprintf time for '%f l': 125701
sprintf time for '%f l': 123365
stbsp_sprintf time for '%f l': 31882
stbsp_sprintf2 time for '%f l': 33326

sprintf time for '%f b': 507478
sprintf time for '%f b': 483318
stbsp_sprintf time for '%f b': 25873
stbsp_sprintf2 time for '%f b': 27983

sprintf time for '%e l': 129690
sprintf time for '%e l': 128700
stbsp_sprintf time for '%e l': 34656
stbsp_sprintf2 time for '%e l': 34839

sprintf time for '%e b': 111196
sprintf time for '%e b': 111247
stbsp_sprintf time for '%e b': 34471
stbsp_sprintf2 time for '%e b': 33943

sprintf time for '%.320f': 495866
sprintf time for '%.320f': 468650
stbsp_sprintf time for '%.320f': 22945
stbsp_sprintf2 time for '%.320f': 22968

sprintf time for '%a': 51489
sprintf time for '%a': 54392
stbsp_sprintf time for '%a': 17358
stbsp_sprintf2 time for '%a': 17036

sprintf time for '%I64d': 80013
sprintf time for '%I64d': 79855
stbsp_sprintf time for '%I64d': 25714
stbsp_sprintf2 time for '%I64d': 28589

sprintf time for '%I64d': 83257
sprintf time for '%I64d': 83446
stbsp_sprintf time for '%I64d': 16415
stbsp_sprintf2 time for '%I64d': 17243

sprintf time for '%s%s%s': 10314
sprintf time for '%s%s%s': 9861
stbsp_sprintf time for '%s%s%s': 5652
stbsp_sprintf2 time for '%s%s%s': 4904

sprintf time for '0123456789012345678901234567890123456789012345678901234567890123%d': 234932
sprintf time for '0123456789012345678901234567890123456789012345678901234567890123%d': 235027
stbsp_sprintf time for '0123456789012345678901234567890123456789012345678901234567890123%d': 10493
stbsp_sprintf2 time for '0123456789012345678901234567890123456789012345678901234567890123%d': 14550

I did other runs to narrow down the noise. The meaningful differences are on those three tests:

stbsp_sprintf time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 2540
stbsp_sprintf time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 2547
stbsp_sprintf time for '0123456789012345678901234567890123456789012345678901234567890123%d': 10275

stbsp_sprintf2 time for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring': 6681
stbsp_sprintf2 time for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring': 6596
stbsp_sprintf2 time for '0123456789012345678901234567890123456789012345678901234567890123%d': 14625

Difference basically amount to the code that copies the unformatted chunks of text until finding a %.

Technically speaking the tests don't spend too much time measuring parsing of many options/flags. I believe my changes would make that part also slower but in the grand scheme of this that time should be not important compared to A) the dumb-copy throughput, and B) general formatted output (for which the code is unchanged)

So basically it comes down to optimize that one loop:

Original

// fast copy everything up to the next % (or end of string)
for (;;) {
 while (((stbsp__uintptr)f) & 3) {
 schk1:
    if (f[0] == '%')
       goto scandd;
 schk2:
    if (f[0] == 0)
       goto endfmt;
    stbsp__chk_cb_buf(1);
    *bf++ = f[0];
    ++f;
 }
 for (;;) {
    // Check if the next 4 bytes contain %(0x25) or end of string.
    // Using the 'hasless' trick:
    // https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
    stbsp__uint32 v, c;
    v = *(stbsp__uint32 *)f;
    c = (~v) & 0x80808080;
    if (((v ^ 0x25252525) - 0x01010101) & c)
       goto schk1;
    if ((v - 0x01010101) & c)
       goto schk2;
    if (callback)
       if ((STB_SPRINTF_MIN - (int)(bf - buf)) < 4)
          goto schk1;
    #ifdef STB_SPRINTF_NOUNALIGNED
        if(((stbsp__uintptr)bf) & 3) {
            bf[0] = f[0];
            bf[1] = f[1];
            bf[2] = f[2];
            bf[3] = f[3];
        } else
    #endif
    {
        *(stbsp__uint32 *)bf = v;
    }
    bf += 4;
    f += 4;
 }
}
scandd:

My initial version (slower)

// fast copy everything up to the next % (or end of string)
for (;;) {
 while (((stbsp__uintptr)f) & 3) {
 schk1:
    if (f == f_end)
       goto endfmt;
 schk2:
    if (f[0] == '%')
       goto scandd;
    stbsp__chk_cb_buf(1);
    *bf++ = f[0];
    ++f;
 }
 for (;;) {
    // Check if the next 4 bytes contain %(0x25) or end of string.
    // Using the 'hasless' trick:
    // https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
    stbsp__uint32 v, c;
    v = *(stbsp__uint32 *)f;
    c = (~v) & 0x80808080;
    if (((v ^ 0x25252525) - 0x01010101) & c)
       goto schk2;
    if (f_end - f < 4)
        goto schk1;
    if (callback)
       if ((STB_SPRINTF_MIN - (int)(bf - buf)) < 4)
          goto schk1;
    #ifdef STB_SPRINTF_NOUNALIGNED
        if(((stbsp__uintptr)bf) & 3) {
            bf[0] = f[0];
            bf[1] = f[1];
            bf[2] = f[2];
            bf[3] = f[3];
        } else
    #endif
    {
        *(stbsp__uint32 *)bf = v;
    }
    bf += 4;
    f += 4;
 }
}
scandd:

I'll see if I can come up with an optimization...

ocornut avatar Jul 28 '21 13:07 ocornut

The way you swapped the goto-Labels and the two checks in the first loop, we won't check for end-of-string anymore, if the "if (((v ^ 0x25252525) - 0x01010101) & c)" branch is taken. Is this intended?

simonvanbernem avatar Jul 28 '21 13:07 simonvanbernem

There are two cases of performance change to look at:

  1. How much slower is the non-NUL-terminated new version vs. the old version of the library.
  2. How much slower is the NUL-terminated new version vs. the old version of the library.

I imagine for your personal use, you kind of care about case 1, but not THAT much since it's more a question of performance vs. your alternatives; NUL-terminated isn't an option for the new version. And you can always embed it into ImGui regardless so you don't need to worry about regressing stb_sprintf.h performance.

For the possibility of integrating this into stb_sprintf itself, case 2 matters, because we don't want to cause a significant performance loss for people using this library from C/C++, which is the primary use case.

I'm not 100% sure which case you're testing here, especially as your initial description seemed addressed to the possibility of embedding it rather than making a shippable PR? But presumably the test harness would be easier to adapt using the NUL interface.

So anyway, if you are testing case 2, another possibility is you're at least partly looking at the overhead of scanning the string twice; once for strlen() and once for non-NUL-sprintf(). I'm not 100% sure how realistic any of these tests are since the string is going to be fully in cache after the first iteration, but for the hot-case I would guess it's plausible that iterating the string twice could be a significant cost. If that IS the case (I guess I wouldn't expect it to triple the time though), one possible workaround would be to have the outer layer of stb_sprintf, which contains the strlen, to scan & copy up to the first % before doing the strlen and then hand off to sprintf. (Actually, should handle %% as well.) This wouldn't cover cases where there's a lot of unformatted text AFTER the first formatted item, but I don't see a good way to address that case and this would be pretty easy to do.

nothings avatar Jul 28 '21 13:07 nothings

BTW, it looks like the tests don't have any cases like "%6.2f", which is the sort of thing I tend to use in profilers so might be an important case for you. Probably not significantly affected by the NUL stuff though so not worth worrying about here, but maybe worth testing against regular sprintf.

nothings avatar Jul 28 '21 13:07 nothings

The way you swapped the goto-Labels and the two checks in the first loop, we won't check for end-of-string anymore, if the "if (((v ^ 0x25252525) - 0x01010101) & c)" branch is taken. Is this intended?

Right, if within the same 4 bytes there are both a % and end of string, the first iteration won't check for end-of-string, so I guess the goto schk2 can be changed to goto schk1, my bad.

I'm not 100% sure which case you're testing here, especially as your initial description seemed addressed to the possibility of embedding it rather than making a shippable PR? But presumably the test harness would be easier to adapt using the NUL interface.

In the current test I'm adding a strlen() so I'm essentially testing case 2, as the assumption is that we'd only have one version of the inner-loop which would function with an end pointer. Of course that's only a viable assumption is we can make keep it fast enough.

(Case 1 will always be a little faster than case 2 for the stb_printf call tree since there's no strlen)

Thanks for the suggestion, that's a good idea. First I'm going to see if I can make the existing loop as fast as previous version.

BTW, it looks like the tests don't have any cases like "%6.2f", which is the sort of thing I tend to use in profilers so might be an important case for you

Right. Adding to my test cases. As an indication:

sprintf       time for '%6.2f': 54103
stbsp_printf  time for '%6.2f': 16920
stbsp_printf2 time for '%6.2f': 17275

EDIT: Using stbsp__strlen() (simplified version of stbsp__strlen_limited() instead of strlen yield measurable wins. original zero-terminated

stbsp_sprintf time 2540 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2547 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 10275 for '0123456789012345678901234567890123456789012345678901234567890123%d'

vs new zero-terminated using strlen + non-zero terminated code

stbsp_sprintf2 time 3956 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf2 time 3968 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf2 time 11679 for '0123456789012345678901234567890123456789012345678901234567890123%d' 

ocornut avatar Jul 28 '21 15:07 ocornut

EDIT: Using stbsp__strlen() (simplified version of stbsp__strlen_limited() instead of strlen yield measurable wins. original zero-terminated

Kind of depressing since it means strlen() on whichever platform you're using isn't already doing four-at-a-time (or even SIMD).

nothings avatar Jul 28 '21 16:07 nothings

Kind of depressing since it means strlen() on whichever platform you're using isn't already doing four-at-a-time (or even SIMD).

VS2015 and VS2019 "Release" build are inlining a dumb version when intrinsic are enabled....

int fmt_len = strlen(fmt);// stbsp__strlen(fmt);
00007FF75C8445EA 49 C7 C2 FF FF FF FF mov         r10,0FFFFFFFFFFFFFFFFh  
00007FF75C8445F1 49 FF C2             inc         r10  
00007FF75C8445F4 43 80 3C 10 00       cmp         byte ptr [r8+r10],0  
00007FF75C8445F9 75 F6                jne         stbsp_vsnprintf+21h (07FF75C8445F1h)  

Confirmed.... https://developercommunity.visualstudio.com/t/it-seems-that-visual-studios-cpp-compiler-creates/737594 https://developercommunity.visualstudio.com/t/msconnect-1218177-inlined-intrinsic-strlen-is-not/204272

EDIT: Amusingly it means that in "Debug" build even with the additional jumps hoops we end up in a faster version within ucrtbased.dll.... (it does it 8 at a times on my config)

ocornut avatar Jul 28 '21 16:07 ocornut

I duplicated the copying loop (~42 lines) to provide both a zero-terminated and a pointer-terminated version.

fmt_end gets set to ~0 when using a zero-terminated function. The flag parsing code always test for f < fmt_end.

Technically speaking we could provide two versions of the whole function but this seems like a good compromise.

// v0 (vanilla stb_printf, zero-terminated input)
stbsp_sprintf  time 2553  for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf  time 2550  for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf   time 6596  for '%6.2f'
stbsp_sprintf  time 10379 for '0123456789012345678901234567890123456789012345678901234567890123%d'

// v5 (dual path, zero-terminated input)
stbsp_sprintf  time 2610 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf  time 2604 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf   time 6606 for '%6.2f'
stbsp_sprintf  time 9681 for '0123456789012345678901234567890123456789012345678901234567890123%d'

// v5 (dual path, not zero-terminated input, fmt_end provided by caller)
stbsp_sprintf  time 2493 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf  time 2493 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf   time 6634 for '%6.2f'
stbsp_sprintf  time 10558 for '0123456789012345678901234567890123456789012345678901234567890123%d'

For reference, those two earlier versions are giving an indication of the cost of the extra stbsp_strlen() done everytime:

// v3 (zero-terminated input + code supporting non-zero-terminated + function always calling our strlen)
stbsp_sprintf time 3970 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 3970 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf  time 6658 for '%6.2f'
stbsp_sprintf time 11809 for '0123456789012345678901234567890123456789012345678901234567890123%d'

// v4 (non-zero terminated input: with fmt_end provided by caller)
stbsp_sprintf time 2470 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2467 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_printf  time 6642 for '%6.2f'
stbsp_sprintf time 9461 for '0123456789012345678901234567890123456789012345678901234567890123%d'

Difference between v3 and v4 is only that v4 is providing fmt_end at the call site (doing strlen once).

Will run entire test suite again but rough idea is that the typical zero-terminated path may generally be 3-5% slower with that change, while providing non-zero terminated support. Considering that stb_printf claims x35 speed compared to VS on long strings, and x5 to x12 on various data types, the +5% loss may be ok? (other solution is to compile-time define all of it)

ocornut avatar Jul 28 '21 17:07 ocornut

Looking pretty close. I notice v5 non-nul terminated is now slower than the nul-teriminated for the unformatted strings case, is this still using your "initial version" of the copying loop for the fmt_end case, or have you made further optimizations to it?

nothings avatar Jul 28 '21 17:07 nothings

the typical zero-terminated path may generally be 3-5% slower with that change, while providing non-zero terminated support. Considering that stb_printf claims x35 speed compared to VS on long strings, and x5 to x12 on various data types, the +5% loss may be ok?

I think I'd be generally ok with the perf lost for the feature gain, but some thoughts:

  • my initial reaction was that this is a pretty frivolous feature and not worth a perf loss ever (given you could just embed your version yourself), but on reflection it's possibly interesting to have access to C-style printf from other languages AT ALL. In the early days of stb_image, one of the first hits for "stb_image" on google was a Haskell wrapper.
  • my one caveat to the perf loss being acceptable is it only makes sense if there's not a lot of new code (i'm ok with changes). so two versions of a 42-line copy loop is a bit awkward, as is having our own strlen implementation. OTOH if we can get the slowdown closer to 0% I'd be willing to accept a bigger increase in size.
  • but maybe if I give it some time to reflect on it, I'll be ok with it as is anyway

nothings avatar Jul 28 '21 17:07 nothings

VS2015 and VS2019 "Release" build are inlining a dumb version when intrinsic are enabled....

This got me curious so I did a little digging. Short version: Turns out Microsoft only optimized strnlen for some reason, might be worth trying it in the context of this PR.


For some reason Microsoft is still using frumpy old hand-written assembly for their strlen implementation. (You can read the x64 version in C:\Program Files (x86)\Windows Kits\10\Source\10.0.19041.0\ucrt\string\amd64\strlen.asm)

Interestingly, strnlen is written in C++ and has alternate paths for AVX2 and SSE2. (C:\Program Files (x86)\Windows Kits\10\Source\10.0.19041.0\ucrt\string\strnlen.cpp)

I did a quick and dirty microbenchmark with strnlen imitating strlen (IE: strnlen(str, ((size_t)-1) - ((size_t)str));), and indeed strnlen is around 10 times faster in x64 release builds:

Testing with 55316 character string with 1000000 iterations...
 strlen: 55316000000, 55316, time: 149116311 ticks total, 14911.631100 ns per iteration
strnlen: 55316000000, 55316, time: 14182812 ticks total, 1418.281200 ns per iteration
strnlen is 10.513875 faster than strlen
(Funnily enough, `strlen` still wins in debug builds.)
Testing with 55316 character string with 1000000 iterations...
 strlen: 55316000000, 55316, time: 37479447 ticks total, 3747.944700 ns per iteration
strnlen: 55316000000, 55316, time: 142507789 ticks total, 14250.778900 ns per iteration
strnlen is 0.262999 faster than strlen

PathogenDavid avatar Jul 28 '21 18:07 PathogenDavid

as is having our own strlen implementation.

That part got actually removed in v5, since the zero-terminating case doesn't need it anymore. (fyi stb_printf already had stbsp__strlen_limited(), the one I added in v2 was a trimmed down version)

Quick stats

Current version goes from 1477 to 1544 lines (+67 lines) and introduce two new symbols vsprintfcbnz and sprintfnz. The later sprintfnz I only added to make it easier for benchmark (adds +11 lines because of the va_list cruft). We could either decide to only provide the new parameter in the "most-complete" function vsprintfcbnz() or add **nz versions to all of them.

Your question on the loop:

The non-zerminated loop has changed from initial version

            // Check if the next 4 bytes contain %(0x25) or end of string.
            // Using the 'hasless' trick:
            // https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
            stbsp__uint32 v, c;
            v = *(stbsp__uint32 *)f;
            c = (~v) & 0x80808080;
            if (((v ^ 0x25252525) - 0x01010101) & c)
               goto schk2;
            if (f_end - f < 4)
               goto schk1;
            if (callback)
               if ((STB_SPRINTF_MIN - (int)(bf - buf)) < 4)
                  goto schk2;

to

            // Check if the next 4 bytes contain %(0x25) or end of string.
            // Using the 'hasless' trick:
            // https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
            if (f_end - f < 4)
               goto schk1;
            stbsp__uint32 v, c;
            v = *(stbsp__uint32 *)f;
            c = (~v) & 0x80808080;
            if (((v ^ 0x25252525) - 0x01010101) & c)
               goto schk1;
            if (callback)
               if ((STB_SPRINTF_MIN - (int)(bf - buf)) < 4)
                  goto schk1; 

As I could reliable measure a difference.

I can't well explain the result there (notice the last one has a %d at the end), but the measurements comes back generally the same every time I run them.

// v0 (vanilla)
stbsp_sprintf time 2556 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2551 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'

// v5 (dual path, zero-terminated input)
stbsp_sprintf  time 2610 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf  time 2604 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf  time 9681 for '0123456789012345678901234567890123456789012345678901234567890123%d'

// v5 (dual path, not zero-terminated input, fmt_end provided by caller)
stbsp_sprintf  time 2493 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf  time 2493 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf  time 10558 for '0123456789012345678901234567890123456789012345678901234567890123%d'

Adding a %d in two first long strings:

// v0 (vanilla)
stbsp_sprintf time 2974 for 'stringstringstringstringstring%dstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2970 for 'strin stringstring tringstring tringstring%d tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'

// v5 (dual path, zero-terminated input)
stbsp_sprintf time 2999 for 'stringstringstringstringstring%dstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2971 for 'strin stringstring tringstring tringstring%d tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'

// v5 (dual path, not zero-terminated input, fmt_end provided by caller)
stbsp_sprintf time 2992 for 'stringstringstringstringstring%dstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2962 for 'strin stringstring tringstring tringstring%d tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'

ocornut avatar Jul 28 '21 18:07 ocornut

Current version goes from 1477 to 1544 lines (+67 lines) and introduce two new symbols vsprintfcbnz and sprintfnz.

Sounds reasonable. (I wish we could get by with just the strlen() in front and all the code shared, but clearly that isn't happening.)

The later sprintfnz I only added to make it easier for benchmark (adds +11 lines because of the va_list cruft). We could either decide to only provide the new parameter in the "most-complete" function vsprintfcbnz() or add **nz versions to all of them.

I'm ok with having nz versions for all three functions. But there are five I think, so that brings us to another issue. I'd recommend only implementing versions of nz functions that take an explicit output length, but moreover, they probably should write nz output. So we probably need a way to return that output length?

nothings avatar Jul 28 '21 18:07 nothings

I'd recommend only implementing versions of nz functions that take an explicit output length, but moreover, they probably should write nz output. So we probably need a way to return that output length?

Don't we already always return output length? (not sure I understand all the callback stuff and specs when not fitting). EDIT

  char buf[256];
  int n;
  n = stbsp_snprintf(buf, 10, "hellohellohello");
  printf("%d\n", n);
  printf("'%s'\n", buf);
15
'hellohell'

Right, so I think it'd be great to add for completeness. But now I am getting paranoid about adding parameters (read below).


And this is what a compile-time option version would roughly look like https://github.com/nothings/stb/pull/1181/commits/60a999d14a80d34aa6d14c4bd6a5843d4afc5473


( Except currently I am running into a hard to understand thing were the compiler starts optimizing differently based on something I don't understand. Spent the last hour combing through it, but basically if I revert to vanilla version and add int fmt_len parameter (UNUSED) + one replacement wrapping function:

STBSP__PUBLICDEC int STB_SPRINTF_DECORATE(vsprintfcbnz)(STBSP_SPRINTFCB* callback, void* user, char* buf, char const* fmt, int /*fmt_len*/, va_list va); + vsprintfcb() now calling vsprintfcbnz()

Then my numbers for some tests are instantly slower by 20%...

If I move fmt_len to first parameter I am back at the original speed. I assume there's some code reuse or tail-call thing happening with matching stack layout, but I'm surprised by the large difference and I don't really understand it. [Note that I am currently technically compiling in C++]

And so it happens on vanilla codebase but also adding that #ifdef stuff over v5 made it happen and I'm super puzzled.

EDIT: I was wondering why callback, user parameters were the first and not last ones in vsprintfcb() and I wonder if it is due to the same thing. )

ocornut avatar Jul 28 '21 20:07 ocornut

I'd recommend only implementing versions of nz functions that take an explicit output length, but moreover, they probably should write nz output. So we probably need a way to return that output length?

Don't we already always return output length? (not sure I understand all the callback stuff and specs when not fitting).

Oh yes, of course we do always return output length.

However, it probably includes the length of the terminating NUL. We should presumably engineer the nz API so it (a) returns the number of non-NUL characters, (b) doesn't write the NUL, and (c) if you give it a buffer of length 10, it will write 10 non-NUL characters to it. (We can presumably do that by lying about the buffer length to ourselves and then suppressing the terminating NUL.)

Right, so I think it'd be great to add for completeness. But now I am getting paranoid about adding parameters (read below).

I think we changed the behavior of this, I think it used to return the length written, but for compatibility with other sprintfs it was changed to always return the full required length? Hmm, no, that seems the opposite of what other sprintfs do. Pretty sure it changed at some point, or at least that people have complained about this.

But the nz versions can do their own thing regardless. Although they could return the full required length and let the caller do a min() on the passed-in length, we can also just change the semantics for nz and have it only give the full length on passing in NUL.

I think the argument for the current behavior is the scenario where you have a small buffer on the stack, and you'll malloc a large buffer if it's not enough. You want to call sprintf once with the stack, and if that's enough you're done. If that's not enough, you don't want to make ANOTHER sprintf call to find out the required length, malloc, and then make a THIRD sprintf call if you can do it all with two calls. (This sounds like a Jeff argument to me.)

nothings avatar Jul 28 '21 20:07 nothings

Current behavior seems fine, and I recall the thread arguing for matching standard (I probably voted for that).

IMHO the only value/purpose of adding nz output support would to save 1 character and allow users to retrieve a 10-long string when providing a 10-long buffer.

It's possible but seems like it would add unnecessary complexity/confusion to the whole mix?

When wrapped to other languages I don't assume many (if any) languages would care about user providing precisely size buffers, and expecting the last char to be used. In those languages people are more likely to wrap things. Actually even in C++ I'd expect people to wrap stb_printf if they use it with fancy non-zero terminated string types, so the user-specified size is not going to ever be a hard requirement ("I need ten digits to fit in my char[10]") ? Also we'd not be breaking anything by not adding that.

ocornut avatar Jul 28 '21 20:07 ocornut

I'm thinking of systems languages like Rust or whatever, where maybe people super do pay attention to the sizes of things. I'm concerned about somebody going "I'm going to print this 16-bit integer into a 5-byte buffer". You can't fix this by wrapping it, you have to specify "you must pass in a buffer one larger than you actually want".

nothings avatar Jul 28 '21 20:07 nothings

Can we make the core function not append a NUL, and have the NUL-supporting "wrapper" (that used to do a strlen) add the NUL? And then one or the other of the wrappers can add or subtract one from the buffer length as needed. (It would be semantically clear if the lib was purely non-NUL-outputting with a no-NUL buffer size, and the NUL-supporting wrapper subtracts one from the buffer size.)

nothings avatar Jul 28 '21 20:07 nothings

Will look into that (may do a separate PR to be able to get that one out of the way soon).

ocornut avatar Jul 28 '21 20:07 ocornut

FYI results from all tests (VS2019, x64, /Ox)

// std sprintf
sprintf       time 152775 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
sprintf       time 152651 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
sprintf       time 68     for '%d'
sprintf       time 171346 for '%24d'
sprintf       time 82907  for '%x'
sprintf       time 137956 for '%08x'
sprintf       time 76285  for '%f'
sprintf       time 77539  for '%e'
sprintf       time 78729  for '%g'
sprintf       time 125701 for '%f l'
sprintf       time 507478 for '%f b'
sprintf       time 129690 for '%e l'
sprintf       time 111196 for '%e b'
sprintf       time 495866 for '%.320f'
sprintf       time 51489  for '%a'
sprintf       time 80013  for '%I64d'
sprintf       time 83257  for '%I64d'
sprintf       time 10314  for '%s%s%s'
sprintf       time 234932 for '0123456789012345678901234567890123456789012345678901234567890123%d'

// v0
stbsp_sprintf time 2554  for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2555  for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 33    for '%d'
stbsp_sprintf time 40001 for '%24d'
stbsp_sprintf time 27725 for '%x'
stbsp_sprintf time 30809 for '%08x'
stbsp_printf  time 33505 for '%f'
stbsp_printf  time 34493 for '%e'
stbsp_printf  time 33246 for '%g'
stbsp_printf  time 6588  for '%6.2f'
stbsp_sprintf time 31861 for '%f l'
stbsp_sprintf time 25672 for '%f b'
stbsp_sprintf time 34911 for '%e l'
stbsp_sprintf time 34951 for '%e b'
stbsp_sprintf time 22265 for '%.320f'
stbsp_sprintf time 17445 for '%a'
stbsp_sprintf time 24924 for '%I64d'
stbsp_sprintf time 15995 for '%I64d'
stbsp_sprintf time 5756  for '%s%s%s'
stbsp_sprintf time 10420 for '0123456789012345678901234567890123456789012345678901234567890123%d'

// v5 (zero-terminated)
stbsp_sprintf time 2597 for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2596 for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 27    for '%d'
stbsp_sprintf time 40363 for '%24d'
stbsp_sprintf time 28935 for '%x'
stbsp_sprintf time 32287 for '%08x'
stbsp_printf  time 33717 for '%f'
stbsp_printf  time 34429 for '%e'
stbsp_printf  time 33507 for '%g'
stbsp_printf  time 6610  for '%6.2f'
stbsp_sprintf time 33054 for '%f l'
stbsp_sprintf time 29306 for '%f b'
stbsp_sprintf time 35913 for '%e l'
stbsp_sprintf time 34340 for '%e b'
stbsp_sprintf time 25474 for '%.320f'
stbsp_sprintf time 17332 for '%a'
stbsp_sprintf time 25826 for '%I64d'
stbsp_sprintf time 16131 for '%I64d'
stbsp_sprintf time 5826  for '%s%s%s'
stbsp_sprintf time 9720  for '0123456789012345678901234567890123456789012345678901234567890123%d'

// v5 (non zero-terminated, length of "format" calculated once at call-site)
stbsp_sprintf time 2496  for 'stringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstringstring'
stbsp_sprintf time 2491  for 'strin stringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring tringstring'
stbsp_sprintf time 26    for '%d'
stbsp_sprintf time 42418 for '%24d'
stbsp_sprintf time 29202 for '%x'
stbsp_sprintf time 32067 for '%08x'
stbsp_printf  time 33865 for '%f'
stbsp_printf  time 34555 for '%e'
stbsp_printf  time 33557 for '%g'
stbsp_printf  time 6659  for '%6.2f'
stbsp_sprintf time 32878 for '%f l'
stbsp_sprintf time 29036 for '%f b'
stbsp_sprintf time 35627 for '%e l'
stbsp_sprintf time 33958 for '%e b'
stbsp_sprintf time 25323 for '%.320f'
stbsp_sprintf time 17255 for '%a'
stbsp_sprintf time 26509 for '%I64d'
stbsp_sprintf time 16606 for '%I64d'
stbsp_sprintf time 5871  for '%s%s%s'
stbsp_sprintf time 10502 for '0123456789012345678901234567890123456789012345678901234567890123%d'

ocornut avatar Jul 28 '21 21:07 ocornut

I put it in a spreadsheet. Should look at those 0.87/0.88 x slowdowns and see if they're repeatable, because it shouldn't be spending much time parsing the format string. ("speedup" columns are the speedup going from the column immediately on the left to the column immediately on the right.)

sprintf v0 speedup stbsp v0 v5 speedup stbsp v5 fmt
152775 59.82 2554 0.98 2597 for 'stringstringstringstringstring..'.
152651 59.75 2555 0.98 2596 for 'strin stringstring tringstring...'
68 2.06 33 1.22 27 for '%d'
171346 4.28 40001 0.99 40363 for '%24d'
82907 2.99 27725 0.96 28935 for '%x'
137956 4.48 30809 0.95 32287 for '%08x'
76285 2.28 33505 0.99 33717 for '%f'
77539 2.25 34493 1.00 34429 for '%e'
78729 2.37 33246 0.99 33507 for '%g'
6588 1.00 6610 for '%6.2f'
125701 3.95 31861 0.96 33054 for '%f l'
507478 19.77 25672 0.88 29306 for '%f b'
129690 3.71 34911 0.97 35913 for '%e l'
111196 3.18 34951 1.02 34340 for '%e b'
495866 22.27 22265 0.87 25474 for '%.320f'
51489 2.95 17445 1.01 17332 for '%a'
80013 3.21 24924 0.97 25826 for '%I64d'
83257 5.21 15995 0.99 16131 for '%I64d'
10314 1.79 5756 0.99 5826 for '%s%s%s'
234932 22.55 10420 1.07 9720 for '012345678901234567890123456...%d'

nothings avatar Jul 28 '21 22:07 nothings

Jeff suggests that the 0.87 "speedup" for '%.320f' can't possibly have anything to do with the time spent scanning the format string, so it must be due to some unrelated change like alignment or optimizations.

nothings avatar Aug 10 '21 22:08 nothings

(Sorry been in holidays and put that aside, will get back to it but I’d rather be on same computer to follow up on that)

ocornut avatar Aug 11 '21 07:08 ocornut

Ok, clearly "rrsprintfspeed.c" is a speed test (rrsprintf is the RAD-internal name for stb_sprintf)

Any chance that this file could be made public?

stefano-zanotti-88 avatar Apr 24 '25 15:04 stefano-zanotti-88