STL icon indicating copy to clipboard operation
STL copied to clipboard

<format>: Improve performance

Open StephanTLavavej opened this issue 4 years ago • 14 comments

Reported by @jovibor - here are examples where <format> is 20% to 70% slower than libfmt:

https://github.com/microsoft/STL/issues/1800#issuecomment-812815620 https://github.com/microsoft/STL/issues/1800#issuecomment-812838596

StephanTLavavej avatar Apr 03 '21 21:04 StephanTLavavej

The issue is probably caused by not preallocating any memory for the output string. I think reserving the size of the format string would be a good guess.

AdamBucior avatar Apr 05 '21 15:04 AdamBucior

The issue is probably caused by not preallocating any memory for the output string

Shouldn't the strings from the benchmark fit into the Small String Buffer and not need allocation at all?

I think reserving the size of the format string would be a good guess.

At least for simple formaters, I think this is almost guaranteed to be too short {} is only two bytes. Integral numbers only fit in that for very small values, floating point will essentially always require more and if the argument is a string or some more complicated type just two chars will also almost never be enough.

Not that I have a good counterproposal, but usually, it is better to overallocate by a few bytes instread of allocating just not enough and having to reallocate.

Any idea what fmt does? Are you even allowed to look at the fmt source code?

MikeGitb avatar Apr 05 '21 17:04 MikeGitb

The "interpolated strings" feature in C#, which is extremely similar to <format> is lowered by the compiler in terms of the runtime's implementation of String.Format(...), which in turn defers to FormatHelper, which speculatively allocates <format string length> + <number of args> * 8.

Maybe it's worth finding out whether this is an informed guess, the result of analyzing the runtime behaviour of existing codebases, or just 🎩✨🐇 ¯_(ツ)_/¯

ZaldronGG avatar Apr 05 '21 17:04 ZaldronGG

Shouldn't the strings from the benchmark fit into the Small String Buffer and not need allocation at all?

They were wstrings which have half the Small String Buffer of string.

which speculatively allocates <format string length> + <number of args> * 8.

Taking number of args into account might also be a good idea. Maybe we could even filter string(_view) arguments from args and take their lengths into account.

AdamBucior avatar Apr 05 '21 19:04 AdamBucior

This is definitely related to the formatter string size. When string is too long (avoiding SSO) the results for std::format become even worse, it's about two times slower than fmt. The difference between wstring and string formatters is also significant, with the latter being faster.

Here the full test code with format_to and swprintf included.

jovibor avatar Apr 05 '21 22:04 jovibor

I'm still a bit concerned about performance. After all the recent changes, the difference is as follows: image

The std::format_to is almost on pair with fmt::format_to. While std::format is more than 3 times slower than its fmt counterpart.

Microsoft Visual Studio 16.10.0 Preview 2.1 /O2 /utf-8 std::format version: 02ae70a

jovibor avatar May 01 '21 02:05 jovibor

Can you share your benchmark code?

sylveon avatar May 01 '21 19:05 sylveon

Can you share your benchmark code?

The link is one post above.

jovibor avatar May 01 '21 21:05 jovibor

Thanks, after looking at it through a profiler, we are spending an absurd amount of time in _Fmt_iterator_buffer<back_insert_iterator<wstring>>::_Grow. If I replace the format_to benchmark to use a back_insert_iterator, I am seeing a similar amounts of time.

sylveon avatar May 02 '21 02:05 sylveon

#1894 should resolve this

sylveon avatar May 02 '21 05:05 sylveon

performance is really terrible .

fmt 8.1.0 Visual Studio 17.1 preview 3

#define _CRT_SECURE_NO_WARNINGS
#include <cstdio>
#include <format>
#include <benchmark/benchmark.h>
#include <fmt/format.h>

void std_sprintf(benchmark::State& s) {
    for (auto _ : s) {
        char buffer[12];
        unsigned size = std::sprintf(buffer, "%d", 42);
    }
}
BENCHMARK(std_sprintf);

void std_format(benchmark::State& s) {
    for (auto _ : s) {
        auto str = std::format("{}", 42);
    }
}
BENCHMARK(std_format);

void std_format_2(benchmark::State& s) {
    for (auto _ : s) {
        auto str = std::format("{:d}", 42);
    }
}
BENCHMARK(std_format_2);

void fmt_format(benchmark::State& s) {
    for (auto _ : s)
        auto str = fmt::format("{}", 42);
}
BENCHMARK(fmt_format);

void fmt_format_2(benchmark::State& s) {
    for (auto _ : s)
        auto str = fmt::format("{:d}", 42);
}
BENCHMARK(fmt_format_2);

void std_format_to(benchmark::State& s) {
    std::string buffer;
    buffer.reserve(12);
    for (auto _ : s) {
        auto str = std::format_to(std::back_inserter(buffer), "{}", 42);
    }
}
BENCHMARK(std_format_to);

void std_format_to_2(benchmark::State& s) {
    std::string buffer;
    buffer.reserve(12);
    for (auto _ : s) {
        auto str = std::format_to(std::back_inserter(buffer), "{:d}", 42);
    }
}
BENCHMARK(std_format_to_2);

void fmt_format_to(benchmark::State& s) {
    std::string buffer;
    buffer.reserve(12);
    for (auto _ : s)
        auto str = fmt::format_to(std::back_inserter(buffer), "{}", 42);
}
BENCHMARK(fmt_format_to);

void fmt_format_to_2(benchmark::State& s) {
    std::string buffer;
    buffer.reserve(12);
    for (auto _ : s)
        auto str = fmt::format_to(std::back_inserter(buffer), "{:d}", 42);
}
BENCHMARK(fmt_format_to_2);

BENCHMARK_MAIN();
Running bench.exe
Run on (4 X 2712 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
std_sprintf           63.5 ns         56.2 ns     10000000
std_format            3475 ns         3223 ns       213333
std_format_2          7274 ns         6719 ns       100000
fmt_format            41.2 ns         39.3 ns     18666667
fmt_format_2          85.4 ns         83.7 ns      7466667
std_format_to         3398 ns         3209 ns       224000
std_format_to_2       7051 ns         6417 ns       112000
fmt_format_to         34.8 ns         33.7 ns     19478261
fmt_format_to_2       62.6 ns         60.9 ns     10000000

crackedmind avatar Jan 24 '22 00:01 crackedmind

@crackedmind Would you mind benchmarking wide-string counterparts (L"") as well?

Also, it was mentioned previously to compile with /utf-8 compiler flag, to improve performance.

jovibor avatar Jan 24 '22 01:01 jovibor

/utf-8 makes an ENORMOUS difference, we're working on spreading that effect to other encodings.

Benchmark code:
#define _CRT_SECURE_NO_WARNINGS
#include <cstdio>
#include <format>
#include <benchmark/benchmark.h>
#include <fmt/format.h>
#include <fmt/xchar.h>

void std_sprintf(benchmark::State& s) {
    for (auto _ : s) {
        char buffer[12];
        unsigned size = std::sprintf(buffer, "%d", 42);
    }
}
BENCHMARK(std_sprintf);

void std_format(benchmark::State& s) {
    for (auto _ : s) {
        auto str = std::format("{}", 42);
    }
}
BENCHMARK(std_format);

void std_format_2(benchmark::State& s) {
    for (auto _ : s) {
        auto str = std::format("{:d}", 42);
    }
}
BENCHMARK(std_format_2);

void fmt_format(benchmark::State& s) {
    for (auto _ : s)
        auto str = fmt::format("{}", 42);
}
BENCHMARK(fmt_format);

void fmt_format_2(benchmark::State& s) {
    for (auto _ : s)
        auto str = fmt::format("{:d}", 42);
}
BENCHMARK(fmt_format_2);

void std_format_to(benchmark::State& s) {
    std::string buffer;
    buffer.reserve(12);
    for (auto _ : s) {
        auto str = std::format_to(std::back_inserter(buffer), "{}", 42);
    }
}
BENCHMARK(std_format_to);

void std_format_to_2(benchmark::State& s) {
    std::string buffer;
    buffer.reserve(12);
    for (auto _ : s) {
        auto str = std::format_to(std::back_inserter(buffer), "{:d}", 42);
    }
}
BENCHMARK(std_format_to_2);

void fmt_format_to(benchmark::State& s) {
    std::string buffer;
    buffer.reserve(12);
    for (auto _ : s)
        auto str = fmt::format_to(std::back_inserter(buffer), "{}", 42);
}
BENCHMARK(fmt_format_to);

void fmt_format_to_2(benchmark::State& s) {
    std::string buffer;
    buffer.reserve(12);
    for (auto _ : s)
        auto str = fmt::format_to(std::back_inserter(buffer), "{:d}", 42);
}
BENCHMARK(fmt_format_to_2);

void std_swprintf(benchmark::State& s) {
    for (auto _ : s) {
        wchar_t buffer[12];
        unsigned size = std::swprintf(buffer, L"%d", 42);
    }
}
BENCHMARK(std_swprintf);

void std_wformat(benchmark::State& s) {
    for (auto _ : s) {
        auto str = std::format(L"{}", 42);
    }
}
BENCHMARK(std_wformat);

void std_wformat_2(benchmark::State& s) {
    for (auto _ : s) {
        auto str = std::format(L"{:d}", 42);
    }
}
BENCHMARK(std_wformat_2);

void fmt_wformat(benchmark::State& s) {
    for (auto _ : s)
        auto str = fmt::format(L"{}", 42);
}
BENCHMARK(fmt_wformat);

void fmt_wformat_2(benchmark::State& s) {
    for (auto _ : s)
        auto str = fmt::format(L"{:d}", 42);
}
BENCHMARK(fmt_wformat_2);

void std_wformat_to(benchmark::State& s) {
    std::wstring buffer;
    buffer.reserve(12);
    for (auto _ : s) {
        auto str = std::format_to(std::back_inserter(buffer), L"{}", 42);
    }
}
BENCHMARK(std_wformat_to);

void std_wformat_to_2(benchmark::State& s) {
    std::wstring buffer;
    buffer.reserve(12);
    for (auto _ : s) {
        auto str = std::format_to(std::back_inserter(buffer), L"{:d}", 42);
    }
}
BENCHMARK(std_wformat_to_2);

void fmt_wformat_to(benchmark::State& s) {
    std::wstring buffer;
    buffer.reserve(12);
    for (auto _ : s)
        auto str = fmt::format_to(std::back_inserter(buffer), L"{}", 42);
}
BENCHMARK(fmt_wformat_to);

void fmt_wformat_to_2(benchmark::State& s) {
    std::wstring buffer;
    buffer.reserve(12);
    for (auto _ : s)
        auto str = fmt::format_to(std::back_inserter(buffer), L"{:d}", 42);
}
BENCHMARK(fmt_wformat_to_2);

BENCHMARK_MAIN();

Benchmark results:

C:\Users\Casey\OneDrive\Desktop>cl /nologo /std:c++latest /EHsc /I c:\STL\stl\inc c:\Users\Casey\OneDrive\Desktop\repro.cpp /I c:\Users\Casey\Repos\vcpkg\installed\x64-windows\include /MD /O2 /Ob3 /utf-8 /link /LIBPATH:c:\Users\Casey\Repos\vcpkg\installed\x64-windows\lib benchmark.lib fmt.lib shlwapi.lib
repro.cpp

C:\Users\Casey\OneDrive\Desktop>repro
2022-01-23T17:36:52-08:00
Running repro
Run on (32 X 3493 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 16384 KiB (x4)
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
std_sprintf            53.1 ns         53.1 ns     10000000
std_format             42.0 ns         42.4 ns     16592593
std_format_2           83.4 ns         83.7 ns      7466667
fmt_format             30.9 ns         30.7 ns     22400000
fmt_format_2           72.7 ns         73.2 ns      8960000
std_format_to          43.3 ns         43.0 ns     16000000
std_format_to_2        83.7 ns         83.7 ns      8960000
fmt_format_to          25.9 ns         25.5 ns     26352941
fmt_format_to_2        52.4 ns         53.1 ns     10000000
std_swprintf           38.9 ns         39.3 ns     18666667
std_wformat             127 ns          126 ns      5600000
std_wformat_2           150 ns          151 ns      4977778
fmt_wformat            33.2 ns         33.0 ns     20363636
fmt_wformat_2          58.5 ns         58.6 ns     11200000
std_wformat_to         65.6 ns         65.6 ns     11200000
std_wformat_to_2       83.6 ns         83.7 ns      8960000
fmt_wformat_to         37.4 ns         36.8 ns     18666667
fmt_wformat_to_2       64.2 ns         64.2 ns     11200000

Looks like format(L"") needs some other perf work as well - fmt and swprintf are crushing us.

CaseyCarter avatar Jan 24 '22 01:01 CaseyCarter