Use per-word calcs in utf8_length()
This commit changes utf8_length() to read the input a word at a time. The existing per-character method is retained for shorter strings. The per-word method yields significant time savings for very long strings and for typical inputs.
The timings vary depending on the average number of bytes per character in the input. If every character occupied 13 bytes, this commit would always be a loser, as we would be doing one iteration per 8 bytes (or 4 on 32-bit platforms) instead of one per 13. But we don't care about performance for non-Unicode code points, and the maximum legal Unicode code point occupies 4 UTF-8 bytes, which makes this a wash on 32-bit platforms but a real gain on 64-bit ones. And, except for emoji, most text in modern languages is at most 3 bytes per character, with a significant proportion of single-byte characters (e.g., punctuation) even in non-Latin scripts.
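For reference, the standard UTF-8 encoded lengths by code point range are:

U+0000..U+007F      1 byte (ASCII)
U+0080..U+07FF      2 bytes
U+0800..U+FFFF      3 bytes
U+10000..U+10FFFF   4 bytes (the Unicode maximum)

(Perl's extended UTF-8 can encode larger, non-Unicode values in up to UTF8_MAXBYTES = 13 bytes on 64-bit platforms, which is where the 13 above comes from.)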
For very long strings we would expect to use 1/8 the conditionals if the input is entirely ASCII, 1/4 if entirely 2-byte UTF-8, and 1/2 if entirely 4-byte. (For 32-bit systems, the savings are approximately half this.) Because of set-up and tear-down complications, these values are limits that are approached as the string gets longer (which is where it matters most).
The per-word method kicks in for input strings 96 bytes and longer. This value was based on eyeballing some cachegrind output, and could be tweaked, but the differences in time spent on strings this short are tiny.
This function does a half-hearted job of checking for UTF-8 validity; it doesn't do extra work, but it does make sure that the lengths implied by the start bytes it sees all add up. (It doesn't check that the bytes in between are all continuation bytes.) In order to preserve this checking, the new version has to stop the per-word processing a word earlier than it otherwise would.
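To illustrate, here is a hedged sketch of the shape of the per-character loop and its "half-hearted" check (not the actual utf8_length() source; report_malformed() is a hypothetical stand-in for the warning perl actually issues):

    STRLEN len = 0;
    const U8 *s = s0;
    while (s < e) {
        s += UTF8SKIP(s);   /* advance by the length the start byte implies */
        len++;
    }
    if (s != e) {
        /* the lengths implied by the start bytes didn't add up to
         * the buffer length, so the input is malformed */
        report_malformed();  /* hypothetical name */
    }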
There are complications, as the code has to process per-byte to get to a word boundary before it can read per-word. Here are benchmarks for 2-byte characters in the best- and worst-case alignment scenarios. (All benchmarks are for a 64-bit platform.)
Key:
Ir    instruction reads
Dr    data reads
Dw    data writes
COND  conditional branches
IND   indirect branches
The numbers represent relative counts per loop iteration, compared to blead at 100.0%. Higher is better: for example, using half as many instructions gives 200%, while using twice as many gives 50%.
Best case 2-byte scenario: string length 48 characters; 2 bytes per character; 0 bytes after word boundary
blead patch
------ -------
Ir 100.00 123.09
Dr 100.00 130.18
Dw 100.00 111.44
COND 100.00 128.63
IND 100.00 100.00
Worst case 2-byte scenario: string length 48 characters; 2 bytes per character; 7 bytes after word boundary
blead patch
------ -------
Ir 100.00 122.46
Dr 100.00 129.52
Dw 100.00 111.07
COND 100.00 127.65
IND 100.00 100.00
Very long strings execute up to an order of magnitude fewer instructions than blead. Here are worst-case scenarios (7 bytes after the word boundary).
string length 10000000 characters; 1 byte per character
blead patch
------ -------
Ir 100.00 814.53
Dr 100.00 1069.58
Dw 100.00 3296.55
COND 100.00 1575.83
IND 100.00 100.00
string length 5000000 characters; 2 bytes per character
blead patch
------ -------
Ir 100.00 408.86
Dr 100.00 536.32
Dw 100.00 1698.31
COND 100.00 788.72
IND 100.00 100.00
string length 3333333 characters; 3 bytes per character
blead patch
------ -------
Ir 100.00 273.64
Dr 100.00 358.56
Dw 100.00 1165.55
COND 100.00 526.35
IND 100.00 100.00
string length 2500000 characters; 4 bytes per character
blead patch
------ -------
Ir 100.00 206.03
Dr 100.00 269.68
Dw 100.00 899.17
COND 100.00 395.17
IND 100.00 100.00
You're under-selling the speedup and over-estimating the length threshold needed to break even.
The previous code (i.e. x += UTF8SKIP(x);) is bottlenecked by a loop-carried dependency chain, which is not the sort of thing that Valgrind (and thus Porting/bench.pl) can measure.
That's two memory reads per iteration, and it needs to know the result from each read before it can compute the address for the next. So cycles-per-iteration is twice the L1 cache latency.
This PR reduces the dependency chain to a simple integer addition, and thus gets much better instructions-per-cycle.
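For concreteness, UTF8SKIP(s) is defined in utf8.h as a table lookup on the byte s points at, so the old loop looks roughly like this (a sketch of the shape, not the exact source):

    /* #define UTF8SKIP(s) PL_utf8skip[*(const U8 *)(s)] */
    while (x < e) {
        x += UTF8SKIP(x);   /* load *x, then load PL_utf8skip[*x]; only */
        len++;              /* then is the next value of x known        */
    }
    /* Iteration N+1 cannot issue its loads until iteration N's two
     * dependent loads complete, so the chain serializes the loop. */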
So with 48 characters at 2 bytes per character, even though it's only a 1.23x improvement in the number of instructions executed (as you said), I measure a 2.7x actual speedup (using Benchmark::timethis() on a Sandy Bridge CPU). With a megabyte at 1 byte per character, which is an 8x improvement in the number of instructions executed (as you said), I measure a 20x actual speedup. I get a small but still positive speedup with a length threshold as low as 16 bytes.
Did you (Karl) ever benchmark the really old version of utf8_length that used x += UTF8_IS_INVARIANT(*x) ? 1 : UTF8SKIP(x);? That branchy version was in perl prior to 2012, and you reverted it in commit 4cbf4130b603602204e51a4eaae4cab886651611, and the commit message sounds like you just assumed that branchless would be faster.
But my previous comment about dependency chains and the inadequacy of Porting/bench.pl applies to this too: The branchy version was actually faster than branchless on any string that has a significant fraction of 1-byte characters, because the branch predictor breaks the loop-carried dependency chain.
It's not as good as this PR, but it's better than what we had for the last 10 years.
Since we can't use word-at-a-time everywhere, we could change the definition of UTF8SKIP to apply the branchy version to all the other callsites that are still scalar.
@pengvado did you measure in the worst case, where the starting byte is 1 past a word boundary?
It's fairly arbitrary where the cut-off is, since the delta in times is small for small strings
@pengvado I just assumed "no-branch good", "branch bad".
I don't think it is wise to change UTF8SKIP itself, as that is a general-purpose macro that isn't necessarily used in a loop.
@khwilliamson
@pengvado did you measure in the worst case, where the starting byte is 1 past a word boundary?
It's fairly arbitrary where the cut-off is, since the delta in times is small for small strings
Not sure what the speed of the current solution is, for comparison.
Implemented in pure C, this approach is superior to all other solutions (except SIMD-optimised ones), even when ignoring alignment and using memcpy() (see https://github.com/wollmers/utf8-bench):
/*
 * Word-at-a-time UTF-8 character count, after
 * http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
 * (Colin Percival)
 */
#include <stdint.h>
#include <string.h>

/* 0x0101...01: a 1 in the low bit of every byte of a size_t */
#define ONEMASK ((size_t)(-1) / 0xFF)

static size_t
cp_utf8_chars (const uint8_t *buf, size_t len) {
    size_t offset = 0;
    size_t continuations = 0;

    /* Try the next block of sizeof(size_t) bytes as one word. */
    while (offset + sizeof(size_t) < len) {
        size_t v;
        memcpy(&v, buf + offset, sizeof(size_t));
        /* Set the low bit of each byte iff that byte is a UTF-8
         * continuation byte (bit pattern 10xxxxxx). */
        v = ((v & (ONEMASK * 0x80)) >> 7) & ((~v) >> 6);
        /* Multiplying by ONEMASK sums the per-byte bits into the
         * top byte; shift it down and accumulate. */
        continuations += (v * ONEMASK) >> ((sizeof(size_t) - 1) * 8);
        offset += sizeof(size_t);
    }

    /* Take care of any left-over bytes. */
    while (offset < len) {
        uint8_t byte = buf[offset];
        /* Is this byte NOT the first byte of a character? */
        continuations += (byte >> 7) & ((~byte) >> 6);
        offset++;
    }

    /* Every byte that isn't a continuation byte starts a character. */
    return (len - continuations);
}
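For reference, a minimal usage sketch of the function above (my own example, not part of utf8-bench):

    #include <stdio.h>

    int main(void) {
        const char *s = "h\xC3\xA9llo";   /* "héllo": 6 bytes, 5 characters */
        printf("%zu\n", cp_utf8_chars((const uint8_t *)s, strlen(s)));
        /* prints: 5 */
        return 0;
    }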
This gives with -O1 (to avoid too much optimisation):
utf8-bench/src$ date; make clean; make; time ./utf8bench
So 17 Jul 2022 12:43:38 CEST
rm -rf *.o utf8bench
cc -std=c99 -pedantic -Wall -O1 -c -o utf8bench.o utf8bench.c
cc -std=c99 -pedantic -Wall -O1 -o utf8bench utf8bench.o
[jb_ascii_l7] bytes: 7 chars: 7 expect: 7
[...]
[jb_utf8_104] bytes: 116 chars: 104 expect: 104
[...]
[jb_ascii_l7] iters: 100 M s: 0.531264 Mops/s: 188.2 Mchr/s: 1317.6 7
[jb_utf_104] iters: 2 M s: 0.543372 Mops/s: 3.7 Mchr/s: 382.8 104
[cp_ascii_l7] iters: 100 M s: 0.140800 Mops/s: 710.2 Mchr/s: 4971.6 7
[cp_utf_104] iters: 100 M s: 1.279760 Mops/s: 78.1 Mchr/s: 8126.5 104
[sc_ascii_l7] iters: 100 M s: 0.517990 Mops/s: 193.1 Mchr/s: 1351.4 7
[sc_utf_104] iters: 100 M s: 6.291771 Mops/s: 15.9 Mchr/s: 1653.0 104
Total: 40.503896 seconds
real 0m40.753s
user 0m40.491s
sys 0m0.016s
Benchmarks are a PITA. The above may not be precise; I need to write a better one with random data. But something in the range of 4-8 GB/s, and a 5-fold speedup from processing 8 bytes at a time, seems reasonable.
With a thin wrapper in XS:
utf8-bench/xt$ date; time perl 50_utf8_bench.t
So 17 Jul 2022 19:04:23 CEST
ok 1
$string: Abcdefg chars: 7 bytes: 7
Rate jb_chars cp_chars sc_chars length
jb_chars 20201922/s -- -10% -13% -84%
cp_chars 22378146/s 11% -- -4% -83%
sc_chars 23198583/s 15% 4% -- -82%
length 130296426/s 545% 482% 462% --
$string: Abcdefgh chars: 8 bytes: 8
Rate jb_chars cp_chars sc_chars length
jb_chars 18034477/s -- -12% -19% -87%
cp_chars 20388977/s 13% -- -9% -85%
sc_chars 22332754/s 24% 10% -- -84%
length 139810133/s 675% 586% 526% --
$string: Chſerſplzon chars: 11 bytes: 13
Rate jb_chars sc_chars cp_chars length
jb_chars 16287053/s -- -22% -28% -78%
sc_chars 20863195/s 28% -- -7% -72%
cp_chars 22526952/s 38% 8% -- -70%
length 74644392/s 358% 258% 231% --
$string: राज्यराज्य chars: 9 bytes: 27
Rate jb_chars sc_chars cp_chars length
jb_chars 16834936/s -- -17% -18% -81%
sc_chars 20201923/s 20% -- -2% -77%
cp_chars 20560314/s 22% 2% -- -77%
length 88543854/s 426% 338% 331% --
$string: Chſerſplzon x 8 chars: 88 bytes: 104
Rate jb_chars sc_chars cp_chars length
jb_chars 3780922/s -- -75% -78% -95%
sc_chars 15252015/s 303% -- -11% -80%
cp_chars 17149607/s 354% 12% -- -78%
length 76865311/s 1933% 404% 348% --
1..1
real 2m45.888s
user 2m45.765s
sys 0m0.093s
That's an order of magnitude slower than pure C because of the call overhead, and the differences between implementations are hidden in the random deviations.
So far the threshold of 96 bytes doesn't seem justified (to me). In the worst case 7 bytes are unaligned, so strings of 7 + 8 = 15 bytes can already profit. Also, in single-byte mode (for the leading and trailing bytes) the bit operations seem faster than the branchy version.
Interestingly enough, this compiles down to code that works 16 bytes at a time, or even 32 bytes at a time with a sufficient -march
That's true, but it has little effect (+3-17% faster on long strings), even if continuations += _mm_popcnt_u64(v); is used.
With
cc -std=c99 -march=native -pedantic -Wall -O3
the simple scalar fallback
#include <stddef.h>

/* Note: buf[i] <= -65 identifies UTF-8 continuation bytes
 * (0x80..0xBF read as -128..-65), which assumes char is signed;
 * the cast makes that explicit and portable. */
static size_t sc_strlen_utf8 (const char *buf, size_t len) {
    size_t continuations = 0;
    for (size_t i = 0; i < len; i++) {
        if ((signed char)buf[i] <= -65) { continuations++; }
    }
    return (len - continuations);
}
doubles the speed and is the winner on an i7-9750H: +11 to +16% (Japanese), +18% (German), and +22 to +24% (ASCII) faster than the portable solution in this PR.
Also interestingly enough, the fastest and smallest lookup-table implementation (like the one the fast Unicode::UTF8 uses) reaches only 0.35 GB/s (ASCII, German) and 1 GB/s (Japanese), compared to 7.5 GB/s for the scalar solution. That's 20-fold.
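For context, a hedged sketch of what such a lookup-table counter looks like (table contents and names are illustrative, analogous to Perl's PL_utf8skip; this is not the Unicode::UTF8 source):

    #include <stddef.h>
    #include <stdint.h>

    static uint8_t u8skip[256];          /* illustrative table, filled below */

    static void init_u8skip (void) {
        for (int b = 0; b < 256; b++)    /* sequence length a start byte implies */
            u8skip[b] = b < 0xC0 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    }

    static size_t lut_strlen_utf8 (const uint8_t *buf, size_t len) {
        size_t i = 0, chars = 0;
        while (i < len) {
            i += u8skip[buf[i]];         /* two dependent loads per character */
            chars++;
        }
        return chars;
    }

The per-character dependent loads are exactly the chain pengvado described, which is consistent with this approach losing badly to the branchless scalar loop.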
Do you mind explaining the point about dependency chains in more detail? I for one would like a better understanding of this so I can avoid such issues in other code.
Yves
https://en.wikipedia.org/wiki/Loop_dependence_analysis
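To make that concrete, here is a minimal illustration (my own sketch, not code from the PR) of a loop whose next load depends on the previous load's result, versus one whose loads are independent:

    #include <stddef.h>

    struct node { struct node *next; };

    /* Loop-carried dependency: the address of the next load is the
     * result of the current load, so iterations serialize at one
     * L1 cache latency each. */
    static size_t chase (const struct node *p) {
        size_t n = 0;
        while (p) { p = p->next; n++; }
        return n;
    }

    /* No loop-carried load dependency: buf + i is known in advance,
     * so the CPU can keep many loads in flight and overlap iterations. */
    static size_t count_high_bits (const unsigned char *buf, size_t len) {
        size_t n = 0;
        for (size_t i = 0; i < len; i++)
            n += buf[i] >> 7;   /* count bytes with the high bit set */
        return n;
    }

Both loops execute a similar number of instructions per element, but the first runs at one cache latency per iteration while the second approaches one iteration per cycle. x += UTF8SKIP(x) is in the first category: each step needs the loaded byte before it can index the skip table, and the table result before it knows the next address.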
I don't see any reason in the feedback to this PR not to merge it. @wollmers, if you think the threshold should change to a lower number, let's address that in a separate PR/Issue.
@pengvado if you have data that suggests we should use a different threshold or approach could you file it as a separate issue or PR please?