
utf8 parsing performance

Open ConnyOnny opened this issue 8 years ago • 4 comments

Hi, I was eager to benchmark your table-based utf8 parsing approach against the standard library implementation, so I did: https://github.com/ConnyOnny/utf8perf

If my testing setup is not wrong (see main.rs), it seems that avoiding branches is not everything.

ConnyOnny, Jan 11 '17 11:01

Thanks for putting this together! I've been wanting to do some benchmark work.

There were a few problems with your test setup. I opened a PR. That said, the results aren't much better, but at least they are correct!

Read 21078000 bytes.
Parser "tbl" needed a median 0.055256400 seconds to parse 11431500 characters.
Parser "std" needed a median 0.029445756 seconds to parse 11431500 characters.
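
For readers who want to reproduce numbers of this shape, here is a minimal sketch of a median-of-N timing helper. It is not the code from utf8perf; `median_time` and `count_chars_std` are hypothetical names made up for illustration, and the "std" case is assumed to mean `std::str::from_utf8` followed by counting `chars()`.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `runs` times and returns the median wall-clock duration.
/// For an even number of runs this picks the upper of the two middle samples.
pub fn median_time<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    assert!(runs > 0, "need at least one run");
    let mut samples: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples[samples.len() / 2]
}

/// Hypothetical "std" case: validate with the standard library, then count
/// the decoded characters. Returns 0 if the input is not valid UTF-8.
pub fn count_chars_std(input: &[u8]) -> usize {
    std::str::from_utf8(input)
        .map(|s| s.chars().count())
        .unwrap_or(0)
}
```

When timing, wrapping the result in `std::hint::black_box` (for example `median_time(101, || { std::hint::black_box(count_chars_std(&data)); })`) helps keep the optimizer from discarding the work being measured.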

Going to mark this as a bug because we should be able to beat std easily.

jwilm, Jan 11 '17 16:01

Hi, some years ago I implemented a UTF-8 decoder with the same table, and used Björn Höhrmann's article (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) as a reference for benchmarking. In his version the state/mask table is more compact than the 8*256 bytes used by utf8parse, and thus more cache friendly.
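
To make the "compact table" idea concrete, here is a rough sketch of a two-level DFA validator: one 256-entry table maps each byte to a character class, and a small second table maps (state, class) to the next state. This is not the table from the article and not utf8parse's code. The class and state numbering, and every name below, are made up for illustration. In this layout the tables take 256 + 9*12 = 364 bytes.

```rust
const ACCEPT: u8 = 0; // a complete, valid sequence has just ended
const REJECT: u8 = 1; // invalid input; this state is never left
const NUM_STATES: usize = 9;
const NUM_CLASSES: usize = 12;

/// Maps a byte to one of 12 character classes (illustrative numbering).
const fn class_of(b: u8) -> u8 {
    match b {
        0x00..=0x7F => 0,                // ASCII
        0x80..=0x8F => 1,                // continuation, low range
        0x90..=0x9F => 2,                // continuation, middle range
        0xA0..=0xBF => 3,                // continuation, high range
        0xC2..=0xDF => 4,                // lead byte of a 2-byte sequence
        0xE0 => 5,                       // 3-byte lead, must avoid overlongs
        0xE1..=0xEC | 0xEE..=0xEF => 6,  // ordinary 3-byte lead
        0xED => 7,                       // 3-byte lead, must avoid surrogates
        0xF0 => 8,                       // 4-byte lead, must avoid overlongs
        0xF1..=0xF3 => 9,                // ordinary 4-byte lead
        0xF4 => 10,                      // 4-byte lead, must stay <= U+10FFFF
        _ => 11,                         // C0, C1, F5..FF: never valid
    }
}

/// One DFA step. States 2 and 3 expect one or two more continuation bytes;
/// states 4..=8 restrict the first continuation after E0, ED, F0, F1..F3, F4.
const fn step(state: u8, class: u8) -> u8 {
    match (state, class) {
        (ACCEPT, 0) => ACCEPT,
        (ACCEPT, 4) => 2,
        (ACCEPT, 5) => 4,
        (ACCEPT, 6) => 3,
        (ACCEPT, 7) => 5,
        (ACCEPT, 8) => 6,
        (ACCEPT, 9) => 7,
        (ACCEPT, 10) => 8,
        (2, 1..=3) => ACCEPT,
        (3, 1..=3) => 2,
        (4, 3) => 2,      // after E0: only A0..BF
        (5, 1..=2) => 2,  // after ED: only 80..9F
        (6, 2..=3) => 3,  // after F0: only 90..BF
        (7, 1..=3) => 3,  // after F1..F3: any continuation
        (8, 1) => 3,      // after F4: only 80..8F
        _ => REJECT,
    }
}

const fn build_classes() -> [u8; 256] {
    let mut table = [0u8; 256];
    let mut b = 0;
    while b < 256 {
        table[b] = class_of(b as u8);
        b += 1;
    }
    table
}

const fn build_transitions() -> [u8; NUM_STATES * NUM_CLASSES] {
    let mut table = [REJECT; NUM_STATES * NUM_CLASSES];
    let mut s = 0;
    while s < NUM_STATES {
        let mut c = 0;
        while c < NUM_CLASSES {
            table[s * NUM_CLASSES + c] = step(s as u8, c as u8);
            c += 1;
        }
        s += 1;
    }
    table
}

static CLASS: [u8; 256] = build_classes();
static TRANS: [u8; NUM_STATES * NUM_CLASSES] = build_transitions();

/// Validates `bytes` with two table lookups per input byte.
pub fn is_valid_utf8(bytes: &[u8]) -> bool {
    let mut state = ACCEPT;
    for &b in bytes {
        let class = CLASS[b as usize] as usize;
        state = TRANS[state as usize * NUM_CLASSES + class];
        if state == REJECT {
            return false;
        }
    }
    state == ACCEPT
}
```

The trade-off is an extra dependent lookup per byte in exchange for a table that fits in a handful of cache lines. Whether that wins in practice would need measuring.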

carl-erwin, Jan 12 '17 10:01

I've done some minimal optimization effort in #8. When I've got a bit more time, I plan to look into Björn Höhrmann's article mentioned by @carl-erwin to see if we can do better.

As to why the std parser does so much better, this seems to be due to optimizations that become available when it's possible to view multiple bytes at once.
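
As an illustration of what "viewing multiple bytes at once" can mean, here is a minimal sketch of an ASCII fast path that tests eight bytes per step by checking the high bit of each byte in a `u64`. It is not the standard library's code. `ascii_prefix_len` is a hypothetical helper, and a byte-at-a-time push parser may not be able to use this trick directly because it never sees a whole slice.

```rust
/// Returns the length of the leading all-ASCII prefix of `bytes`,
/// examining eight bytes per step where possible.
pub fn ascii_prefix_len(bytes: &[u8]) -> usize {
    // A byte is non-ASCII exactly when its top bit is set.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

    let mut i = 0;
    while i + 8 <= bytes.len() {
        let mut chunk = [0u8; 8];
        chunk.copy_from_slice(&bytes[i..i + 8]);
        let word = u64::from_le_bytes(chunk);
        if word & HIGH_BITS != 0 {
            // Some byte in this chunk is non-ASCII; locate it below.
            break;
        }
        i += 8;
    }
    // Finish byte by byte: handles the tail and pinpoints the first non-ASCII byte.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}
```

A caller could skip the returned prefix and hand only the remainder to a slower multi-byte decoder, so mostly-ASCII input costs roughly one comparison per eight bytes.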

jwilm, Jul 11 '17 17:07

You might also be interested in encoding_rs, which is currently shipping in Firefox.

luser, Dec 06 '17 14:12