rust-encoding
rust-encoding copied to clipboard
Performance: Consider replacing lookup tables with match statements or binary search in single byte index
The current technique for building the single byte "forward" and "backward" function is to generate lookup tables using gen_index.py
Here's an example generated file: https://github.com/lifthrasiir/rust-encoding/blob/master/src/index/singlebyte/windows_1252.rs
There are some benchmarks that are generated, but they're micro-benchmarks with synthetic data, and I'm not sure they adequately capture how the library would be used in the wild.
So I wrote a few tiny benchmarks that exercise the encoder and decoder at the level they're typically used.
/// Some Latin-1 text to test
//
// the first few sentences of the article "An Ghaeilge" from Irish Wikipedia.
// https://ga.wikipedia.org/wiki/An_Ghaeilge
pub static IRISH_TEXT: &'static str =
"Is ceann de na teangacha Ceilteacha í an Ghaeilge (nó Gaeilge na hÉireann mar a thugtar \
uirthi corruair), agus ceann den dtrí cinn de theangacha Ceilteacha ar a dtugtar na \
teangacha Gaelacha (.i. an Ghaeilge, Gaeilge na hAlban agus Gaeilge Mhanann) go háirithe. \
Labhraítear in Éirinn go príomha í, ach tá cainteoirí Gaeilge ina gcónaí in áiteanna eile ar \
fud an domhain. Is í an teanga náisiúnta nó dhúchais agus an phríomhtheanga oifigiúil i \
bPoblacht na hÉireann í an Ghaeilge. Tá an Béarla luaite sa Bhunreacht mar theanga oifigiúil \
eile. Tá aitheantas oifigiúil aici chomh maith i dTuaisceart Éireann, atá mar chuid den \
Ríocht Aontaithe. Ar an 13 Meitheamh 2005 d'aontaigh airí gnóthaí eachtracha an Aontais \
Eorpaigh glacadh leis an nGaeilge mar theanga oifigiúil oibre san AE";
pub static RUSSIAN_TEXT: &'static str =
"Ру?сский язы?к Информация о файле слушать)[~ 3] один из восточнославянских языков, \
национальный язык русского народа. Является одним из наиболее распространённых языков мира \
шестым среди всех языков мира по общей численности говорящих и восьмым по численности \
владеющих им как родным[9]. Русский является также самым распространённым славянским \
языком[10] и самым распространённым языком в Европе ? географически и по числу носителей \
языка как родного[7]. Русский язык ? государственный язык Российской Федерации, один из \
двух государственных языков Белоруссии, один из официальных языков Казахстана, Киргизии и \
некоторых других стран, основной язык международного общения в Центральной Евразии, в \
Восточной Европе, в странах бывшего Советского Союза, один из шести рабочих языков ООН, \
ЮНЕСКО и других международных организаций[11][12][13].";
#[bench]
fn bench_encode_irish(bencher: &mut test::Bencher) {
bencher.bytes = IRISH_TEXT.len() as u64;
bencher.iter(|| {
test::black_box(
WINDOWS_1252.encode(&ASCII_TEXT, EncoderTrap::Strict)
)
})
}
#[bench]
fn bench_decode_irish(bencher: &mut test::Bencher) {
let bytes = WINDOWS_1252.encode(IRISH_TEXT, EncoderTrap::Strict).unwrap();
bencher.bytes = bytes.len() as u64;
bencher.iter(|| {
test::black_box(
WINDOWS_1252.decode(&bytes, DecoderTrap::Strict)
)
})
}
#[bench]
fn bench_encode_russian(bencher: &mut test::Bencher) {
bencher.bytes = RUSSIAN_TEXT.len() as u64;
bencher.iter(|| {
test::black_box(
ISO_8859_5.encode(&RUSSIAN_TEXT, EncoderTrap::Strict)
)
})
}
#[bench]
fn bench_decode_russian(bencher: &mut test::Bencher) {
let bytes = ISO_8859_5.encode(RUSSIAN_TEXT, EncoderTrap::Strict).unwrap();
bencher.bytes = bytes.len() as u64;
bencher.iter(|| {
test::black_box(
ISO_8859_5.decode(&bytes, DecoderTrap::Strict)
)
})
}
I picked the windows-1252
encoding because it's similar to the old latin-1
standard and can encode the special characters in the Irish text I grabbed, and iso-8859-5
for similar reasons for the Russian test.
I rewrote gen_index.py
to create match
statements instead of building a lookup table. You get something like this:
// AUTOGENERATED FROM index-windows-1252.txt, ORIGINAL COMMENT FOLLOWS:
//
// For details on index index-windows-1252.txt see the Encoding Standard
// https://encoding.spec.whatwg.org/
//
// Identifier: e56d49d9176e9a412283cf29ac9bd613f5620462f2a080a84eceaf974cfa18b7
// Date: 2018-01-06
#[inline]
pub fn forward(code: u8) -> Option<u16> {
match code {
128 => Some(8364),
129 => Some(129),
130 => Some(8218),
131 => Some(402),
132 => Some(8222),
133 => Some(8230),
134 => Some(8224),
135 => Some(8225),
136 => Some(710),
137 => Some(8240),
// a bunch more items
250 => Some(250),
251 => Some(251),
252 => Some(252),
253 => Some(253),
254 => Some(254),
255 => Some(255),
_ => None
}
}
#[inline]
pub fn backward(code: u32) -> Option<u8> {
match code {
8364 => Some(128),
129 => Some(129),
8218 => Some(130),
402 => Some(131),
8222 => Some(132),
8230 => Some(133),
8224 => Some(134),
8225 => Some(135),
710 => Some(136),
8240 => Some(137),
352 => Some(138),
8249 => Some(139),
338 => Some(140),
141 => Some(141),
381 => Some(142),
// a bunch more items
251 => Some(251),
252 => Some(252),
253 => Some(253),
254 => Some(254),
255 => Some(255),
_ => None
}
}
Note that I changed the function signature to return an Option
instead of a sentinel value. That wasn't strictly required, and didn't have a large effect on performance, but makes the code more idiomatic, I think.
I also generated a version that uses a binary search. It's pretty simple.
const BACKWARD_KEYS: &'static [u32] = &[
128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
188, 189, 190, 215, 247, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499,
1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 8206,
8207, 8215
];
const BACKWARD_VALUES: &'static [u8] = &[
128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 162, 163, 164, 165, 166,
167, 168, 169, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 187,
188, 189, 190, 170, 186, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237,
238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 253, 254, 223
];
#[inline]
pub fn backward(code: u32) -> u8 {
if let Ok(index) = BACKWARD_KEYS.binary_search(&code) {
BACKWARD_VALUES[index]
} else {
0
}
}
Here's a table comparing the three techniques (scroll to see entire table):
test | master | match | binary search | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
codec::singlebyte::tests::bench_decode_irish | 3246 | ns/iter | 240 | MB/s | 3171 | ns/iter | 245 | MB/s | 2.08% | |||||
codec::singlebyte::tests::bench_decode_russian | 8508 | ns/iter | 98 | MB/s | 8890 | ns/iter | 94 | MB/s | -4.08% | |||||
codec::singlebyte::tests::bench_encode_irish | 2622 | ns/iter | 310 | MB/s | 1688 | ns/iter | 482 | MB/s | 55.48% | 2243 | ns/iter | 363 | MB/s | 17.10% |
codec::singlebyte::tests::bench_encode_russian | 6692 | ns/iter | 228 | MB/s | 10578 | ns/iter | 144 | MB/s | -36.84% | 10019 | ns/iter | 152 | MB/s | -33.33% |
Obviously the Irish / Windows-1252 case is improved with both alternative techniques, but the Russian case is degraded.
It looks like the decode method isn't changed much, and that makes sense, because the match expressions are contiguous integers, I bet that LLVM is optimizing that down to a lookup table anyways.
I'll try running some more tests.