Decode regression
This is not a recent regression, and perhaps it won't be fixed for that reason, but I thought I'd file it anyway.
I maintain Go bindings for this library, and by sheer luck I had benchmarks in place when I started. At some point I noticed a regression in decoding, but only now got around to investigating it. Long story short, I bisected this repo and root-caused it to this PR. Below is the benchmark I used to find it.
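For anyone who wants to repeat this, the bisection was just a standard `git bisect` loop, re-running the benchmark at each step. A sketch of that kind of session (the commit reference is a placeholder, not the actual one I used):

```sh
git bisect start
git bisect bad HEAD                  # decode is slow here
git bisect good <last-known-fast>    # decode was still fast here
# git now checks out a candidate commit; run the benchmark...
cargo bench decode
# ...and judge by the absolute decode time, then mark it:
git bisect good   # or: git bisect bad
```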
Regression details:

```
decode    time:   [3.9277 µs 3.9409 µs 3.9558 µs]
          change: [+241.37% +242.64% +244.06%] (p = 0.00 < 0.05)
          Performance has regressed.
```
While decode is still pretty fast in absolute terms (on the order of microseconds), a +240% slowdown is fairly big, and I wonder if we can gain back that performance.
Benchmark code (`tokenizers/benches/decode_benchmark.rs`):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::tokenizer::Tokenizer;

fn decode(tokenizer: &Tokenizer, ids_slice: Vec<u32>, skip_special_tokens: bool) -> String {
    tokenizer
        .decode(ids_slice, skip_special_tokens)
        .expect("failed to decode input")
}

fn criterion_benchmark(c: &mut Criterion) {
    let tokenizer = Tokenizer::from_file("./test/data/bert-base-uncased.json")
        .expect("failed to create tokenizer");
    c.bench_function("decode", |b| {
        b.iter(|| {
            decode(
                &tokenizer,
                black_box([2829, 4419, 14523, 2058, 1996, 13971, 3899].to_vec()),
                black_box(true),
            )
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```
Add this to `Cargo.toml` and run with `cargo bench decode`:

```toml
[[bench]]
name = "decode_benchmark"
harness = false
```
The tokenizer file is copied from here.
Oh wow, thanks a million for the clear report! Yeah, 240% is not great! Okay, I'll investigate, but in the meantime, if you have a PR with a fix, that would be super good!
I'll add this to benches! 🆙
I checked out the commit you mentioned; I had to create a custom branch to test: test-old-decode. I cherry-picked the test I added in fast-decode-fix, which has just the benchmark, and I'll post the results here!
For now, it seems like the issue was fixed somewhere in between.
You mean you don't see a regression at head? I just synced up to confirm the performance drop. Don't look at the change %, since that seems to report changes relative to the last run of the benchmark; just compare absolute values. For example, here is what I got on two consecutive runs at head:
```
decode    time:   [3.8542 µs 3.8795 µs 3.9159 µs]
          change: [+234.61% +236.83% +240.08%] (p = 0.00 < 0.05)
          Performance has regressed.

decode    time:   [3.7643 µs 3.7762 µs 3.7891 µs]
          change: [-3.8222% -2.8277% -2.1576%] (p = 0.00 < 0.05)
          Performance has improved.
```
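Side note: when comparing branches, Criterion can pin the comparison point explicitly instead of always diffing against the previous run. If I remember its CLI correctly (treat the exact flags as an assumption), something like this should work:

```sh
# on the known-good commit: save a named baseline
cargo bench --bench decode_benchmark -- --save-baseline before
# on the suspect commit: compare against that baseline, not the last run
cargo bench --bench decode_benchmark -- --baseline before
```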
I'll check again! I was getting 630 ns for both branches, but let me make sure of that!
Oh, I just checked the test-old-code branch, and it contains the commit (#938) that introduced the regression. You want to branch one commit earlier than that :)
Yeah, I think I cherry-picked it to run before/after. I'll check again!
Since encode perf was improved in the 0.20 release, are there any plans to look at this?
Yep, for sure 😉 As you've seen, I was more focused on encode, but I will pick this back up!
@ArthurZucker any updates? :)
@ArthurZucker
You were not able to reproduce the regression because the tokenizer in the test-old-code branch is missing a decoder and is not the original one supplied in this issue. I can reproduce the slowdown, but short of changing the API to one perhaps based on iterators, there doesn't seem to be a simple fix. The decoder API changed to return a Vec<String> instead of a String, so some regression is expected.
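To illustrate why the return type alone costs something, here is a toy sketch (not the library's actual decode path; `id.to_string()` stands in for the real per-token work). A single String lets every token append into one growing buffer, while Vec<String> forces one heap allocation per token, and a caller that wants one String back then pays for a join that copies everything again:

```rust
use std::fmt::Write;

// Old shape: append each token's text into one growing buffer.
fn decode_to_string(ids: &[u32]) -> String {
    let mut out = String::new();
    for id in ids {
        let _ = write!(out, "{id}");
    }
    out
}

// New shape: one heap-allocated String per token; getting a single
// String back then requires a join that copies everything again.
fn decode_to_vec(ids: &[u32]) -> Vec<String> {
    ids.iter().map(|id| id.to_string()).collect()
}

fn main() {
    let ids = [2829u32, 4419, 14523, 2058, 1996, 13971, 3899];
    assert_eq!(decode_to_string(&ids), decode_to_vec(&ids).join(""));
}
```

An iterator-based API could avoid both the per-token allocations and the final copy, which is why I mentioned it above.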
With the changes from #1707, the regression is a bit lower: +210%.
Yep, sorry, I don't really have much bandwidth to go back to this right now! Super sorry for not answering sooner. Again, if anyone wants to have a look, it would help!