tokenizers
Adding a Trie for WordPiece for faster encoding (BERT).
The result seems a little unsatisfying and needs more digging to confirm. It is probably linked to extra allocations: the Trie needs a prefix to distinguish tokens carrying the continuing subword prefix from those that don't, and there may be cleaner ways to handle this.
It also seems tied to the algorithm itself: even after switching to a full aho-corasick automaton, the benchmark is still slower. Either the benchmark is not representative, or the chunks are small enough that plain hash-map lookups beat actually iterating over the string. A sketch of the trie-based approach is shown below.
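For context, the core idea is to replace repeated substring lookups in the vocabulary map with a single walk over a prefix trie that returns the longest matching token. Below is a minimal, self-contained sketch of that approach, not the crate's actual implementation: the type and function names (`Trie`, `wordpiece_tokenize`), the way continuation pieces are handled by prepending the `##` prefix to build a lookup key, and the tiny example vocabulary are all illustrative assumptions.

```rust
use std::collections::HashMap;

/// Minimal trie keyed on chars; a node may mark the end of a vocabulary token.
#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_token: bool,
}

#[derive(Default)]
struct Trie {
    root: TrieNode,
}

impl Trie {
    fn insert(&mut self, token: &str) {
        let mut node = &mut self.root;
        for c in token.chars() {
            node = node.children.entry(c).or_default();
        }
        node.is_token = true;
    }

    /// Byte length of the longest vocabulary token that prefixes `s`, if any.
    fn longest_prefix(&self, s: &str) -> Option<usize> {
        let mut node = &self.root;
        let mut best = None;
        for (i, c) in s.char_indices() {
            match node.children.get(&c) {
                Some(next) => {
                    node = next;
                    if node.is_token {
                        best = Some(i + c.len_utf8());
                    }
                }
                None => break,
            }
        }
        best
    }
}

/// Greedy longest-match WordPiece tokenization of a single word, using one
/// trie walk per piece instead of repeated substring lookups in a HashMap.
fn wordpiece_tokenize(word: &str, trie: &Trie, prefix: &str, unk: &str) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        // Naive continuation handling: build a `##`-prefixed key so that
        // continuation pieces are matched against prefixed trie entries.
        // This allocates a fresh string on every step, which is exactly the
        // kind of extra allocation mentioned above.
        let candidate: String = if start == 0 {
            word.to_string()
        } else {
            format!("{}{}", prefix, &word[start..])
        };
        match trie.longest_prefix(&candidate) {
            Some(len) if len > 0 => {
                pieces.push(candidate[..len].to_string());
                // Advance by the matched length, minus the artificial prefix
                // prepended for continuation pieces.
                let consumed = if start == 0 { len } else { len - prefix.len() };
                start += consumed;
            }
            _ => return vec![unk.to_string()],
        }
    }
    pieces
}

fn main() {
    let mut trie = Trie::default();
    for tok in ["un", "##aff", "##able", "##ffa", "##ble"] {
        trie.insert(tok);
    }
    // Prints ["un", "##aff", "##able"] for the classic example.
    println!("{:?}", wordpiece_tokenize("unaffable", &trie, "##", "[UNK]"));
}
```

A cleaner variant might store continuation tokens in a second trie without the prefix, or walk the trie over bytes rather than building a prefixed key, to avoid the per-step allocation entirely.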
Benchmark results (Criterion):

Benchmarking WordPiece BERT encode: Collecting 20 samples in estimated 5.0029 s (198k iterations)
WordPiece BERT encode   time:   [24.729 us 25.178 us 25.725 us]
                        change: [+11.588% +16.472% +21.134%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe

Benchmarking WordPiece BERT encode batch: Collecting 20 samples in estimated 6.8892 s (630 iterations)
WordPiece BERT encode batch
                        time:   [10.152 ms 10.422 ms 10.668 ms]
                        change: [+3.7619% +7.9638% +12.317%] (p = 0.00 < 0.05)
                        Performance has regressed.

Benchmarking WordPiece Train vocabulary (small): Collecting 10 samples in estimated 6.0906 s
WordPiece Train vocabulary (small)
                        time:   [21.804 ms 22.088 ms 22.507 ms]
                        change: [-2.8880% +2.4280% +7.8018%] (p = 0.39 > 0.05)
                        No change in performance detected.

Benchmarking WordPiece Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.7s.
Benchmarking WordPiece Train vocabulary (big): Collecting 10 samples in estimated 9.7402 s (10 iterations)
WordPiece Train vocabulary (big)
                        time:   [990.57 ms 997.72 ms 1.0050 s]
                        change: [+2.9452% +6.7022% +10.461%] (p = 0.00 < 0.05)
                        Performance has regressed.
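For readers who want to reproduce numbers in this format, here is a minimal sketch of how a Criterion benchmark producing output like the above is typically written; the benchmark name, the placeholder input, and the commented-out `encode` call are assumptions, not the crate's actual bench code.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_wordpiece_encode(c: &mut Criterion) {
    // Build the tokenizer outside the measured closure so only encoding is timed.
    // let tokenizer = ...;
    c.bench_function("WordPiece BERT encode", |b| {
        b.iter(|| {
            // black_box keeps the compiler from optimizing the input away.
            let input = black_box("Hello, y'all! How are you?");
            // tokenizer.encode(input, false)
            input.len()
        })
    });
}

criterion_group!(benches, bench_wordpiece_encode);
criterion_main!(benches);
```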