
Fast encode

Open ArthurZucker opened this issue 8 months ago • 1 comment

Try to make our code faster :)

From an initial bench for GPT2 (a rough reproduction sketch follows the list):

  • 20% of the time is spent in the pre_tokenizer when doing batch encoding
  • 8% for no cache
  • xx% for added tokens (not 100% sure; need to remove them and add them back, plus add other tokens as well)
  • removing "ïng" reduces performance by 700% lol
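
For context, here is a minimal Criterion sketch of how such encode / encode-batch timings can be collected. The tokenizer file path, sample text, and bench names are assumptions, not the exact setup of the repo's benches (which assemble the GPT-2 pipeline from the vocab/merges files and disable the BPE merge cache for the "no cache" variants):

```rust
// Minimal sketch of a Criterion bench for single vs. batch encoding.
// "data/gpt2-tokenizer.json", the sample line, and the bench names are
// placeholders; the repo's benches build the GPT-2 pipeline by hand
// from the vocab/merges files instead of loading a tokenizer.json.
use criterion::{criterion_group, criterion_main, Criterion};
use tokenizers::Tokenizer;

fn bench_gpt2_encode(c: &mut Criterion) {
    let tokenizer = Tokenizer::from_file("data/gpt2-tokenizer.json").unwrap();
    let line = "The quick brown fox jumps over the lazy dog";
    // A fixed batch so the encode_batch timings are comparable across runs.
    let batch: Vec<&str> = vec![line; 1_000];

    c.bench_function("GPT2 encode (sketch)", |b| {
        b.iter(|| tokenizer.encode(line, false).unwrap())
    });
    c.bench_function("GPT2 encode batch (sketch)", |b| {
        // encode_batch takes ownership of the inputs, hence the clone.
        b.iter(|| tokenizer.encode_batch(batch.clone(), false).unwrap())
    });
}

criterion_group!(benches, bench_gpt2_encode);
criterion_main!(benches);
```

Run via `cargo bench`; Criterion then reports estimates like the ones pasted below.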

Initial bench results:

    Finished `bench` profile [optimized] target(s) in 32.40s
     Running benches/bert_benchmark.rs (target/release/deps/bert_benchmark-978096f5c7d2a77c)
Gnuplot not found, using plotters backend
Benchmarking WordPiece BERT encode
Benchmarking WordPiece BERT encode: Warming up for 3.0000 s
Benchmarking WordPiece BERT encode: Collecting 20 samples in estimated 5.0031 s (284970 iterations)
Benchmarking WordPiece BERT encode: Analyzing
WordPiece BERT encode   time:   [17.399 µs 17.406 µs 17.416 µs]
                        change: [-2.1128% -1.9745% -1.8658%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [17.399 µs 17.416 µs] R^2            [0.9999587 0.9999530]
mean   [17.413 µs 17.439 µs] std. dev.      [21.867 ns 38.768 ns]
median [17.403 µs 17.448 µs] med. abs. dev. [9.7665 ns 47.692 ns]

Benchmarking WordPiece BERT encode batch
Benchmarking WordPiece BERT encode batch: Warming up for 3.0000 s
Benchmarking WordPiece BERT encode batch: Collecting 20 samples in estimated 5.5509 s (1890 iterations)
Benchmarking WordPiece BERT encode batch: Analyzing
WordPiece BERT encode batch
                        time:   [2.8891 ms 2.8920 ms 2.8945 ms]
                        change: [-19.384% -19.126% -18.887%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [2.8891 ms 2.8945 ms] R^2            [0.9998221 0.9998317]
mean   [2.8851 ms 2.8940 ms] std. dev.      [7.3858 µs 12.625 µs]
median [2.8833 ms 2.8963 ms] med. abs. dev. [4.4561 µs 16.018 µs]

Benchmarking WordPiece Train vocabulary (small)
Benchmarking WordPiece Train vocabulary (small): Warming up for 3.0000 s
Benchmarking WordPiece Train vocabulary (small): Collecting 10 samples in estimated 5.7655 s (220 iterations)
Benchmarking WordPiece Train vocabulary (small): Analyzing
WordPiece Train vocabulary (small)
                        time:   [25.873 ms 25.988 ms 26.085 ms]
                        change: [-1.5674% -0.8238% -0.0549%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
slope  [25.873 ms 26.085 ms] R^2            [0.9990557 0.9991575]
mean   [25.924 ms 26.231 ms] std. dev.      [114.50 µs 334.27 µs]
median [25.868 ms 26.258 ms] med. abs. dev. [64.904 µs 427.20 µs]

Benchmarking WordPiece Train vocabulary (big)
Benchmarking WordPiece Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.8s.
Benchmarking WordPiece Train vocabulary (big): Collecting 10 samples in estimated 7.8009 s (10 iterations)
Benchmarking WordPiece Train vocabulary (big): Analyzing
WordPiece Train vocabulary (big)
                        time:   [770.56 ms 775.31 ms 780.55 ms]
                        change: [-3.4898% -1.7298% -0.1796%] (p = 0.07 > 0.05)
                        No change in performance detected.
mean   [770.56 ms 780.55 ms] std. dev.      [3.8811 ms 11.788 ms]
median [769.56 ms 779.70 ms] med. abs. dev. [387.67 µs 13.909 ms]

     Running benches/bpe_benchmark.rs (target/release/deps/bpe_benchmark-4074fa6f48a53e0d)
Gnuplot not found, using plotters backend
Benchmarking BPE GPT2 encode
Benchmarking BPE GPT2 encode: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode: Collecting 20 samples in estimated 5.0014 s (470190 iterations)
Benchmarking BPE GPT2 encode: Analyzing
BPE GPT2 encode         time:   [10.756 µs 10.764 µs 10.775 µs]
                        change: [-4.0388% -3.9526% -3.8664%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [10.756 µs 10.775 µs] R^2            [0.9999103 0.9998945]
mean   [10.757 µs 10.769 µs] std. dev.      [9.0797 ns 18.260 ns]
median [10.752 µs 10.767 µs] med. abs. dev. [4.5995 ns 20.446 ns]

Benchmarking BPE GPT2 encode batch
Benchmarking BPE GPT2 encode batch: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode batch: Collecting 20 samples in estimated 5.0345 s (1470 iterations)
Benchmarking BPE GPT2 encode batch: Analyzing
BPE GPT2 encode batch   time:   [3.3300 ms 3.3363 ms 3.3424 ms]
                        change: [-7.9889% -7.7761% -7.5698%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [3.3300 ms 3.3424 ms] R^2            [0.9995406 0.9995482]
mean   [3.3300 ms 3.3407 ms] std. dev.      [8.6931 µs 15.591 µs]
median [3.3271 ms 3.3445 ms] med. abs. dev. [6.2592 µs 19.177 µs]

Benchmarking BPE GPT2 encode, no cache
Benchmarking BPE GPT2 encode, no cache: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode, no cache: Collecting 20 samples in estimated 5.0026 s (273420 iterations)
Benchmarking BPE GPT2 encode, no cache: Analyzing
BPE GPT2 encode, no cache
                        time:   [18.453 µs 18.462 µs 18.469 µs]
                        change: [-1.5596% -1.1521% -0.8580%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
slope  [18.453 µs 18.469 µs] R^2            [0.9999700 0.9999731]
mean   [18.453 µs 18.494 µs] std. dev.      [13.041 ns 83.137 ns]
median [18.450 µs 18.470 µs] med. abs. dev. [9.6415 ns 29.639 ns]

Benchmarking BPE GPT2 encode batch, no cache
Benchmarking BPE GPT2 encode batch, no cache: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode batch, no cache: Collecting 20 samples in estimated 5.5411 s (1680 iterations)
Benchmarking BPE GPT2 encode batch, no cache: Analyzing
BPE GPT2 encode batch, no cache
                        time:   [3.2316 ms 3.2393 ms 3.2461 ms]
                        change: [-17.699% -17.461% -17.213%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
slope  [3.2316 ms 3.2461 ms] R^2            [0.9993385 0.9993742]
mean   [3.2317 ms 3.2450 ms] std. dev.      [10.441 µs 19.642 µs]
median [3.2313 ms 3.2424 ms] med. abs. dev. [4.9099 µs 24.161 µs]

Benchmarking BPE Train vocabulary (small)
Benchmarking BPE Train vocabulary (small): Warming up for 3.0000 s
Benchmarking BPE Train vocabulary (small): Collecting 10 samples in estimated 5.3267 s (220 iterations)
Benchmarking BPE Train vocabulary (small): Analyzing
BPE Train vocabulary (small)
                        time:   [24.407 ms 24.447 ms 24.481 ms]
                        change: [+1.0549% +1.5442% +1.9649%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
slope  [24.407 ms 24.481 ms] R^2            [0.9998512 0.9998621]
mean   [24.336 ms 24.488 ms] std. dev.      [47.237 µs 181.39 µs]
median [24.375 ms 24.509 ms] med. abs. dev. [9.3173 µs 202.16 µs]

Benchmarking BPE Train vocabulary (big)
Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.8s.
Benchmarking BPE Train vocabulary (big): Collecting 10 samples in estimated 7.7637 s (10 iterations)
Benchmarking BPE Train vocabulary (big): Analyzing
BPE Train vocabulary (big)
                        time:   [774.49 ms 794.30 ms 815.15 ms]
                        change: [-0.4729% +2.2009% +4.8075%] (p = 0.14 > 0.05)
                        No change in performance detected.
mean   [774.49 ms 815.15 ms] std. dev.      [21.389 ms 38.322 ms]
median [765.31 ms 832.69 ms] med. abs. dev. [2.5689 ms 52.865 ms]

     Running benches/layout_benchmark.rs (target/release/deps/layout_benchmark-5c3c3bf9f881b17f)
Gnuplot not found, using plotters backend
Benchmarking TemplateProcessing single encode
Benchmarking TemplateProcessing single encode: Warming up for 3.0000 s
Benchmarking TemplateProcessing single encode: Collecting 20 samples in estimated 5.0002 s (5480580 iterations)
Benchmarking TemplateProcessing single encode: Analyzing
TemplateProcessing single encode
                        time:   [609.74 ns 610.92 ns 613.16 ns]
                        change: [-33.129% -31.103% -29.605%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high severe
slope  [609.74 ns 613.16 ns] R^2            [0.9983154 0.9980014]
mean   [611.30 ns 618.26 ns] std. dev.      [2.5256 ns 11.079 ns]
median [609.62 ns 613.20 ns] med. abs. dev. [857.98 ps 5.5359 ns]

Benchmarking TemplateProcessing pair encode
Benchmarking TemplateProcessing pair encode: Warming up for 3.0000 s
Benchmarking TemplateProcessing pair encode: Collecting 20 samples in estimated 5.0000 s (2875110 iterations)
Benchmarking TemplateProcessing pair encode: Analyzing
TemplateProcessing pair encode
                        time:   [1.3108 µs 1.3141 µs 1.3181 µs]
                        change: [-40.953% -38.634% -36.768%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [1.3108 µs 1.3181 µs] R^2            [0.9986700 0.9985652]
mean   [1.3154 µs 1.3266 µs] std. dev.      [9.0906 ns 15.799 ns]
median [1.3126 µs 1.3249 µs] med. abs. dev. [5.7862 ns 20.688 ns]

     Running benches/unigram_benchmark.rs (target/release/deps/unigram_benchmark-b1d455b46edaf1cb)
Gnuplot not found, using plotters backend
Benchmarking Unigram Train vocabulary (small)
Benchmarking Unigram Train vocabulary (small): Warming up for 3.0000 s
Benchmarking Unigram Train vocabulary (small): Collecting 10 samples in estimated 5.0702 s (770 iterations)
Benchmarking Unigram Train vocabulary (small): Analyzing
Unigram Train vocabulary (small)
                        time:   [6.4149 ms 6.4314 ms 6.4432 ms]
                        change: [-2.6275% -1.9549% -1.2429%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
slope  [6.4149 ms 6.4432 ms] R^2            [0.9996525 0.9997142]
mean   [6.4168 ms 6.4718 ms] std. dev.      [20.996 µs 60.882 µs]
median [6.4078 ms 6.4779 ms] med. abs. dev. [6.2033 µs 81.149 µs]

Benchmarking Unigram Train vocabulary (medium)
Benchmarking Unigram Train vocabulary (medium): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.3s.
Benchmarking Unigram Train vocabulary (medium): Collecting 10 samples in estimated 6.3170 s (10 iterations)
Benchmarking Unigram Train vocabulary (medium): Analyzing
Unigram Train vocabulary (medium)
                        time:   [632.07 ms 634.25 ms 636.24 ms]
                        change: [+1.3224% +1.7548% +2.1929%] (p = 0.00 < 0.05)
                        Performance has regressed.
mean   [632.07 ms 636.24 ms] std. dev.      [1.7710 ms 4.3467 ms]
median [631.05 ms 636.90 ms] med. abs. dev. [469.35 µs 5.9250 ms]

I am checking whether the pre_tokenizer takes a lot of time or not, but mostly I am seeing that our merging algorithm is the bottleneck now (apart from this current fix, which gains ~20%). Will dive!
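
For intuition on why merging dominates, here is a toy sketch of a greedy BPE merge loop. It is not the crate's implementation and the helper name is made up: each step re-scans all adjacent pairs for the lowest-ranked merge, so the naive cost grows roughly quadratically with the length of each pre-tokenized chunk, which is also why the merge cache pays off.

```rust
// Toy greedy BPE merge loop, for intuition only (NOT the crate's
// implementation). Each iteration re-scans every adjacent pair to find the
// lowest-ranked merge, so the naive cost is roughly quadratic per chunk.
use std::collections::HashMap;

fn toy_bpe_merge(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from single characters, then repeatedly apply the best merge.
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                ranks
                    .get(&(pair[0].clone(), pair[1].clone()))
                    .map(|&rank| (rank, i))
            })
            .min(); // lowest rank wins; ties broken by position
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                // Replace the pair with its merged symbol and scan again.
                parts.splice(i..i + 2, [merged]);
            }
            None => break,
        }
    }
    parts
}

fn main() {
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    // "lower" -> ["low", "e", "r"]
    println!("{:?}", toy_bpe_merge("lower", &ranks));
}
```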

ArthurZucker · Jun 20 '24 14:06