tiktoken-php Is PHP cursed to be much slower?

Hey, thanks for porting this over! I wanted to move to PHP to remove an extra dependency (a Docker service exposing Python tiktoken over an API). I ran a small benchmark, and the PHP version seems to be much slower.

Source for Docker service: https://github.com/flexchar/tiktoken-counter

I use Laravel. I wrote a simple command that tokenizes a 100-paragraph text 1000 times.

Median output is around:

Docker time: 4.5049350261688 seconds
PHP time: 20.138854026794 seconds

<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\Http;

class BenchmarkTikToken extends Command
{
    /**
     * The name and signature of the console command.
     *
     * @var string
     */
    protected $signature = 'app:benchmark-tik-token';

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = 'Benchmark PHP version of TikToken vs. Python using Docker image';

    // Store initialized tokenizer
    public \Yethee\Tiktoken\Encoder $encoder;

    /**
     * Execute the console command.
     */
    public function handle(): void
    {
        $this->warn('Make sure to `composer require yethee/tiktoken`.');

        $timesToIterate = 1000;
        $text = Http::get(
            'https://baconipsum.com/api/?type=meat-and-filler&paras=100&format=text',
        )
            ->throw()
            ->body();

        // Warm up the functions
        $provider = app(\Yethee\Tiktoken\EncoderProvider::class);
        $this->encoder = $provider->getForModel('gpt-4');
        $this->countTokens('hello world');
        $this->countTokensPhp('hello world');

        // Benchmark the functions
        $countTokensTime = $this->benchmark(function () use ($text, $timesToIterate) {
            foreach (range(1, $timesToIterate) as $_iteration) {
                $this->countTokens($text);
            }
        });

        $countTokensPhpTime = $this->benchmark(function () use ($text, $timesToIterate) {
            foreach (range(1, $timesToIterate) as $_iteration) {
                $this->countTokensPhp($text);
            }
        });

        // Print the results
        $this->line("Docker time: {$countTokensTime} seconds");
        $this->line("PHP time: {$countTokensPhpTime} seconds");
    }

    private function benchmark(callable $function): float
    {
        $start = microtime(true);
        $function();
        $end = microtime(true);

        return $end - $start;
    }

    public function countTokensPhp(string $text): int
    {
        $tokens = $this->encoder->encode($text);

        return count($tokens);
    }

    public function countTokens(string $text): int
    {
        $tokens = Http::post('tiktoken:8000/count', [
            'text' => $text,
        ])
            ->throw()
            ->json('tokens');

        return (int) ceil($tokens * 1.05);
    }
}

Running PHP 8.2.10 in Docker on an Apple M2.
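
Since the numbers above are quoted as a median, here is a minimal sketch of how repeated runs could be timed and reduced to a median. The helper names `medianRuntime()` and `median()` are hypothetical and not part of Laravel or yethee/tiktoken; `hrtime()` is used in place of `microtime()` because it reads a monotonic clock.

```php
<?php

declare(strict_types=1);

/**
 * Run $fn $runs times and return the median wall-clock duration in seconds.
 * Hypothetical helper for illustration only.
 */
function medianRuntime(callable $fn, int $runs = 5): float
{
    $durations = [];

    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true); // monotonic clock, in nanoseconds
        $fn();
        $durations[] = (hrtime(true) - $start) / 1e9; // convert to seconds
    }

    return median($durations);
}

/**
 * Median of a non-empty list of numbers.
 */
function median(array $values): float
{
    sort($values);
    $n = count($values);
    $mid = intdiv($n, 2);

    // Odd count: the middle element. Even count: mean of the two middle elements.
    return $n % 2 === 1
        ? (float) $values[$mid]
        : (float) (($values[$mid - 1] + $values[$mid]) / 2);
}
```

Inside the command above, this could be called as `medianRuntime(fn () => $this->countTokensPhp($text))`, with the 1000-iteration loop moved into the closure if the totals should stay comparable.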

Oct 12 '23 11:10 flexchar

Hi!

Thanks for the report. I can confirm that the PHP implementation is less performant than the original library.

In the tiktoken library, the core logic is written in Rust. I don't think PHP can deliver comparable performance.

$ phpbench run src/EncodeBench.php --report=aggregate --php-config='{"zend.assertions":-1}'
PHPBench (1.2.14) running benchmarks...
with configuration file: /var/bench/phpbench.json
with PHP version 8.1.24, xdebug ❌, opcache ❌

\Benchmark\EncodeBench

    benchPHPImplementation..................I4 - Mo22.591ms (±1.66%)
    benchRPCCounter.........................I4 - Mo4.339ms (±2.95%)

Subjects: 2, Assertions: 0, Failures: 0, Errors: 0
+-------------+------------------------+-----+------+-----+----------+----------+--------+
| benchmark   | subject                | set | revs | its | mem_peak | mode     | rstdev |
+-------------+------------------------+-----+------+-----+----------+----------+--------+
| EncodeBench | benchPHPImplementation |     | 100  | 5   | 18.606mb | 22.591ms | ±1.66% |
| EncodeBench | benchRPCCounter        |     | 100  | 5   | 18.606mb | 4.339ms  | ±2.95% |
+-------------+------------------------+-----+------+-----+----------+----------+--------+

We need to further investigate the issue to understand whether optimization is possible.

Benchmark code

<?php

// src/EncodeBench.php

namespace Benchmark;

use PhpBench\Attributes as Bench;
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Contracts\HttpClient\HttpClientInterface;
use Yethee\Tiktoken\Encoder;
use Yethee\Tiktoken\EncoderProvider;

final class EncodeBench
{
    private HttpClientInterface $httpClient;
    private Encoder $encoder;
    private string $text;

    public function __construct()
    {
        $provider = new EncoderProvider();
        $httpClient = HttpClient::create();

        $this->encoder = $provider->get('cl100k_base');
        $this->httpClient = $httpClient;

        $this->text = $httpClient
            ->request('GET', 'https://baconipsum.com/api/?type=meat-and-filler&paras=100&format=text')
            ->getContent();
    }

    #[Bench\Iterations(5)]
    #[Bench\Revs(100)]
    #[Bench\Warmup(1)]
    public function benchPHPImplementation(): void
    {
        count($this->encoder->encode($this->text));
    }

    #[Bench\Iterations(5)]
    #[Bench\Revs(100)]
    #[Bench\Warmup(1)]
    public function benchRPCCounter(): void
    {
        $this->httpClient
            ->request('POST', 'http://tiktoken-counter:8000/count', [
                'json' => [
                    'text' => $this->text,
                    'encoding' => 'cl100k_base',
                ]
            ])
            ->toArray()['tokens'];
    }
}

version: "3.7"
services:
  bench:
    build:
      dockerfile: docker/Dockerfile
    depends_on:
      - tiktoken-counter
    working_dir: "/var/bench"
    volumes:
      - ".:/var/bench"

  tiktoken-counter:
    image: ghcr.io/flexchar/tiktoken-counter
    expose:
      - "8000"

Oct 12 '23 16:10 yethee

In that case, something like PHP FFI with a native C++ implementation could be a fairer comparison! I see there are several implementations, for example https://github.com/sewenew/tokenizer
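
To make the FFI idea concrete, here is a rough sketch of the calling pattern only. It uses libc's `strlen()` as a stand-in, because the C interface of any real tokenizer library is not known here. The function name `nativeStrlen()` and the library file name `libc.so.6` are assumptions (typical for glibc-based Linux). PHP FFI loads C declarations, so a C++ tokenizer would most likely need an `extern "C"` wrapper before it could be used this way.

```php
<?php

declare(strict_types=1);

/**
 * Call libc's strlen() through PHP FFI as a stand-in for a native tokenizer.
 *
 * ASSUMPTIONS: ext-ffi is installed and enabled, and "libc.so.6" exists
 * (glibc-based Linux). Returns null when FFI or the library is unavailable.
 * Hypothetical helper for illustration only.
 */
function nativeStrlen(string $text): ?int
{
    static $libc = null;

    if (!extension_loaded('ffi')) {
        return null; // this PHP build has no FFI support
    }

    try {
        // Declare the C signature once and bind it to the shared library.
        $libc ??= FFI::cdef('size_t strlen(const char *s);', 'libc.so.6');

        // PHP strings can be passed directly where C expects const char *.
        return (int) $libc->strlen($text);
    } catch (\Throwable $e) {
        return null; // library not found, or FFI restricted by ffi.enable
    }
}
```

A tokenizer binding would follow the same shape: one `FFI::cdef()` call with the library's real C header, kept in a static variable so the library is loaded once rather than on every call.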

Oct 12 '23 17:10 flexchar

I updated the implementation, which made converting text into tokens about 2 times faster for your example. See #10 for details.

Apr 30 '24 14:04 yethee

That is extraordinary work! I also had a thought that it might be possible to call a C++ implementation of tiktoken through PHP FFI. While I understand the general idea, that is sadly far beyond my skill set. ✌️

May 01 '24 10:05 flexchar