Is PHP cursed to be much slower?
Hey, thanks for porting this over! I wanted to move to PHP to remove an extra dependency (a Docker server exposing Python TikToken over an API). I ran a small benchmark, and the PHP version appears to be much slower.
Source for Docker service: https://github.com/flexchar/tiktoken-counter
I use Laravel. I wrote a simple command that tokenizes a 100-paragraph text 1000 times.
Median output is around:
Docker time: 4.5049350261688 seconds
PHP time: 20.138854026794 seconds
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\Http;

class BenchmarkTikToken extends Command
{
    /**
     * The name and signature of the console command.
     *
     * @var string
     */
    protected $signature = 'app:benchmark-tik-token';

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = 'Benchmark PHP version of TikToken vs. Python using Docker image';

    // Store initialized tokenizer
    public \Yethee\Tiktoken\Encoder $encoder;

    /**
     * Execute the console command.
     */
    public function handle(): void
    {
        $this->warn('Make sure to `composer require yethee/tiktoken`.');

        $timesToIterate = 1000;

        $text = Http::get(
            'https://baconipsum.com/api/?type=meat-and-filler&paras=100&format=text',
        )
            ->throw()
            ->body();

        // Warm up the functions
        $provider = app(\Yethee\Tiktoken\EncoderProvider::class);
        $this->encoder = $provider->getForModel('gpt-4');
        $this->countTokens('hello world');
        $this->countTokensPhp('hello world');

        // Benchmark the functions
        $countTokensTime = $this->benchmark(function () use ($text, $timesToIterate) {
            foreach (range(1, $timesToIterate) as $_iteration) {
                $this->countTokens($text);
            }
        });

        $countTokensPhpTime = $this->benchmark(function () use ($text, $timesToIterate) {
            foreach (range(1, $timesToIterate) as $_iteration) {
                $this->countTokensPhp($text);
            }
        });

        // Print the results
        $this->line("Docker time: {$countTokensTime} seconds");
        $this->line("PHP time: {$countTokensPhpTime} seconds");
    }

    private function benchmark(callable $function): float
    {
        $start = microtime(true);
        $function();
        $end = microtime(true);

        return $end - $start;
    }

    public function countTokensPhp(string $text): int
    {
        $tokens = $this->encoder->encode($text);

        return count($tokens);
    }

    public function countTokens(string $text): int
    {
        $tokens = Http::post('tiktoken:8000/count', [
            'text' => $text,
        ])
            ->throw()
            ->json('tokens');

        return (int) ceil($tokens * 1.05);
    }
}
Running PHP 8.2.10 on Docker on M2.
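Since the figures above are a median over several manual runs, here is a minimal sketch of how that could be automated. `medianSeconds` is a hypothetical helper name, not part of Laravel or yethee/tiktoken; it uses `hrtime()` rather than `microtime()` because it is monotonic.

```php
<?php
// Hypothetical helper: run $fn $runs times and return the median
// wall-clock duration of a single run, in seconds.
function medianSeconds(callable $fn, int $runs = 5): float
{
    if ($runs < 1) {
        throw new InvalidArgumentException('At least one run is required.');
    }

    $samples = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);                       // monotonic clock, nanoseconds
        $fn();
        $samples[] = (hrtime(true) - $start) / 1e9;  // convert to seconds
    }

    sort($samples);
    $mid = intdiv(count($samples), 2);

    // Odd count: the middle sample. Even count: mean of the two middle samples.
    return count($samples) % 2 === 1
        ? $samples[$mid]
        : ($samples[$mid - 1] + $samples[$mid]) / 2;
}
```

In the command above, each of the two loops could be passed to it as a closure, so one invocation reports a median instead of a single measurement.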
Hi!
Thanks for the report. I can confirm that the PHP implementation is slower than the original library.
In the tiktoken library, the core logic is written in Rust, so I don't think PHP can reach comparable performance.
$ phpbench run src/EncodeBench.php --report=aggregate --php-config='{"zend.assertions":-1}'
PHPBench (1.2.14) running benchmarks...
with configuration file: /var/bench/phpbench.json
with PHP version 8.1.24, xdebug ❌, opcache ❌
\Benchmark\EncodeBench
benchPHPImplementation..................I4 - Mo22.591ms (±1.66%)
benchRPCCounter.........................I4 - Mo4.339ms (±2.95%)
Subjects: 2, Assertions: 0, Failures: 0, Errors: 0
+-------------+------------------------+-----+------+-----+----------+----------+--------+
| benchmark | subject | set | revs | its | mem_peak | mode | rstdev |
+-------------+------------------------+-----+------+-----+----------+----------+--------+
| EncodeBench | benchPHPImplementation | | 100 | 5 | 18.606mb | 22.591ms | ±1.66% |
| EncodeBench | benchRPCCounter | | 100 | 5 | 18.606mb | 4.339ms | ±2.95% |
+-------------+------------------------+-----+------+-----+----------+----------+--------+
We need to further investigate the issue to understand whether optimization is possible.
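For comparison, a small sketch that turns both sets of measurements into a slowdown factor. `slowdown` is a hypothetical helper name; the inputs are the figures quoted in this thread (totals in seconds from the Laravel command, mode in milliseconds from phpbench). Both runs work out to roughly 4.5x and 5.2x respectively, so they agree on the size of the gap.

```php
<?php
// Hypothetical helper: how many times slower $slow is than $fast.
// Both arguments must use the same unit.
function slowdown(float $slow, float $fast): float
{
    if ($fast <= 0.0) {
        throw new InvalidArgumentException('Baseline time must be positive.');
    }

    return $slow / $fast;
}

// Laravel command from the first post: total seconds for 1000 calls.
printf("Laravel command: %.2fx\n", slowdown(20.138854026794, 4.5049350261688));

// phpbench run above: mode time per call, in milliseconds.
printf("phpbench:        %.2fx\n", slowdown(22.591, 4.339));
```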
Benchmark code
<?php
// src/EncodeBench.php

namespace Benchmark;

use PhpBench\Attributes as Bench;
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Contracts\HttpClient\HttpClientInterface;
use Yethee\Tiktoken\Encoder;
use Yethee\Tiktoken\EncoderProvider;

final class EncodeBench
{
    private HttpClientInterface $httpClient;
    private Encoder $encoder;
    private string $text;

    public function __construct()
    {
        $provider = new EncoderProvider();
        $httpClient = HttpClient::create();

        $this->encoder = $provider->get('cl100k_base');
        $this->httpClient = $httpClient;
        $this->text = $httpClient
            ->request('GET', 'https://baconipsum.com/api/?type=meat-and-filler&paras=100&format=text')
            ->getContent();
    }

    #[Bench\Iterations(5)]
    #[Bench\Revs(100)]
    #[Bench\Warmup(1)]
    public function benchPHPImplementation(): void
    {
        count($this->encoder->encode($this->text));
    }

    #[Bench\Iterations(5)]
    #[Bench\Revs(100)]
    #[Bench\Warmup(1)]
    public function benchRPCCounter(): void
    {
        $this->httpClient
            ->request('POST', 'http://tiktoken-counter:8000/count', [
                'json' => [
                    'text' => $this->text,
                    'encoding' => 'cl100k_base',
                ]
            ])
            ->toArray()['tokens'];
    }
}
version: "3.7"
services:
  bench:
    build:
      dockerfile: docker/Dockerfile
    depends_on:
      - tiktoken-counter
    working_dir: "/var/bench"
    volumes:
      - ".:/var/bench"
  tiktoken-counter:
    image: ghcr.io/flexchar/tiktoken-counter
    expose:
      - "8000"
In that case, something like PHP FFI with a native implementation in C++ could be a fairer comparison! I see there are several implementations, for example https://github.com/sewenew/tokenizer
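To make the FFI idea concrete, here is a minimal sketch of the call mechanism only. It binds libc's `strlen()` as a stand-in, because the C API of a native tokenizer is not specified in this thread; `nativeStrlen` is a hypothetical name, and the `libc.so.6` library name assumes Linux with glibc. Note that PHP FFI parses C declarations, so a C++ library would need to expose a C-compatible (`extern "C"`) interface.

```php
<?php
// Stand-in sketch: calls libc's strlen() through FFI to show how a native
// function is declared and invoked. A real tokenizer binding would declare
// that library's own C functions here instead (not shown).
function nativeStrlen(string $text): ?int
{
    if (!extension_loaded('ffi')) {
        return null; // ext-ffi is not available in this PHP build
    }

    try {
        // Assumption: Linux with glibc; the library name differs elsewhere.
        $libc = FFI::cdef('size_t strlen(const char *s);', 'libc.so.6');

        return (int) $libc->strlen($text);
    } catch (\Throwable $e) {
        return null; // library not found, or FFI disabled via ffi.enable
    }
}
```

Whether this would beat the pure PHP encoder depends on the per-call FFI overhead and on how the token array is passed back, so it would need its own benchmark.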
I updated the implementation, which made converting text into tokens about 2 times faster for your example. See #10 for details.
That is extraordinary work! I had also thought it might be possible to call a C++ implementation of tiktoken through PHP FFI. While I understand the general idea, that is sadly far beyond my skill set. ✌️