
GPT3Tokenizer has a static, unbounded cache

Open stephentoub opened this issue 2 years ago • 2 comments

Describe the bug

It appears that the cache at https://github.com/microsoft/semantic-kernel/blob/77259db3e352db64be4b5f08f5c11a391a101e60/dotnet/src/Connectors/Connectors.AI.OpenAI/Tokenizers/GPT3Tokenizer.cs#L20 is added to every time a new token is discovered, and nothing is ever purged. Because the cache is static, it can grow without bound over the lifetime of a process using GPT3Tokenizer.
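For context, the shape of the problem is the familiar "memoize into a static dictionary, never evict" pattern. A minimal sketch of that shape follows; the type, field, and method names are illustrative, not the actual GPT3Tokenizer source.

using System.Collections.Concurrent;

// Illustrative sketch only, not the real GPT3Tokenizer code: a static
// dictionary used as a memoization cache with no eviction.
internal static class StaticBpeCacheSketch
{
    // Static and process-wide: an entry is added for every distinct token
    // encountered and nothing is ever removed.
    private static readonly ConcurrentDictionary<string, string> s_cache = new();

    public static string Encode(string token) =>
        s_cache.GetOrAdd(token, static t => ComputeBytePairEncoding(t));

    // Placeholder for the real BPE merge logic.
    private static string ComputeBytePairEncoding(string token) => token;
}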

To Reproduce

Try running this on .NET 8:

using Microsoft.SemanticKernel.Connectors.AI.OpenAI.Tokenizers;
using System.Security.Cryptography;

// Background task: every second, force a GC and report the managed heap size.
_ = Task.Run(() =>
{
    while (true)
    {
        Thread.Sleep(1000);
        GC.Collect();
        GCMemoryInfo m = GC.GetGCMemoryInfo();
        Console.WriteLine($"Heap size (bytes): {m.HeapSizeBytes:N0}");
    }
});

// Encode an endless stream of random short strings; each distinct string adds
// entries to the static cache, and nothing is ever evicted.
while (true)
{
    string s = RandomNumberGenerator.GetString("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789", Random.Shared.Next(3, 10));
    GPT3Tokenizer.Encode(s);
}

It yields results like this:

Heap size (bytes): 30,185,320
Heap size (bytes): 51,785,912
Heap size (bytes): 66,197,432
Heap size (bytes): 82,965,960
Heap size (bytes): 106,357,800
Heap size (bytes): 123,904,072
Heap size (bytes): 137,677,328
Heap size (bytes): 148,796,904
Heap size (bytes): 165,457,976
Heap size (bytes): 175,242,392
Heap size (bytes): 222,859,336
Heap size (bytes): 201,526,368
Heap size (bytes): 215,845,504
Heap size (bytes): 230,380,032
Heap size (bytes): 242,710,736
Heap size (bytes): 257,402,096
Heap size (bytes): 271,899,808
Heap size (bytes): 284,218,336
Heap size (bytes): 298,299,408
Heap size (bytes): 312,281,368
Heap size (bytes): 326,492,640
Heap size (bytes): 342,140,560
Heap size (bytes): 351,787,344
Heap size (bytes): 362,963,376
Heap size (bytes): 468,959,808
Heap size (bytes): 400,798,336
Heap size (bytes): 416,996,472
Heap size (bytes): 426,833,760
Heap size (bytes): 442,195,872
Heap size (bytes): 455,267,496
Heap size (bytes): 466,229,968
Heap size (bytes): 478,892,096
Heap size (bytes): 496,607,360
Heap size (bytes): 508,266,064
Heap size (bytes): 517,486,768
Heap size (bytes): 533,386,128
Heap size (bytes): 546,827,232
Heap size (bytes): 559,574,720
Heap size (bytes): 573,761,096
Heap size (bytes): 582,471,888
Heap size (bytes): 598,591,408
Heap size (bytes): 605,494,024
Heap size (bytes): 618,311,400
Heap size (bytes): 629,957,848
Heap size (bytes): 647,155,912
Heap size (bytes): 655,306,240
Heap size (bytes): 672,447,504
Heap size (bytes): 678,559,088
... and on and on

with the heap growing unbounded because that cache grows unbounded.

Expected behavior

A different design in which the cache isn't static. For example, the tokenizer could become an instance type with the cache as an instance field, so the owner of the tokenizer instance decides when to throw it away. Alternatively, the cache could have a size limit and evict entries once it grows too large.
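As a rough sketch of the instance-based option, assuming a hypothetical tokenizer class (illustrative names, not a proposal for the actual API surface):

using System.Collections.Concurrent;

// Hypothetical instance-based variant: the cache is an instance field, so the
// owner of the tokenizer controls its lifetime and can discard the cache by
// discarding the instance.
public sealed class InstanceTokenizerSketch
{
    private readonly ConcurrentDictionary<string, string> _cache = new();

    public string Encode(string token) =>
        _cache.GetOrAdd(token, static t => ComputeBytePairEncoding(t));

    // Placeholder for the real BPE merge logic.
    private static string ComputeBytePairEncoding(string token) => token;
}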

cc: @dluc, @shawncal

stephentoub commented Apr 21 '23 16:04

thanks, I'll look into this. By the way, we're also working on a new tokenizer, bringing in tiktoken; we'd love to have this kind of contribution when we're ready :-)

dluc commented Apr 22 '23 04:04

options:

  • P1: a max cache size, hardcoded or configurable (a rough sketch follows this list)
  • P2: allow the cache to be thrown away, e.g. keep it on the tokenizer instance
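To illustrate P1, a minimal sketch of a size-capped cache; the names, the default limit, and the clear-everything eviction policy are assumptions for illustration, not a committed design:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch of P1: a cache with a hardcoded-or-configurable entry
// limit. The simplest policy is to clear everything once the limit is hit;
// an LRU eviction policy would be a more refined alternative.
internal sealed class BoundedBpeCacheSketch
{
    private readonly int _maxEntries;
    private readonly ConcurrentDictionary<string, string> _cache = new();

    public BoundedBpeCacheSketch(int maxEntries = 100_000) => _maxEntries = maxEntries;

    public string GetOrAdd(string key, Func<string, string> valueFactory)
    {
        if (_cache.TryGetValue(key, out var cached))
        {
            return cached;
        }

        // Crude eviction: drop the whole cache when it reaches the limit so
        // memory stays bounded.
        if (_cache.Count >= _maxEntries)
        {
            _cache.Clear();
        }

        return _cache.GetOrAdd(key, valueFactory);
    }
}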

dluc commented Apr 26 '23 16:04