tiktoken
Is there any documentation for the `encode_with_unstable` API?
https://github.com/openai/tiktoken/blob/095924e02c85617df6889698d94515f91666c7ea/src/lib.rs#L524
Hello, I'm reading the lib.rs code and found the encode_with_unstable API. It doesn't seem to be mentioned in the documentation?
But it takes up a large part of lib.rs, and the code comments don't explain what it does or why.
So maybe some extra explanation?
This is a great question. I have some nice internal documentation explaining what problem this is solving. I'll see if I can make a version of it that doesn't include internal-only details.
Any update on this? I'm working on a PR for this repo and need to make sure I don't break encode_with_unstable. I think I get the main point: if you split text at arbitrary positions, not necessarily aligned with the regex splits, the tokens at the split boundaries might end up different than if the whole string were tokenized as one. But it would help to get some more backstory on the motivation for this and the use cases it's serving.
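To make the boundary effect concrete, here's a minimal sketch of my understanding. It doesn't use tiktoken at all: `VOCAB` and `toy_tokenize` are hypothetical names I made up, and the tokenizer is a simple greedy longest-match stand-in rather than real BPE with regex pre-splitting. It only shows that tokenizing two halves separately can give different tokens near the cut than tokenizing the whole string.

```python
# Toy vocabulary invented for this illustration; NOT tiktoken's real vocabulary.
VOCAB = ["hello", " world", "hel", "lo", "h", "e", "l", "o", " ", "w", "r", "d"]


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenizer (a stand-in for illustration, not real BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens


text = "hello world"

# Tokenize the whole string in one go.
whole = toy_tokenize(text)

# Split at an arbitrary position (inside "hello") and tokenize each piece alone.
left, right = text[:3], text[3:]
pieces = toy_tokenize(left) + toy_tokenize(right)

print(whole)   # ['hello', ' world']
print(pieces)  # ['hel', 'lo', ' world']
```

Both token lists decode back to the same text, but they disagree around the cut. If my reading is right, that's the situation where the last tokens of a prefix can't be trusted until more text arrives.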