l0rinc
re-ACK 7f3f6c6dc80247e6dfb0d406dc53bc8198f029fd
Thanks a lot for the thorough review, Shantanu. Let me know if you need any help in speeding up the process! :) After merging you may want to update the...
@hauntsaninja, I've rebased this PR, removing the merged commits and adjusting the result a bit based on your previous preferences. Hope it helps. Feel free to push on top of...
I've checked every single JSON file in the MATH and AMPS pretraining datasets; all of them passed this test:
```python
def test_math_roundtrips():
    enc = tiktoken.get_encoding("cl100k_base")
    base_dir = '.../Downloads/amps'
    for ...
```
Fixed it in my recently pushed PRs for `cl100k_base` (see https://github.com/openai/tiktoken/pull/234 and https://github.com/openai/tiktoken/pull/239), and backported the possessive quantifiers to the legacy encodings in https://github.com/openai/tiktoken/pull/258. Cherry-picking the PRs on top...
You could apply the PRs mentioned in https://github.com/openai/tiktoken/issues/245#issuecomment-1937894067 and build a custom tiktoken version that supports big tokens. @hauntsaninja, do you think we could merge some of those PRs instead?
This can be solved by a different rank-based merge algorithm - I've fixed it in https://github.com/knuddelsgmbh/jtokkit (new version not released yet), and might also contribute it back to tiktoken if it's...
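For context, the core of rank-based BPE merging looks roughly like the sketch below (this is an illustration of the general technique, not the actual jtokkit or tiktoken implementation; the `bpe_merge` name and the toy rank table are mine). At each step, the adjacent pair with the lowest rank in the vocabulary is merged first, which is what makes the merge order, and any fix to it, rank-driven:

```python
def bpe_merge(parts, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank.

    parts: list of byte strings making up the input
    ranks: dict mapping a merged byte string to its BPE rank (lower = earlier merge)
    """
    while True:
        best_rank, best_i = None, None
        # Find the adjacent pair whose concatenation has the lowest rank.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:  # no mergeable pair left
            return parts
        # Merge the winning pair in place.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]

# Toy vocabulary: "ab" merges before "abc".
print(bpe_merge([b"a", b"b", b"c"], {b"ab": 0, b"abc": 1}))  # [b'abc']
```

A subtle bug class here is breaking ties or ordering merges by something other than rank, which can produce tokenizations that disagree with the reference encoder; fixing the merge order is what the comment above refers to.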
I pushed a PR here as well to tackle this exact scenario, see: https://github.com/openai/tiktoken/pull/239
I'm currently working on optimizing the tokenizer and the token counter (in the Java implementation at https://github.com/knuddelsgmbh/jtokkit, but most of the tricks should be applicable to other implementations as well)...
Nice, let me know how I can help!