l0rinc
re-ACK 7f3f6c6dc80247e6dfb0d406dc53bc8198f029fd
Thanks a lot for the thorough review, Shantanu. Let me know if you need any help in speeding up the process! :) After merging you may want to update the...
@hauntsaninja, I've rebased this PR, removing the merged commits and adjusting the result a bit based on your previous preferences. Hope it helps. Feel free to push on top of...
I've checked every single JSON file in the MATH and AMPS pretraining datasets; all of them passed this test:
```python
def test_math_roundtrips():
    enc = tiktoken.get_encoding("cl100k_base")
    base_dir = '.../Downloads/amps'
    for ...
```
Fixed it in my recently pushed PRs for `cl100k_base` (see https://github.com/openai/tiktoken/pull/234 and https://github.com/openai/tiktoken/pull/239), and backported the possessive quantifiers to the legacy encodings in https://github.com/openai/tiktoken/pull/258. Cherry-picking the PRs on top...
You could apply the PRs mentioned in https://github.com/openai/tiktoken/issues/245#issuecomment-1937894067 and build a custom tiktoken version that supports big tokens. @hauntsaninja, do you think we could merge some of those PRs instead?
This can be solved by a different rank-based merge algorithm - I've fixed it in https://github.com/knuddelsgmbh/jtokkit (new version not released yet), and might also contribute it back to tiktoken if it's...
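For context, the core of rank-based BPE merging looks roughly like the sketch below (this is an illustration of the general technique, not the actual jtokkit or tiktoken implementation; the `bpe_merge` name and the toy rank table are mine). At each step, the adjacent pair with the lowest rank in the vocabulary is merged first, which is what makes the merge order, and any fix to it, rank-driven:

```python
def bpe_merge(parts, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank.

    parts: list of byte strings making up the input
    ranks: dict mapping a merged byte string to its BPE rank (lower = earlier merge)
    """
    while True:
        best_rank, best_i = None, None
        # Find the adjacent pair whose concatenation has the lowest rank.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:  # no mergeable pair left
            return parts
        # Merge the winning pair in place.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]

# Toy vocabulary: "ab" merges before "abc".
print(bpe_merge([b"a", b"b", b"c"], {b"ab": 0, b"abc": 1}))  # [b'abc']
```

A subtle bug class here is breaking ties or ordering merges by something other than rank, which can produce tokenizations that disagree with the reference encoder; fixing the merge order is what the comment above refers to.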
I pushed a PR here as well to tackle this exact scenario, see: https://github.com/openai/tiktoken/pull/239
I'm currently working on optimizing the tokenizer and the token counter (in the Java implementation at https://github.com/knuddelsgmbh/jtokkit, but most of the tricks should be applicable to other implementations as well)...
Nice, let me know how I can help!