Roland Haller comments

Results: 14 comments of Roland Haller

bug: EmbeddingBuilder::build() Exceeds OpenAI Token Limit Due to Lack of Token-Based Chunking

After some quick tests on [Tiktoken](https://platform.openai.com/tokenizer), I get the following for OpenAI:

1. Alphabetic languages (English, French, Russian…): token count ≈ text length × ~24%
2. Japanese and Chinese: Text...
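The ratio quoted above could be turned into a rough pre-check before sending text to the embedding endpoint. The following is only a sketch: it assumes "Text" means the length in characters, and `estimate_tokens` and its `tokens_per_100_chars` parameter are hypothetical names, not part of rig or Tiktoken. The ratio for Japanese and Chinese is cut off in the excerpt, so it is left to the caller.

```rust
/// Rough token estimate from text length, using integer arithmetic.
///
/// ASSUMPTION: text length is measured in characters (Unicode scalar values).
/// `tokens_per_100_chars` is a hypothetical parameter; 24 corresponds to the
/// "~24%" figure quoted for alphabetic languages.
pub fn estimate_tokens(text: &str, tokens_per_100_chars: usize) -> usize {
    let chars = text.chars().count();
    // Round up so the estimate errs on the side of over-counting.
    (chars * tokens_per_100_chars + 99) / 100
}
```

Rounding up keeps the estimate conservative, which matters when the goal is to stay under a hard token limit.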

bug: EmbeddingBuilder::build() Exceeds OpenAI Token Limit Due to Lack of Token-Based Chunking

To reduce the guesswork and limit the performance hit, we could process the first chunk of each document through Tiktoken and use that result as an estimate that would...
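One way to read this idea, as a sketch only: tokenize the first chunk exactly, derive a tokens-per-character ratio from it, and apply that ratio to the remaining chunks. The function name `estimate_from_first_chunk` and the `count_tokens` closure are hypothetical; the closure stands in for a real tokenizer call and is not an actual Tiktoken or rig API.

```rust
/// Estimate per-chunk token counts by tokenizing only the first chunk.
///
/// ASSUMPTION: `count_tokens` is a stand-in for a real tokenizer; chunk
/// length is measured in characters. Both are illustrative choices.
pub fn estimate_from_first_chunk<F>(chunks: &[&str], count_tokens: F) -> Vec<usize>
where
    F: Fn(&str) -> usize,
{
    let first = match chunks.first() {
        Some(first) => *first,
        None => return Vec::new(),
    };
    let first_chars = first.chars().count();
    if first_chars == 0 {
        // No usable sample: fall back to tokenizing every chunk.
        return chunks.iter().map(|c| count_tokens(c)).collect();
    }
    let first_tokens = count_tokens(first);

    chunks
        .iter()
        .enumerate()
        .map(|(i, chunk)| {
            if i == 0 {
                // The first chunk was measured exactly.
                first_tokens
            } else {
                // Scale by the measured ratio, rounding up.
                let chars = chunk.chars().count();
                (chars * first_tokens + first_chars - 1) / first_chars
            }
        })
        .collect()
}
```

This costs one tokenizer call per document, at the price of being wrong for documents whose language mix changes after the first chunk.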

bug: EmbeddingBuilder::build() Exceeds OpenAI Token Limit Due to Lack of Token-Based Chunking

Would it be a good idea to have a *split_and_retry* on the [build method](https://github.com/0xPlaygrounds/rig/blob/cd7e7097d393bf46ae8d2bcfc796e3222250d365/rig-core/src/embeddings/builder.rs#L111)? I tried the following (with OpenAI) and it works rather well. That would allow for a...
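The code from the original comment is not included in this excerpt. As an illustration of the general split-and-retry idea only, the sketch below tries a whole batch, and on failure splits it in half and retries each half. The `embed` closure is a hypothetical stand-in for a provider call, not rig's API, and this version retries on any error, whereas a real implementation would presumably retry only on token-limit errors.

```rust
/// Try to embed `batch`; if the call fails, split the batch in half and
/// retry each half recursively. Results keep the input order.
///
/// ASSUMPTION: `embed` is a hypothetical stand-in for an embedding request.
pub fn split_and_retry<T, E, F>(batch: &[String], embed: &mut F) -> Result<Vec<T>, E>
where
    F: FnMut(&[String]) -> Result<Vec<T>, E>,
{
    match (*embed)(batch) {
        Ok(out) => Ok(out),
        // A batch of one item (or none) cannot be split further.
        Err(e) if batch.len() <= 1 => Err(e),
        Err(_) => {
            let (left, right) = batch.split_at(batch.len() / 2);
            let mut out = split_and_retry(left, &mut *embed)?;
            out.extend(split_and_retry(right, &mut *embed)?);
            Ok(out)
        }
    }
}
```

A single document that is too large on its own still fails, so this complements chunking rather than replacing it.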

[FEATURE REQUEST] Support using patterns for jump

Seems open to me.