Ben Perry

30 comments by Ben Perry

```
def length_function(text):
    return len(tokenizer.encode(text, add_special_tokens=False))

splitter = RecursiveCharacterTextSplitter(length_function=length_function, ...)
```

should do the trick.
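For anyone who wants to see what swapping in a token-based `length_function` changes, here is a minimal self-contained sketch. `ToyTokenizer` and `pack_words` are hypothetical stand-ins written for this illustration (they are not the Hugging Face tokenizer or the langchain splitter); only the `length_function` wrapper mirrors the snippet above.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one token per word or punctuation mark."""

    def __init__(self):
        self.vocab = {}

    def encode(self, text, add_special_tokens=False):
        pieces = re.findall(r"\w+|[^\w\s]", text)
        ids = [self.vocab.setdefault(p, len(self.vocab)) for p in pieces]
        # Special tokens would inflate the count, hence add_special_tokens=False below.
        return [-1] + ids + [-2] if add_special_tokens else ids


tokenizer = ToyTokenizer()


def length_function(text):
    # Measure length in tokens rather than characters.
    return len(tokenizer.encode(text, add_special_tokens=False))


def pack_words(text, chunk_size, length_function=len):
    """Illustrative greedy packer (NOT langchain's algorithm): add words to a
    chunk until length_function says the next word would exceed chunk_size.
    A single word longer than chunk_size is kept whole."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Hello, world! Token counts and character counts differ."
by_chars = pack_words(text, chunk_size=20)  # limit measured in characters
by_tokens = pack_words(text, chunk_size=5, length_function=length_function)  # limit measured in tokens
```

The same `chunk_size` parameter means something different depending on the length function, so the limit has to be chosen in the unit the function reports.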

Also worth noting, the recursive splitter does tend towards over-splitting. A split of exactly chunk length will be split again if it can be (not sure why that was chosen,...

@Wolfsauge Did you mean https://github.com/langchain-ai/langchain/pull/5583? Your results are interesting. Are the different documents all the same? Doesn't quite make sense if they are, since I would expect the first and...

This wasn't an official feature, just something I put together based on my own observations. And since it wasn't getting any traction from reviewers I moved on to other things....

That's just a warning from the huggingface tokenizer. It tokenizes the full text in order to determine where to split it, then splits down to chunk size. You can safely...
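A rough sketch of the "tokenize everything, then cut down to chunk size" idea, assuming a whitespace split as a stand-in for the real tokenizer (`split_by_tokens` is a hypothetical helper, not the library's implementation). The full text is encoded in one go, so it can be longer than any single chunk, which is presumably what the warning is reacting to.

```python
def split_by_tokens(text, chunk_size):
    """Tokenize the whole text once, then slice the token list into
    windows of at most chunk_size tokens and join each window back into text."""
    tokens = text.split()  # stand-in for tokenizer.encode(text)
    return [
        " ".join(tokens[i:i + chunk_size])  # stand-in for tokenizer.decode(...)
        for i in range(0, len(tokens), chunk_size)
    ]


chunks = split_by_tokens("a b c d e f g", chunk_size=3)
```

Every chunk that comes out is within the limit even though the initial tokenization pass saw the full text.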

Less familiar with tiktoken, but looking at the function def it appears to be doing the right thing (note the `_tiktoken_encoder` function that gets passed into `length_function` for the splitter)....

Mypy raises errors about `MutableDict.as_mutable(JSONB)` after upgrading to SQLAlchemy 2.x (with the PostgreSQL dialect):

```
Argument 1 to "as_mutable" of "Mutable" has incompatible type "Type[JSONB]"; expected "TypeEngine[Any]"  [arg-type]
```

Is this...

I would really like to have something equivalent to `ingress.kubernetes.io/custom-request-headers` back in v2. Want to add a custom header set separately for each IngressRoute, but it does not seem scalable...

Did a little testing, looks like traefik's memory usage grows ~25MiB per 1000 header middlewares that each add a single header. So that's 25KiB per 50 bytes of actual data...
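Spelling out the arithmetic behind those numbers, as a quick sketch. The 25 MiB, 1000 middlewares and 50 bytes come from the measurement above; the variable names are just for illustration, and the 50-byte payload is the rough estimate from the comment, not a measured value.

```python
KIB = 1024
MIB = 1024 * KIB

growth_bytes = 25 * MIB   # observed memory growth
middlewares = 1000        # number of header middlewares added
payload_bytes = 50        # approximate size of one header (estimate)

per_middleware_bytes = growth_bytes / middlewares
per_middleware_kib = per_middleware_bytes / KIB        # memory cost per middleware, in KiB
overhead_factor = per_middleware_bytes / payload_bytes  # bytes of memory per byte of header data

print(f"{per_middleware_kib:.1f} KiB per middleware, ~{overhead_factor:.0f}x overhead")
```

So each middleware costs about 25.6 KiB, which is roughly 500 times the size of the header it carries.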