Ben Perry comments

Results: 30 comments by Ben Perry

Split by Tokens instead of characters: RecursiveCharacterTextSplitter

```
def length_function(text):
    # Count tokens rather than characters
    return len(tokenizer.encode(text, add_special_tokens=False))

splitter = RecursiveCharacterTextSplitter(length_function=length_function, ...)
```
should do the trick
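To see the effect of a token-based `length_function` without installing anything, here is a minimal sketch. `ToyTokenizer` and `pack_words` are hypothetical stand-ins made up for illustration: the tokenizer just treats each whitespace-separated word as one token, and the packer is a simplified greedy word packer, not the recursive splitter itself.

```python
from typing import Callable, List


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one token per word."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[str]:
        tokens = text.split()
        if add_special_tokens:
            tokens = ["<s>"] + tokens + ["</s>"]
        return tokens


tokenizer = ToyTokenizer()


def length_function(text: str) -> int:
    # Measure length in tokens rather than characters.
    return len(tokenizer.encode(text, add_special_tokens=False))


def pack_words(text: str, chunk_size: int,
               length_function: Callable[[str], int]) -> List[str]:
    """Greedily pack words into chunks whose measured length is <= chunk_size.

    A single word longer than chunk_size is kept whole (this sketch never
    splits inside a word).
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "the quick brown fox jumps over the lazy dog"
by_tokens = pack_words(text, chunk_size=4, length_function=length_function)
by_chars = pack_words(text, chunk_size=4, length_function=len)
print(by_tokens)
print(by_chars)
```

With the same `chunk_size`, the only thing that changes between the two calls is the unit of measurement, which is the point of passing a custom `length_function`.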

Split by Tokens instead of characters: RecursiveCharacterTextSplitter

Also worth noting, the recursive splitter does tend towards over-splitting. A split of exactly chunk length will be split again if it can be (not sure why that was chosen,...
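The boundary case described here comes down to whether a piece of exactly chunk length counts as "fits". A toy illustration, assuming a hypothetical helper `fits` (this is not the library's code, just the two possible comparisons side by side):

```python
def fits(piece: str, chunk_size: int, inclusive: bool) -> bool:
    """Return True if the piece is accepted as-is, False if it would be split again."""
    if inclusive:
        return len(piece) <= chunk_size
    return len(piece) < chunk_size


piece = "abcde fghi"  # exactly 10 characters
exclusive_ok = fits(piece, 10, inclusive=False)  # strict comparison
inclusive_ok = fits(piece, 10, inclusive=True)   # non-strict comparison
print(exclusive_ok, inclusive_ok)
```

Under the strict comparison a piece that is exactly `chunk_size` long is rejected and split further, which matches the over-splitting behaviour noted above.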

Split by Tokens instead of characters: RecursiveCharacterTextSplitter

ah yeah forgot that part

Split by Tokens instead of characters: RecursiveCharacterTextSplitter

@Wolfsauge Did you mean https://github.com/langchain-ai/langchain/pull/5583? Your results are interesting. Are the different documents all the same? Doesn't quite make sense if they are, since I would expect the first and...

Split by Tokens instead of characters: RecursiveCharacterTextSplitter

This wasn't an official feature, just something I put together based on my own observations. And since it wasn't getting any traction from reviewers I moved on to other things....

Split by Tokens instead of characters: RecursiveCharacterTextSplitter

That's just a warning from the huggingface tokenizer. It tokenizes the full text in order to determine where to split it, then splits down to chunk size. You can safely...

Split by Tokens instead of characters: RecursiveCharacterTextSplitter

Less familiar with tiktoken, but looking at the function def it appears to be doing the right thing (note the `_tiktoken_encoder` function that gets passed into `length_function` for the splitter)....

Complete SQLAlchemy inline pep484 typing

Mypy raises errors about `MutableDict.as_mutable(JSONB)` after upgrading to SQLAlchemy 2.x (with the PostgreSQL dialect):
```
Argument 1 to "as_mutable" of "Mutable" has incompatible type "Type[JSONB]"; expected "TypeEngine[Any]" [arg-type]
```
Is this...

Create Kubernetes Annotation for Defining Middlewares

I would really like to have something equivalent to `ingress.kubernetes.io/custom-request-headers` back in v2. I want to add a separate custom header set for each IngressRoute, but it does not seem scalable...

Create Kubernetes Annotation for Defining Middlewares

Did a little testing, looks like Traefik's memory usage grows ~25 MiB per 1000 header middlewares that each add a single header. So that's ~25 KiB per 50 bytes of actual data...
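The arithmetic behind that estimate, as a quick sketch. The 25 MiB and 50 byte figures are the rough numbers from the comment above, not new measurements:

```python
MIB = 1024 * 1024
KIB = 1024

middlewares = 1000
total_growth_bytes = 25 * MIB  # reported memory growth (~25 MiB)
payload_bytes = 50             # rough size of one header's actual data

per_middleware_bytes = total_growth_bytes / middlewares
per_middleware_kib = per_middleware_bytes / KIB
overhead_ratio = per_middleware_bytes / payload_bytes

print(f"{per_middleware_kib:.1f} KiB per middleware, "
      f"about {overhead_ratio:.0f}x the payload")
```

So each single-header middleware costs roughly 500 times the size of the data it carries, which is why one middleware per IngressRoute looks expensive at scale.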