Umar Butler
Hi @vrdn-23, Apologies for not getting to this issue earlier; I had not seen it until now. I'll have a think about how I might implement this. Would you mind...
@jcobol Your solution seems to cause chunks to exceed their original chunk size (which was 2). But I imagine that those wanting overlap also want to impose a fixed limit...
> Yes, I meant for the small chunk size to already represent the reduced chunk size to account for overlaps. Ideally, this should be done internally to the library, as you...
@jcobol Would the below implementation work for you? ```python import nltk import semchunk nltk.download('gutenberg') gutenberg = nltk.corpus.gutenberg def overlap(chunks: list[str]) -> list[str]: n_chunks = len(chunks) match n_chunks: # If there...
@jcobol @vrdn-23 @kushal-agrawal-relativity @benbrandt sorry it's taken me a while, but overlapping is now possible with version 3.0.0! I've also added the ability to return offsets.
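For anyone following along, here is a rough sketch of the general idea discussed above: overlapping chunks that never exceed the chunk size, returned together with their offsets. This is not how `semchunk` implements it. The function name `overlapping_chunks`, its parameters, and the use of a pre-split token list are all assumptions made for illustration, and it uses a plain sliding window rather than semantic splitting.

```python
def overlapping_chunks(
    tokens: list[str], chunk_size: int, overlap: int
) -> list[tuple[list[str], tuple[int, int]]]:
    """Hypothetical helper: slide a fixed-size window over `tokens`.

    Returns (chunk, (start, end)) pairs, where the offsets are token
    indices. No chunk is longer than `chunk_size`, and consecutive
    chunks share `overlap` tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Each window starts `stride` tokens after the previous one.
    stride = chunk_size - overlap
    results = []

    for start in range(0, len(tokens), stride):
        end = min(start + chunk_size, len(tokens))
        results.append((tokens[start:end], (start, end)))

        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if end == len(tokens):
            break

    return results


if __name__ == "__main__":
    for chunk, offsets in overlapping_chunks("a b c d e".split(), 3, 1):
        print(offsets, chunk)
```

The point of computing the stride as `chunk_size - overlap` is that the overlap is paid for out of the chunk size budget, so the fixed limit still holds for every chunk.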
+1
Offering a generator chunker and perhaps even support for lazy chunking is something I’m open to. I’ll start work on that shortly. With regard to offering an asynchronous generator, I’m...
So you imagine it being used to handle inputs that are async iterators, is that right? For example: ```python chunker = chunkerify(...) texts = my_async_text_generator() # Normally you'd do this:...
@Goldziher sorry for the delay, I hadn't been focused on `semchunk` for the past couple months, but I returned recently to add some new features. I'm taking another look at...
Hey @do-me, Thanks for creating this issue. I certainly haven't seen anything like this before! I can confirm that it also takes an awfully long time on my PC. It...