
Request: offer generators and async generators

Open Goldziher opened this issue 1 year ago • 5 comments

Hi there!

Thanks for this neat library. I'm giving it a go.

It would be great to have two variants of the chunkerify function that return a generator and async generator, and a version that is async.

Use cases:

  • async evaluation is good for non-blocking situations, for example chunking dynamically inside a web request, where a blocking (sync) call can in some cases degrade the backend service as a whole. It could also allow for a concurrent (not parallel) version of chunking.
  • returning a generator allows evaluating in intervals and executing code in between, for example in a for loop.
  • returning an async generator offers the same, within an async context.
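To make the generator and async generator use cases concrete, here is a minimal sketch. It does not use semchunk's API: `toy_chunker`, `iter_chunks` and `aiter_chunks` are hypothetical names, and the chunker is a stand-in that splits on blank lines. Note that this is only lazy per input text, not within a single text.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterable, Iterator


def toy_chunker(text: str) -> list[str]:
    """Hypothetical stand-in for a real chunker: split on blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def iter_chunks(
    chunker: Callable[[str], list[str]], texts: Iterable[str]
) -> Iterator[str]:
    """Yield chunks one at a time; each text is chunked only when reached."""
    for text in texts:
        yield from chunker(text)


async def aiter_chunks(
    chunker: Callable[[str], list[str]], texts: Iterable[str]
) -> AsyncIterator[str]:
    """Same idea in an async context, yielding control between texts."""
    for text in texts:
        for chunk in chunker(text):
            yield chunk
        # Give other tasks a chance to run before chunking the next text.
        await asyncio.sleep(0)
```

The caller can then run its own code between chunks, for example `for chunk in iter_chunks(toy_chunker, texts): ...` or `async for chunk in aiter_chunks(toy_chunker, texts): ...`.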

The simplest (but non-performant) option for implementing the async logic is to execute the sync version using something like anyio.to_thread.run_sync: https://anyio.readthedocs.io/en/stable/threads.html.
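As a rough sketch of that thread-based approach, the snippet below uses the standard library's `asyncio.to_thread` (Python 3.9+) in place of anyio, so that it runs without extra dependencies. `fixed_width_chunker` and `chunk_async` are hypothetical names, and the chunker is a stand-in, not semchunk.

```python
import asyncio
from typing import Callable


def fixed_width_chunker(text: str, chunk_size: int = 5) -> list[str]:
    """Hypothetical stand-in for a sync chunker: fixed-width character chunks."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


async def chunk_async(chunker: Callable[[str], list[str]], text: str) -> list[str]:
    """Run the sync chunker in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(chunker, text)


async def main() -> list[list[str]]:
    # Both calls are awaited concurrently; results come back in argument order.
    return await asyncio.gather(
        chunk_async(fixed_width_chunker, "hello world"),
        chunk_async(fixed_width_chunker, "abc"),
    )
```

Calling `asyncio.run(main())` returns one list of chunks per input text. For CPU-bound work the threads still share the GIL, so this keeps the event loop responsive rather than making the chunking itself faster.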

Goldziher avatar Jun 19 '24 17:06 Goldziher

Offering a generator chunker and perhaps even support for lazy chunking is something I’m open to. I’ll start work on that shortly.

With regard to offering an asynchronous generator, I’m not too sure what value there would be in that when there isn’t anything I’m aware of in my chunker that is IO-bound. And seeing as synchronous functions and generators are already callable within asynchronous environments, making chunkers asynchronous would only seem to add more overhead. If there’s something I’m missing here, however, please let me know.

umarbutler avatar Jun 20 '24 02:06 umarbutler

Using an async iterator / generator allows for streaming the source rather than loading it all into memory.

Goldziher avatar Jun 20 '24 13:06 Goldziher

So you imagine it being used to handle inputs that are async iterators, is that right? For example:

chunker = chunkerify(...)
texts = my_async_text_generator()

# Normally you'd do this:
chunks = [chunker(text) async for text in texts]

# But you'd like to be able to do this(?)
chunks = await chunker(texts)
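For reference, a self-contained sketch of the streaming variant might look like the following. Everything here is hypothetical: `split_sentences` stands in for the chunker, `text_stream` stands in for `my_async_text_generator()`, and `chunk_stream` is not part of semchunk.

```python
import asyncio
from typing import AsyncIterator, Callable


def split_sentences(text: str) -> list[str]:
    """Hypothetical stand-in for a chunker: one chunk per sentence."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


async def text_stream() -> AsyncIterator[str]:
    """Hypothetical async source, e.g. documents arriving over the network."""
    for text in ("One. Two.", "Three."):
        await asyncio.sleep(0)  # simulate waiting on IO
        yield text


async def chunk_stream(
    chunker: Callable[[str], list[str]], texts: AsyncIterator[str]
) -> AsyncIterator[str]:
    """Chunk each text as it arrives, without collecting the source first."""
    async for text in texts:
        for chunk in chunker(text):
            yield chunk


async def collect() -> list[str]:
    return [chunk async for chunk in chunk_stream(split_sentences, text_stream())]
```

Only one text is held in memory at a time, which is the streaming benefit; the chunking itself still runs synchronously on the event loop thread.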

umarbutler avatar Jun 20 '24 13:06 umarbutler

For a stream I would use an async iterator (e.g. an async generator).

But using async for chunking is purely for IO-bound situations, like using chunking in an API. The advantage of

chunks = await chunker(texts)

is that this will be run in a worker thread rather than the main thread, and thus will not block the execution of other async tasks.

I can fake it by doing something like

await anyio.to_thread.run_sync(chunker, texts)

But this is pretty suboptimal since it slows execution quite a bit.

Goldziher avatar Jun 20 '24 14:06 Goldziher

@Goldziher sorry for the delay, I hadn't been focused on semchunk for the past couple months, but I returned recently to add some new features. I'm taking another look at this.

I understand your use case now. I'm also working with an async web server where I'm going to deploy semchunk, and I can anticipate needing to run it in its own thread. I'm wondering, though, how I would go about making semchunk run in its own async thread. Do you have an example I can work off?

umarbutler avatar Dec 31 '24 08:12 umarbutler