snoowrap
Implementation of streams
Given that most popular wrappers have streams implemented out of the box, should we support them too?
Right now, if someone wants to create a stream they have to do one of three things: wrap an endpoint in a loop and call it every so often, create their own streaming wrapper around snoowrap, or use something like snoostorm, which is a streaming wrapper for snoowrap.
Streams are a good tool for bots, and I feel that having them implemented out of the box would be valuable for bot creators. What are everyone's thoughts?
I had a brief discussion with not-an-aardvark on Gitter:
SpyTec: Have you had thoughts about implementing Submission, or Comment streaming? I've noticed there are a few snoowrap libraries for it
not-an-aardvark: Re. streaming, I think I'd rather leave it out of snoowrap since it's basically just a matter of invoking snoowrap repeatedly at a set interval.
SpyTec: Though most wrappers have streaming implemented in some way: PRAW, Go's wrapper (graw, I think).
Since a lot of people want to check a stream of content for their bot or whatnot, I think it only makes sense to add it. Otherwise people have to implement it themselves every time or use another dependency like snoostorm.
not-an-aardvark: I guess my take on streaming is that a working implementation is extremely easy outside of snoowrap (just put a setInterval around your code and you're pretty much done), and with all the parameters that would be needed for a full-fledged implementation, it would be better to let people do it themselves. I haven't looked at other wrappers' versions of streaming in detail, but is it basically just doing requests in a loop and storing a list of items that have already been seen?
SpyTec: Mostly. With streaming I've seen people use setInterval together with an EventEmitter, so you can start up the stream and listen for events. PRAW creates a for-loop that yields new results. GRAW does something similar to PRAW, with channels and goroutines. I think we should at least open an issue for it saying it's okay to implement.
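The setInterval plus EventEmitter pattern mentioned in the chat could look roughly like the sketch below. `createPollingStream`, `handleBatch`, and the `item` and `pollError` event names are hypothetical and not part of snoowrap; `fetchItems` is assumed to be any function that resolves to an iterable of objects with an `id` property.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper, not part of snoowrap: calls `fetchItems` every
// `intervalMs` milliseconds and emits an 'item' event for each unseen item.
function createPollingStream(fetchItems, intervalMs) {
  const emitter = new EventEmitter();
  const seenIds = new Set(); // grows without bound in this sketch

  // Emit only the items whose id did not appear in an earlier batch.
  emitter.handleBatch = (items) => {
    for (const item of items) {
      if (!seenIds.has(item.id)) {
        seenIds.add(item.id);
        emitter.emit('item', item);
      }
    }
  };

  const timer = setInterval(() => {
    Promise.resolve()
      .then(fetchItems)
      .then(emitter.handleBatch)
      // A custom event name is used because emitting 'error' with no
      // listener attached would throw.
      .catch((err) => emitter.emit('pollError', err));
  }, intervalMs);

  // Stop polling; without this the timer keeps the process alive.
  emitter.stop = () => clearInterval(timer);
  return emitter;
}
```

A caller would attach a listener with `stream.on('item', ...)` and pass something like `() => r.getNew('some_subreddit')` as `fetchItems`, assuming the result behaves like an array of objects with an `id`.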
I don't object to adding this/saying it's okay to implement. There are a few complexities that we might need to deal with or make a decision on:
- What configuration options do we need? I imagine people might want to specify the polling interval. Are there other things that people need to control?
- What happens if reddit returns an error on one of the requests? Aborting the whole stream probably isn't the most stable choice, since sporadic errors happen occasionally. On the other hand, if reddit is down or overloaded, continuing to send requests and repeatedly getting errors doesn't seem ideal, particularly if the requests are very frequent.
- What does the API look like? I think we could either add a method to Listing, or provide a wrapper around some other function that returns listings (e.g. snoowrap.stream(r.getHot)). Do we need to support the case where a Listing is somewhere in the middle of a response body (e.g. streaming replies to a comment) rather than being the entire response body?
- What do we do for APIs that don't directly support sorting by "new"? That could make it difficult to have streaming for those APIs.
The potential complexity here is what leads to my intuition that maybe it would be better to let users handle this themselves, since they know what their requirements are and they don't need to handle the special cases. But I'm fine with adding something if we're sure we know what we're doing.
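One possible answer to the error-handling question above is to keep the stream alive but slow down after consecutive failures. The sketch below is only an illustration of that idea: `nextDelayMs`, `startBackoffPolling`, and the `baseMs` and `maxMs` options are hypothetical names, and the default values are arbitrary assumptions rather than anything reddit or snoowrap prescribes.

```javascript
// Hypothetical helper: the delay doubles with each consecutive error,
// capped at `maxMs`. With zero errors it equals the base polling interval.
function nextDelayMs(baseMs, maxMs, consecutiveErrors) {
  return Math.min(maxMs, baseMs * 2 ** consecutiveErrors);
}

// Hypothetical polling loop built on setTimeout so the delay can change
// between requests. Returns a function that stops the loop.
function startBackoffPolling(fetchItems, onItems, { baseMs = 5000, maxMs = 300000 } = {}) {
  let consecutiveErrors = 0;
  let timer = null;
  let stopped = false;

  async function tick() {
    try {
      onItems(await fetchItems());
      consecutiveErrors = 0; // a successful request resets the backoff
    } catch (err) {
      // A failed request (or a throwing `onItems`) does not abort the
      // stream; it only lengthens the wait before the next attempt.
      consecutiveErrors += 1;
    }
    if (!stopped) {
      timer = setTimeout(tick, nextDelayMs(baseMs, maxMs, consecutiveErrors));
    }
  }

  timer = setTimeout(tick, baseMs);
  return () => {
    stopped = true;
    clearTimeout(timer);
  };
}
```

This covers both concerns raised in the list: sporadic errors do not kill the stream, and sustained outages reduce the request rate instead of hammering the API at the normal interval.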
You're right about the complexity. I think for our purposes it would be best to implement streams for the most common endpoints: submissions, comments, and anything that sorts by new.
At the same time, I want the user to be able to apply their own conditions to the stream to decide whether content is new. I think this might be best done with an optional parameter that takes a function making that decision.
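As a rough sketch of that optional parameter, assuming items carry an `id` and that already-seen ids are tracked in a Set (`selectNew`, `defaultIsNew`, and the `isNew` parameter are hypothetical names, not snoowrap API):

```javascript
// Default rule: an item is new when its id has not been seen before.
const defaultIsNew = (item, seenIds) => !seenIds.has(item.id);

// Hypothetical helper: returns the items that `isNew` accepts and records
// their ids so they are not reported again by the default rule.
function selectNew(items, seenIds, isNew = defaultIsNew) {
  const fresh = items.filter((item) => isNew(item, seenIds));
  for (const item of fresh) {
    seenIds.add(item.id);
  }
  return fresh;
}
```

A bot could then pass its own rule, for example `(item, seenIds) => item.created_utc >= startedAt && !seenIds.has(item.id)` to ignore anything posted before the stream started, where `startedAt` is a timestamp in seconds that the caller records.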
Some endpoints probably support limiting the returned submissions so we do not need to fetch 25 items at a time or whatever the default is.
We would then also have to store the items received from the endpoints in an array, preferably one that only keeps the last x items (200 should be enough for most cases)
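A minimal sketch of such a bounded store, tracking ids rather than whole items and relying on the fact that a JavaScript Set iterates in insertion order (`BoundedSeenSet` is a hypothetical name, and 200 is just the figure suggested above):

```javascript
// Hypothetical helper: remembers at most `maxSize` ids and evicts the
// oldest one when a new id would exceed that limit.
class BoundedSeenSet {
  constructor(maxSize = 200) {
    this.maxSize = maxSize;
    this.ids = new Set(); // a Set iterates in insertion order
  }

  has(id) {
    return this.ids.has(id);
  }

  add(id) {
    if (this.ids.has(id)) return;
    this.ids.add(id);
    if (this.ids.size > this.maxSize) {
      // The first value in iteration order is the oldest insertion.
      const oldest = this.ids.values().next().value;
      this.ids.delete(oldest);
    }
  }

  get size() {
    return this.ids.size;
  }
}
```

The limit needs to be comfortably larger than the number of items fetched per request; otherwise an id could be evicted while the item is still being returned, and it would be reported as new a second time.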