summingbird
Apache Beam support?
What do you all think about adding support for Apache Beam? If there is enough interest, I could start looking into this.
Thanks, Argyris. Sounds exciting. Tbh we don't have any current plans to use Beam via Summingbird, but this sounds like a great addition to the Summingbird ecosystem.
On Thu, Jun 22, 2017 at 1:13 AM, Argyris Zymnis [email protected] wrote:
What do you all think about adding support for apache beam https://beam.apache.org/? If there is enough interest I could start looking into this.
This wouldn't be that hard to do. There are many planners to look at for examples: memory, concurrentmemory, scalding, storm, and an old spark one we removed since we never used it.
Ah, seems like this is what scio is doing, albeit using a completely different package. At least they are using algebird under the hood for aggregations. They also explicitly give a shoutout to scalding in the readme (makes sense, since Spotify has been using scalding).
I was going to ask how it would compare to https://github.com/spotify/scio myself.
How do we envision this? Should Summingbird just be a DSL on top of Beam? If so, what would that buy us? Would we completely drop the Scalding and Storm/Heron counterparts and let Beam take care of the underlying framework (Spark, Dataflow, etc.)? FYI, the Heron community is discussing support for something like Beam on top of it.
Well, scio is like scalding. Summingbird is about streaming intrinsically: there is a notion of time and single events. I think summingbird is, and always should be, about streaming map/reduce.
I think if we could get good performance on top of beam, great. But I imagine the work to make it as performant and correct as the current code won't/can't be funded for maybe 1-2 years, assuming someone even starts (notice that about 4 years ago we were saying similar stuff about spark, which we have never found the time to support).