
Apache Beam support?

Open azymnis opened this issue 7 years ago • 5 comments

What do you all think about adding support for Apache Beam? If there is enough interest, I could start looking into this.

azymnis avatar Jun 22 '17 08:06 azymnis

Thanks Argyris. Sounds exciting. Tbh we don't have any current plans to use Beam via Summingbird, but this sounds like a great addition to the Summingbird ecosystem.



pankajroark avatar Jun 22 '17 17:06 pankajroark

This wouldn't be that hard to do. There are many planners to look at for examples: memory, concurrentmemory, scalding, storm, and an old spark one we removed since we never used it.

johnynek avatar Jun 26 '17 21:06 johnynek

Ah, seems like this is what scio is doing, albeit using a completely different package. At least they are using algebird under the hood for aggregations. They also explicitly give a shoutout to scalding in the readme (makes sense, since Spotify has been using scalding).

azymnis avatar Jun 26 '17 21:06 azymnis

I was going to ask how it would compare to https://github.com/spotify/scio myself.

How do we envision this? Should Summingbird just be a DSL on top of Beam? If so, what would that buy us? Would we completely drop the Scalding and Storm/Heron counterparts and let Beam take care of the underlying framework (Spark, Dataflow, etc.)? FYI, the Heron community is discussing support for something like Beam on top of it.

sriramkrishnan avatar Jun 26 '17 21:06 sriramkrishnan

Well, scio is like scalding. Summingbird is about streaming intrinsically: there is a notion of time and single events. I think summingbird is, and always should be, about streaming map/reduce.

I think if we could get good performance on top of Beam, great. I imagine that the work to make it as performant and correct as the current code won't/can't be funded for maybe 1-2 years, assuming someone even starts (notice that about 4 years ago we were saying similar stuff about spark, which we have never found the time to support).

johnynek avatar Jun 27 '17 00:06 johnynek
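
For readers unfamiliar with the "streaming map/reduce" idea discussed above, here is a minimal sketch in plain Python. It is not the Summingbird or Beam API: the event shape, the `streaming_map_reduce` function, and the fixed-width time batching are all assumptions made for illustration. It only shows the general pattern of handling single timestamped events, mapping each to key/value pairs, and merging values with an associative `plus` (the role algebird plays in the Scala libraries).

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical event shape for this sketch: (timestamp in ms, payload).
Event = Tuple[int, str]


def streaming_map_reduce(
    events: Iterable[Event],
    flat_map: Callable[[str], List[Tuple[str, int]]],
    plus: Callable[[int, int], int],
    batch_ms: int,
) -> Dict[Tuple[str, int], int]:
    """Fold events one at a time into a store keyed by (key, time batch)."""
    store: Dict[Tuple[str, int], int] = {}
    for ts, payload in events:
        batch = ts // batch_ms  # assumption: fixed-width time batches
        for key, value in flat_map(payload):
            slot = (key, batch)
            # Merge with the associative combine; the first value seeds the slot.
            store[slot] = plus(store[slot], value) if slot in store else value
    return store


def word_counts(sentence: str) -> List[Tuple[str, int]]:
    # The "map" step: one event fans out into (word, 1) pairs.
    return [(word, 1) for word in sentence.split()]


events = [(0, "a b a"), (500, "b"), (1200, "a")]
result = streaming_map_reduce(events, word_counts, lambda x, y: x + y, batch_ms=1000)
```

Because `plus` is associative, the same logic could in principle be run per event in a streaming system or over whole batches offline, which is the property a planner for a new backend would need to preserve.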