hadoopy backend?

Open dgleich opened this issue 14 years ago • 4 comments

Hey, I really love the job management stuff in dumbo. However, it seems like the inner core of hadoopy is more highly optimized. (I get a factor of 2 better performance in my tests.) So it seems to me that the right way of combining the two is to write a hadoopy backend for dumbo. Would this be something you'd be interested in adding to dumbo? I'm happy to work on it in some capacity if there is interest.

dgleich avatar Feb 08 '11 06:02 dgleich

I would help with this if there is interest. The purpose of Hadoopy isn't to recreate this functionality; it is to provide a thin core Python interface for streaming. I use whirr and oozie for cluster and job management, respectively (Hadoopy is designed to be compatible with these tools). I can see more casual users not wanting to use these more powerful but complex tools and opting for a more integrated approach instead.

There are a few things we need to take into account.

  1. Practically, I'd need to relicense my code so that it is compatible (David and Andrew are the only other contributors). This shouldn't be a problem and I'd be willing to do that (I'd most likely dual license it).
  2. Should it be part of dumbo, optional, or a separate fork? I think the cleanest solution is that dumbo can optionally use Hadoopy as a backend if it is available.
  3. Backwards compatibility is going to be an important focus. I'd want to find a diverse set of Dumbo users to work with us running legacy code. Unit tests can help here.
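
For point 2, the "use it if it is available" idea could look roughly like the sketch below. This is only an illustration, not actual dumbo or Hadoopy code: `select_backend` and the `"builtin"` fallback name are hypothetical, and the only real call used is the standard library's `importlib.util.find_spec`, which reports whether a top-level module can be imported without importing it.

```python
import importlib.util


def select_backend(preferred="hadoopy", fallback="builtin"):
    """Return the backend name to use (hypothetical helper, not a dumbo API).

    Picks `preferred` when a top-level module with that name is installed,
    otherwise falls back to `fallback` (a placeholder name standing in for
    the existing built-in code path).
    """
    # find_spec returns None when the top-level module cannot be found.
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


if __name__ == "__main__":
    # Prints "hadoopy" only if that package is installed, else "builtin".
    print(select_backend())
```

With this shape, existing jobs would keep running on the fallback path when Hadoopy is not installed, which matters for the backwards-compatibility concern in point 3.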

bwhite avatar Feb 08 '11 07:02 bwhite

I'd definitely be interested, and I'd be happy to review code or help figure out how to hook things up. As I'm pretty busy these days, I probably won't be able to help with the actual coding, but it looks like we might already have enough manpower to get something done. So bring on the code -- I look forward to having a look at it and trying it out. :)

klbostee avatar Feb 08 '11 09:02 klbostee

Okay, this sounds like something worth pursuing. (At least, I would really like it. I had to switch back to dumbo for some last minute tests in a paper recently because I needed some of the libegg/libjar/etc. features.)

One question: Would you need to dual license it if dumbo just used it as a black-box backend? (I am not up to speed on how Python's "import" interacts with licenses.) I agree that this is the cleanest approach.

dgleich avatar Feb 11 '11 17:02 dgleich

Not sure about the licensing either, but surely we could figure something out...

klbostee avatar Nov 17 '11 22:11 klbostee