Use rmongodb datatype to read bson objects from stdin in R

Open kfleischmann opened this issue 9 years ago • 3 comments

Hey guys. I am trying to connect Hadoop Streaming with R, and I thought the datatypes from rmongodb might help me out.

So this is the idea

Hadoop Streaming[hadoop-mongo-connector] -> mapper.py -> reducer.R

The mapper is really straightforward, using the implementation from pymongo_hadoop; see https://github.com/mongodb/mongo-hadoop/tree/master/streaming/language_support/python

On the R side, I want something like iterating over stdin.

library(rmongodb)

# open stdin in binary mode, since BSON is a binary format
conn <- file("stdin", open = "rb")
buf <- mongo.bson.buffer.create()

# R does not allow this because conn is a connection, not the correct datatype
mongo.bson.buffer.append.raw( conn )

# iterate over buf

Does anyone out there have a smart idea?

kfleischmann avatar Mar 11 '15 22:03 kfleischmann

Hi. I haven't tried Hadoop Streaming + R + rmongodb, so unfortunately I probably can't help you. What does your stdin input look like?

P.S. When I need to work with large amounts of data (a MongoDB instance or a BSON dump), I use

  1. 95% of the time - Apache Spark + Scala + mongo-hadoop-connector (which has a lot of bugs, in my experience)
  2. 5% of the time - R + SparkR + rmongodb.

dselivanov avatar Mar 13 '15 09:03 dselivanov

Hi! Thanks for your reply. The input format is very simple: an array of BSON objects. I think the simplest way might be to read each object one at a time and transform it into an R format. But I don't want to do that myself; I would rather use existing libraries. The documentation says that

mongo.bson objects have "mongo.bson" as their class and contain an externally managed pointer to
the actual document data. 

So I might just point this at stdin or copy the data into it.

By the way, your use cases are very interesting. Right now we have a wealth of data, but doing time-series analysis on it is really painful with different libraries. For each time-series analysis, the amount of data fits into memory. The complexity resides in the number of different time series.

kfleischmann avatar Mar 13 '15 11:03 kfleischmann

A possible solution might be to migrate the scripts to Python, because it has a nice BSON reader/writer, and to call R from within Python with rpy2.

But I would be glad to avoid that.

kfleischmann avatar Mar 13 '15 15:03 kfleischmann
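
For illustration, here is a minimal sketch of the "read each single object" step discussed above, written in Python since the thread's mapper is already Python. It assumes the input is a sequence of BSON documents written back to back on a binary stream, which may not match what the connector actually emits. It relies only on the BSON rule that every document begins with its total size as a little-endian int32. The function name `iter_bson_documents` is hypothetical, not part of rmongodb or pymongo_hadoop.

```python
import io
import struct


def iter_bson_documents(stream):
    """Yield the raw bytes of each BSON document in a binary stream.

    Assumes documents are concatenated with nothing in between.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of input
        if len(header) < 4:
            raise ValueError("truncated BSON length prefix")
        # The size counts the 4 length bytes and the trailing NUL byte,
        # so the smallest valid document (an empty one) is 5 bytes.
        (size,) = struct.unpack("<i", header)
        if size < 5:
            raise ValueError(f"invalid BSON document size: {size}")
        body = stream.read(size - 4)
        if len(body) != size - 4 or body[-1] != 0:
            raise ValueError("truncated or malformed BSON document")
        yield header + body


# Demo on an in-memory stream: an empty document followed by {"a": 1}.
empty_doc = b"\x05\x00\x00\x00\x00"
int_doc = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
sizes = [len(raw) for raw in iter_bson_documents(io.BytesIO(empty_doc + int_doc))]
print(sizes)
```

In a streaming job one would pass `sys.stdin.buffer` (the binary view of stdin) instead of the `BytesIO` object, then hand each chunk to a BSON decoder, for example the `bson` package that ships with PyMongo. The same length-prefix framing should carry over to R by reading from a binary connection, although that is untested here.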