rmongodb
Use rmongodb datatypes to read BSON objects from stdin in R
Hey guys. I'm trying to connect Hadoop Streaming with R, and I thought the datatypes from rmongodb might help me out.
So this is the idea:
Hadoop Streaming [hadoop-mongo-connector] -> mapper.py -> reducer.R
The mapper is really straightforward using the implementation from pymongo_hadoop, see https://github.com/mongodb/mongo-hadoop/tree/master/streaming/language_support/python
I want something like iterating over stdin.
conn <- file("stdin", open = "r")
buf <- mongo.bson.buffer.create()
# R does not allow this because conn is not the correct datatype
mongo.bson.buffer.append.raw(conn)
# iterate over buf
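Conceptually, what I am after is a loop like the one below. This is just a sketch: the framing follows the BSON spec (every document starts with its total size as a little-endian int32, and that size includes the four length bytes), and mongo.bson.from.raw() is a made-up name for the constructor I am missing; I could not find anything like it in the rmongodb docs.

conn <- file("stdin", open = "rb")
repeat {
  # each BSON document begins with its total length (little-endian int32)
  len <- readBin(conn, what = "integer", n = 1, size = 4, endian = "little")
  if (length(len) == 0) break  # EOF
  # read the remaining bytes and reassemble the full document
  rest <- readBin(conn, what = "raw", n = len - 4)
  doc.bytes <- c(writeBin(len, raw(), size = 4, endian = "little"), rest)
  # the missing piece: turning the raw bytes into a mongo.bson object;
  # mongo.bson.from.raw() is hypothetical, this is exactly the function I am looking for
  # b <- mongo.bson.from.raw(doc.bytes)
}
close(conn)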
Does someone out there have a smart idea?
Hi. I haven't tried Hadoop Streaming + R + rmongodb, so unfortunately I probably can't help you. What does your stdin input look like?
P.S. When I need to work with large amounts of data (a mongodb instance or a bson dump), I use:
- 95% of the time: Apache Spark + Scala + mongo-hadoop-connector (which has a lot of bugs, in my experience)
- 5% of the time: R + sparkR + rmongodb.
Hi! Thanks for your reply. The input format is very simple: an array of BSON objects. I think the simplest way might be to read each single object and transform it into an R format. But I don't want to do that myself; I'd rather use existing libraries. The documentation says that the
mongo.bson objects have "mongo.bson" as their class and contain an externally managed pointer to
the actual document data.
So I may just point this at stdin or copy it.
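One possible fallback: if I changed the mapper to print one extended-JSON document per line instead of raw BSON, the R side would only need standard line reading. A sketch, assuming an rmongodb version that provides mongo.bson.from.JSON() and mongo.bson.to.list():

library(rmongodb)
conn <- file("stdin", open = "r")
while (length(line <- readLines(conn, n = 1)) > 0) {
  # parse one extended-JSON document into a mongo.bson object,
  # then turn it into a plain R list for the reduce step
  b <- mongo.bson.from.JSON(line)
  doc <- mongo.bson.to.list(b)
  # ... reduce logic on doc ...
}
close(conn)

But that would mean giving up the pure BSON pipeline between mapper and reducer, which is part of what I wanted in the first place.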
Btw: very interesting to hear about your use cases. Right now we have a wealth of data, but doing time-series analysis on it is really painful with different libraries. For each time-series analysis the amount of data fits into memory; the complexity resides in the number of different time series.
A possible solution might be to migrate the scripts to Python, since there is a nice BSON reader/writer there, and use R within Python via rpy2.
But I would be glad to avoid that.