Hive-mongo
Feature Request: read bson data directly from dbpath (without mongod running)
It would be really cool if Hive-mongo could read directly from MongoDB files rather than having to go through a mongod process (this way I could run it directly against backups without having to start mongod on them). If this is too difficult/impossible, the next best thing would be to be able to run it against the bson files produced by mongodump (though at that point, I'm already halfway to exporting the data to another format anyway).
Great idea! Currently it's not supported because we have a different use case: we use MongoDB to store some meta/user profile data, and we need to both query and update it.
The mongodump file seems to be just a sequence of BSON objects, so if there is a delimiter between rows/BSON objects, all that is needed is a BSON SerDe (a custom split implementation might also be needed to enable parallel processing). I'm not sure how difficult it would be to implement this based on the Java driver's BSON code; it still needs further investigation.
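On the delimiter question: as far as I know there is no separator byte between documents in a mongodump .bson file. Each BSON document instead starts with a little-endian int32 giving its total size (including those four bytes) and ends with a 0x00 byte, so rows can be cut out by walking the length prefixes. Below is a minimal sketch of that idea using only the JDK; the class name `BsonDumpSplitter` and the method `readDocuments` are made up for illustration and are not part of Hive-mongo or the Java driver.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive-mongo or the MongoDB Java driver.
class BsonDumpSplitter {

    /** Returns the raw bytes of every BSON document found in the stream. */
    static List<byte[]> readDocuments(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> docs = new ArrayList<>();
        byte[] header = new byte[4];
        while (true) {
            int first = data.read();
            if (first < 0) {
                break; // clean end of file on a document boundary
            }
            header[0] = (byte) first;
            data.readFully(header, 1, 3);

            // BSON stores the document size as a little-endian int32.
            int length = (header[0] & 0xFF)
                    | ((header[1] & 0xFF) << 8)
                    | ((header[2] & 0xFF) << 16)
                    | ((header[3] & 0xFF) << 24);
            if (length < 5) {
                // 4 size bytes + 1 terminator is the smallest valid document.
                throw new IOException("Invalid BSON document length: " + length);
            }

            // Keep the size prefix so the bytes can be handed to a BSON decoder.
            byte[] doc = new byte[length];
            System.arraycopy(header, 0, doc, 0, 4);
            data.readFully(doc, 4, length - 4);
            if (doc[length - 1] != 0) {
                throw new IOException("BSON document is missing its 0x00 terminator");
            }
            docs.add(doc);
        }
        return docs;
    }
}
```

Each returned byte array is one complete document that a SerDe could then decode. Because document boundaries can only be found by following the length prefixes from the start of the file, a parallel split implementation would presumably need a precomputed index of document offsets rather than arbitrary byte offsets.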
As a workaround, I think you could dump the data as a CSV file using mongoexport. If the CSV is huge, compression (snappy, lzo, bz2, gzip) might help.
(Email reply to Alessandro D. Gagliardi's message of Tue, May 8, 2012, 7:52 AM.)
Reply to this email directly or view it on GitHub: https://github.com/yc-huang/Hive-mongo/issues/4
CSV is no good as we have shifting schemata and nested documents and all kinds of other madness that make CSV a mess. I imagine you're already aware of https://github.com/mongodb/mongo-hadoop but I thought I'd mention it just in case.
yeah, they have a wonderful shard-aware input split implementation and we'd like to migrate Hive-mongo to use that...
(Email reply to Alessandro D. Gagliardi's message of Wednesday, May 9, 2012.)
Just got a message from a 10gen engineer that they have a Hive connector which currently supports static BSON files: https://github.com/mongodb/mongo-hadoop/tree/master/hive