
Feature Request: read bson data directly from dbpath (without mongod running)

Open MadDataScience opened this issue 12 years ago • 4 comments

It would be really cool if Hive-mongo could read directly from MongoDB data files rather than having to go through a mongod process (this way I could run it directly against backups without having to start mongod on them). If this is too difficult or impossible, the next best thing would be to run it against the BSON files produced by mongodump (though at that point, I'm already halfway to exporting the data to another format anyway).

MadDataScience avatar May 07 '12 23:05 MadDataScience

Great idea! It's not currently supported because our use case is different: we use MongoDB to store some metadata and user-profile data, and we need to both query and update it.

A mongodump file appears to be just a sequence of BSON objects, so if there is a delimiter between rows (BSON objects), all that's needed is a BSON SerDe (a custom split implementation might also be needed to enable parallel processing). I'm not sure how difficult it would be to implement this on top of the Java driver's BSON code; it needs further investigation.
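
For reference, on the delimiter question: per the BSON specification, each document begins with a little-endian int32 giving its total size (including those four bytes) and ends with a `0x00` byte, so a dump file made of concatenated documents can be walked by length prefix rather than by delimiter. Below is a minimal Python sketch of that idea, using only the standard library. The function name `iter_bson_documents` is made up for illustration; it only splits the stream into raw documents and does not decode fields, which a real SerDe would still have to do.

```python
import io
import struct


def iter_bson_documents(stream):
    """Yield each raw BSON document (as bytes) from concatenated documents.

    Hypothetical helper for illustration: it relies only on the int32
    length prefix and trailing NUL byte defined by the BSON spec.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack("<i", header)  # little-endian int32
        if size < 5:  # smallest valid document: 4-byte size + NUL
            raise ValueError(f"invalid document size: {size}")
        body = stream.read(size - 4)
        if len(body) != size - 4 or body[-1] != 0:
            raise ValueError("truncated or malformed document")
        yield header + body


# Usage: two hand-built documents, {"a": 1} (int32 field) and {}.
doc_a = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
doc_empty = b"\x05\x00\x00\x00\x00"
docs = list(iter_bson_documents(io.BytesIO(doc_a + doc_empty)))
```

One consequence for parallel processing: since there is no sync marker between documents, a reader cannot safely start at an arbitrary byte offset, so a custom split would probably need a pre-computed index of document offsets or a scan from the start of the file.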

As a workaround, I think you could dump the data as a CSV file using mongoexport. If the CSV is huge, compression (Snappy, LZO, bzip2, gzip) might help.

Reply to this email directly or view it on GitHub: https://github.com/yc-huang/Hive-mongo/issues/4

yc-huang avatar May 08 '12 02:05 yc-huang

CSV is no good, as we have shifting schemata, nested documents, and all kinds of other madness that make CSV a mess. I imagine you're already aware of https://github.com/mongodb/mongo-hadoop but I thought I'd mention it just in case.

MadDataScience avatar May 08 '12 16:05 MadDataScience

Yeah, they have a wonderful shard-aware input split implementation, and we'd like to migrate Hive-mongo to use it...

yc-huang avatar May 10 '12 01:05 yc-huang

I just got a message from a 10gen engineer saying that they have a Hive connector which currently supports static BSON files: https://github.com/mongodb/mongo-hadoop/tree/master/hive

yc-huang avatar Jun 25 '12 02:06 yc-huang