rmongodb
Performance comparison RMongo and rmongodb
Hi,
I am experimenting with both rmongodb and RMongo, trying to retrieve a single field from a MongoDB collection with 31 million records. I limit the query to 1 million records, because rmongodb exhausts my memory when I go higher than that. If I execute a query with rmongodb that extracts only one field (with an index on it), with data.frame=FALSE, it takes 231 seconds. If I run the same query with RMongo (and get a data frame for it for free!), it takes 16 seconds. This is an astoundingly big difference, and not only in performance: the huge list from rmongodb also eats up my memory (300 MB).
What causes this big difference in performance and memory use?
It is quite hard to answer your question without actual data. Can you provide your code and a BSON dump of your data (or a generated example)? What version of rmongodb do you use? There were significant changes in v1.8.0.
Yes, I am using rmongodb v1.8.0.
You can create a test dataset in a database test with a collection zip. Execute the following code:
library(rmongodb)
mongo <- mongo.create()
mongo.is.connected(mongo)

data(zips)
colnames(zips)[5] <- "orig_id"

ziplist <- list()
ziplist <- apply(zips, 1, function(x) c(ziplist, x))
res <- lapply(ziplist, function(x) mongo.bson.from.list(x))

if (mongo.is.connected(mongo) == TRUE) {
  for (i in 1:100) {
    mongo.insert.batch(mongo, "test.zip", res)
  }
}
This gives 2947000 records.
Load RMongo:
library(RMongo)
rmong <- mongoDbConnect("test")
system.time(testDF <- dbGetQueryForKeys(rmong, 'zip','{}','{"_id":0,"pop":1}',0,1000000))
This gives the result for RMongo (11.79 seconds on my system), returned as a nice, compact data frame.
For rmongodb:
rmong2 <- mongo.create()
system.time(test2 <- mongo.find.all(rmong2, "test.zip",query= "{}", fields = '{"_id":0,"pop":1}',limit=1000000, data.frame=FALSE))
With this, my memory slowly fills, and it finally returns after 350 seconds with a list of 297.5 MB. With data.frame=TRUE it takes even longer. It would be nice if the data.frame option used less memory than the one that returns lists.
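As a side note (a sketch not from the original thread): when every returned document carries the same single field, the list result can be flattened into a data frame by hand. The test2 value below is a hypothetical stand-in for the mongo.find.all result, assuming each document is a list with one numeric pop field:

```r
# Stand-in for the mongo.find.all(..., data.frame = FALSE) result above:
# a list of single-field documents (assumed shape, for illustration only).
test2 <- lapply(1:5, function(i) list(pop = i * 1000))

# Flatten the one field into a plain numeric vector, then wrap it in a
# data frame; the per-document list overhead can then be freed.
pops   <- vapply(test2, function(doc) as.numeric(doc[["pop"]]), numeric(1))
testDF <- data.frame(pop = pops)
```

vapply (rather than sapply) makes the assumed one-numeric-per-document shape explicit and fails loudly if a document does not match it.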
The main reason for such a big difference in memory usage is that mongo.find.all returns a list (a list of vectors) instead of a data.frame. Try to unlist the result.
And in general I suppose this is the correct approach, because of the "unstructured" nature of BSON documents (nested subdocuments, arrays of different lengths, etc.).
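To illustrate the point (a hypothetical sketch, not code from the thread): in a list of tiny per-document vectors, the per-element overhead dwarfs the actual payload, and unlist collapses it to a flat vector close to the raw data size:

```r
# Simulate the shape mongo.find.all returns with data.frame = FALSE:
# one tiny named vector per document.
n        <- 1e5
res_list <- lapply(seq_len(n), function(i) c(pop = i))

# Collapse to a flat numeric vector (roughly 8 bytes per value).
res_vec <- unlist(res_list, use.names = FALSE)

print(object.size(res_list), units = "MB")  # dominated by list metadata
print(object.size(res_vec),  units = "MB")  # close to the raw payload size
```

The exact sizes depend on the R version, but the list form is consistently several times larger than the unlisted vector.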
But the dramatic difference in running time is definitely worth fixing. I suppose we need to rewrite mongo.cursor.to.list in pure C/C++.
Of course I can unlist the result, but then I have to have a result first :) If the result doesn't fit in memory, I can't unlist it afterwards...
I mean: try to unlist it to compare memory usage :-) Most of the RAM was spent on list metadata.
Best regards,
Dmitriy Selivanov
I understand that, and it probably is something similar, but if an intermediate step uses more memory than I have, then I still can't unlist the result. I can retrieve more records with RMongo than with rmongodb because of rmongodb's intermediate memory use.
I understand your point and agree that we need a better way to construct data frames.
Best regards,
Dmitriy Selivanov