rmongodb
Performance comparison RMongo and rmongodb
Hi,
I am experimenting with both rmongodb and RMongo, trying to retrieve a single field from a MongoDB collection with 31 million records. I limit the query to 1 million records, because rmongodb exhausts my memory when I go higher than that. If I execute a query with rmongodb that extracts only one field (with an index on it), with data.frame=FALSE, it takes 231 seconds. If I run the same query with RMongo (and get a data frame for it for free!), it takes 16 seconds. This is an astoundingly big difference, and not only in performance: the huge list from rmongodb also eats up my memory (300 MB).
What causes this big difference in performance and memory use?
It is quite hard to answer your question without actual data. Can you provide your code and a BSON dump of your data (or a generated example)? What version of rmongodb do you use? There were significant changes in v1.8.0.
Yes, I am using rmongodb v1.8.0.
You can create a test dataset in a database test with a collection zip. Execute the following code:
library(rmongodb)
mongo <- mongo.create()
mongo.is.connected(mongo)

data(zips)
colnames(zips)[5] <- "orig_id"

ziplist <- list()
ziplist <- apply(zips, 1, function(x) c(ziplist, x))
res <- lapply(ziplist, function(x) mongo.bson.from.list(x))

if (mongo.is.connected(mongo) == TRUE) {
  for (i in 1:100) {
    mongo.insert.batch(mongo, "test.zip", res)
  }
}
This gives 2947000 records.
Load RMongo:
library(RMongo)
rmong <- mongoDbConnect("test")
system.time(testDF <- dbGetQueryForKeys(rmong, 'zip','{}','{"_id":0,"pop":1}',0,1000000))
This gives the result for RMongo (11.79 seconds on my system), returned as a nice, compact data frame.
For rmongodb:
rmong2 <- mongo.create()
system.time(test2 <- mongo.find.all(rmong2, "test.zip",query= "{}", fields = '{"_id":0,"pop":1}',limit=1000000, data.frame=FALSE))
With this, my memory slowly fills, and it finally returns after 350 seconds with a list of 297.5 MB. With data.frame=TRUE it takes even longer. It would be nice if the data.frame option used less memory than the one that returns lists.
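As a side note (a sketch not from the original thread): when every returned document carries the same single field, the list result can be flattened into a data frame by hand. The test2 value below is a hypothetical stand-in for the mongo.find.all result, assuming each document is a list with one numeric pop field:

```r
# Stand-in for the mongo.find.all(..., data.frame = FALSE) result above:
# a list of single-field documents (assumed shape, for illustration only).
test2 <- lapply(1:5, function(i) list(pop = i * 1000))

# Flatten the one field into a plain numeric vector, then wrap it in a
# data frame; the per-document list overhead can then be freed.
pops   <- vapply(test2, function(doc) as.numeric(doc[["pop"]]), numeric(1))
testDF <- data.frame(pop = pops)
```

vapply (rather than sapply) makes the assumed one-numeric-per-document shape explicit and fails loudly if a document does not match it.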
The main reason for such a big difference in memory usage is that mongo.find.all returns a list (a list of vectors) instead of a data.frame. Try to unlist the result.
And in general I suppose this is the correct approach, because of the "unstructured" nature of BSON documents (nested subdocuments, arrays of different lengths, etc.).
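To illustrate the point (a hypothetical sketch, not code from the thread): in a list of tiny per-document vectors, the per-element overhead dwarfs the actual payload, and unlist collapses it to a flat vector close to the raw data size:

```r
# Simulate the shape mongo.find.all returns with data.frame = FALSE:
# one tiny named vector per document.
n        <- 1e5
res_list <- lapply(seq_len(n), function(i) c(pop = i))

# Collapse to a flat numeric vector (roughly 8 bytes per value).
res_vec <- unlist(res_list, use.names = FALSE)

print(object.size(res_list), units = "MB")  # dominated by list metadata
print(object.size(res_vec),  units = "MB")  # close to the raw payload size
```

The exact sizes depend on the R version, but the list form is consistently several times larger than the unlisted vector.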
But the dramatic difference in running time is definitely worth fixing. I suppose we need to rewrite mongo.cursor.to.list in pure C/C++.
Of course I can unlist the result, but then I have to have a result first :) If the result doesn't fit in memory, I can't unlist it afterwards...
I mean: try to unlist it to compare memory usage :-) Most of the RAM was spent on list metadata.
Best regards,
Dmitriy Selivanov
I understand that, and it probably is something similar, but if an intermediate step uses more memory than I have, then I still can't unlist the result. I can retrieve more records with RMongo than with rmongodb because of rmongodb's intermediate memory use.
I understand your point and agree that we need a better way to construct data frames.
Best regards,
Dmitriy Selivanov