
Batch update & insert: loop in C instead of R

Open RockScience opened this issue 10 years ago • 5 comments

I need to update a lot of documents at the same time. I know mongo doesn't support that. But could rmongodb provide a C wrapper so that the loop runs in C instead of R? For instance, I would like to pass a big list of BSON objects and update all of them in a C loop. Looping in R is too time-consuming.

RockScience avatar Nov 01 '13 07:11 RockScience

Can you send me a short code example in R? Your idea is to move the loop around the mongo.insert command into C, correct?

schmidb avatar Nov 01 '13 15:11 schmidb

Insertion is not an issue because there is a native batch insert that works fine. However, updating is trickier because you have to update the documents one by one. This is not an issue in compiled languages, but in R loops are very slow.

In this example, 'xts' is a time series of class xts. Each timestamp of the time series is a document in mongodb. I have to loop in R over the rows to update the entire time series in mongodb. (Ignore the details of the code inside the loop; it is taken from another project.)

so it is going to be something like:

 for (i in seq_len(nrow(xts))) {
   # Timestamp of row i; it identifies the document to update
   ts <- mongo.timestamp.create(strptime(index(xts[i]), "%Y-%m-%d"), increment=1)

   # Selection criteria: match the document with this timestamp
   critbuf <- mongo.bson.buffer.create()
   mongo.bson.buffer.append.timestamp(critbuf, "timestamp", ts)
   criteria <- mongo.bson.from.buffer(critbuf)

   # New document content (only the timestamp is set in this excerpt)
   buf <- mongo.bson.buffer.create()
   mongo.bson.buffer.append.timestamp(buf, "timestamp", ts)
   mongo.update(mongo, ns, criteria, mongo.bson.from.buffer(buf), mongo.update.upsert)
 }

Basically, I suggest extending the function mongo.update so that we can pass a vector of mongo objects and a vector of criteria (of the same length!), so that the loop is done in C/C++ inside the mongo.update function.

The code will become:

mongo.update(listOfMongoObjects, ns, listOfCriteria, ...)

RockScience avatar Nov 13 '13 04:11 RockScience
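
A minimal sketch of the interface proposed above, written in plain R. The name update_many and its update_fn argument are hypothetical and not part of rmongodb; update_fn stands in for mongo.update so the sketch runs without a database. The loop here still runs in R, so it only illustrates the calling convention (two lists of equal length), not the speed-up that a C implementation would aim for.

```r
# Hypothetical helper, not part of rmongodb: applies one update per
# (criteria, object) pair. `update_fn` stands in for mongo.update.
update_many <- function(mongo, ns, criteria_list, object_list,
                        update_fn, ...) {
  # The proposal requires both lists to have the same length.
  stopifnot(length(criteria_list) == length(object_list))
  results <- Map(
    function(criteria, obj) update_fn(mongo, ns, criteria, obj, ...),
    criteria_list, object_list
  )
  # One result per pair, in input order.
  unlist(results, use.names = FALSE)
}

# Stand-in for mongo.update that records its calls instead of
# talking to a server (an assumption made for illustration only).
calls <- list()
fake_update <- function(mongo, ns, criteria, obj, ...) {
  calls[[length(calls) + 1]] <<- list(criteria = criteria, obj = obj)
  TRUE
}

ok <- update_many(NULL, "db.coll",
                  list(list(id = 1), list(id = 2)),
                  list(list(x = "a"), list(x = "b")),
                  update_fn = fake_update)
```

With a real connection, update_fn would presumably be mongo.update itself, and the two lists would hold BSON objects built as in the loop above.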

Idea: make the processing layer configurable, e.g. mongo.insert(..., method=c("C", "R"))

schmidb avatar Nov 20 '13 11:11 schmidb

Thanks, indeed it is related.

However, my guess is that the fixes for these 2 issues (Issue #14 and Issue #19) are different (just a guess: I have looked into it, but certainly less than you have):

Issue #19 is about mongo.insert (here we 'just' need to call the right C function of the API, as there is a native bulk insert in the API since version 2.2 of mongodb: http://docs.mongodb.org/manual/core/bulk-inserts/)

Issue #14 is about mongo.update (which requires extra coding to do the loop in C, as there is no native bulk update)

RockScience avatar Nov 20 '13 11:11 RockScience

This is not entirely a matter of moving the loop into C/C++. The bottleneck in updates is I/O (network and disk). The simplest way to speed up your updates is to use more connections to mongodb. You can use mclapply or foreach to run the updates in multiple threads (actually processes). In my experience you can easily use up to 8 threads on a machine with 2 cores. In that case you will get a 5-7x speed-up, which is almost linear. But yes, an update function written in C/C++ would be nice to have.

dselivanov avatar Oct 01 '14 07:10 dselivanov
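
A rough sketch of the multi-process suggestion above, using mclapply from the parallel package. The helper names (split_indices, update_chunk, parallel_update) and the stand-in fake_update_row are hypothetical. In real use, update_row would wrap a mongo.update call, and each worker would presumably open its own connection with mongo.create() rather than share one across forked processes.

```r
library(parallel)

# Split row indices 1..n into contiguous chunks, one per worker.
split_indices <- function(n, n_workers) {
  idx <- seq_len(n)
  split(idx, ceiling(idx * n_workers / n))
}

# Run the per-row update for every row in one chunk.
update_chunk <- function(rows, update_row) {
  # Assumption: a real worker would call mongo.create() here to get
  # its own connection before updating its rows.
  vapply(rows, update_row, logical(1))
}

parallel_update <- function(n_rows, update_row, n_workers = 2L) {
  chunks <- split_indices(n_rows, n_workers)
  # mclapply forks processes; on Windows only mc.cores = 1 is supported.
  cores <- if (.Platform$OS.type == "windows") 1L else n_workers
  results <- mclapply(chunks, update_chunk,
                      update_row = update_row, mc.cores = cores)
  unlist(results, use.names = FALSE)
}

# Stand-in for a per-row mongo.update call (illustration only):
# sleeps briefly to mimic network latency, then reports success.
fake_update_row <- function(i) {
  Sys.sleep(0.01)
  TRUE
}

done <- parallel_update(20, fake_update_row, n_workers = 4L)
```

Because the work is I/O-bound, the number of workers can exceed the number of cores, as the comment above describes. The best value depends on the server and network, so it is worth measuring.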