
How do I skip duplicate documents in a bulk insert?

Open rohitkhatri opened this issue 8 years ago • 8 comments

I'm trying to insert documents in bulk. I have created a unique index on my collection and want to skip duplicate documents during bulk insertion. This can be accomplished with the native MongoDB function:

db.collection.insert(
	<document or array of documents>,
	{
		ordered: false
	}
)

How can I achieve this in mongoengine?

rohitkhatri avatar Jan 09 '17 06:01 rohitkhatri

Unfortunately before we can support the ordered kwarg, we'll have to migrate to PyMongo 3.0+'s collection.insert_one and collection.insert_many methods. Right now we're still using the deprecated collection.insert, which doesn't support it.

In the meantime, you can use write_concern={'continue_on_error': True}. Note, however, that this won't be supported in future releases and is a hack around a poor implementation of the write_concern kwarg. You'll also have to wrap your insert in a try-except, catching NotUniqueError:

In [28]: from mongoengine import *

In [29]: connect('testdb')
Out[29]: MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())

In [30]: class Doc(Document):
    ...:     txt = StringField(unique=True)
    ...:

In [31]: Doc.drop_collection()

In [32]: Doc.objects.insert([Doc(txt='1'), Doc(txt='2')])
Out[32]: [<Doc: Doc object>, <Doc: Doc object>]

In [33]: try:
    ...:     Doc.objects.insert([Doc(txt='1'), Doc(txt='2'), Doc(txt='3')], write_concern={'continue_on_error': True})
    ...:
    ...: except NotUniqueError:
    ...:     pass
    ...:
    ...:

In [34]: Doc.objects.count()
Out[34]: 3

wojcikstefan avatar Jan 15 '17 04:01 wojcikstefan

Thanks :-)

rohitkhatri avatar Jan 15 '17 06:01 rohitkhatri

Is this write_concern={'continue_on_error': True} not supported anymore?

doaa-altarawy avatar Nov 01 '18 19:11 doaa-altarawy

No, it's not... But I think it makes sense to re-open this so that support for ordered in the bulk insert method (which uses insert_many behind the scenes) can be added someday.

bagerard avatar Nov 01 '18 20:11 bagerard

Is there any workaround for this in the current mongoengine release?

sohaibfarooqi avatar Nov 23 '18 02:11 sohaibfarooqi

For now I am using raw pymongo from mongoengine as a workaround. For a mongoengine Document class DocClass, you access the underlying pymongo collection and run the insert as shown here:

from pymongo.errors import BulkWriteError


try:
    # Convert ME objects to what pymongo can understand
    doc_list = [doc.to_mongo() for doc in me_doc_list]
    # ordered=False keeps inserting after an individual document fails
    DocClass._get_collection().insert_many(doc_list, ordered=False)

except BulkWriteError as bwe:
    print("Batch inserted with some errors. Some duplicates may have been found and skipped.")
    print(f"Count is {DocClass.objects.count()}.")

except Exception as e:
    print({'error': str(e)})

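One caveat with the snippet above: BulkWriteError is raised for any failed write in the batch, not only for duplicates. A minimal sketch of how the error details could be checked, assuming the usual shape of bwe.details (a dict with a "writeErrors" list whose entries carry "index", "code" and "errmsg"). The helper names split_write_errors and raise_unless_only_duplicates are made up for illustration and are not part of pymongo or mongoengine:

```python
DUPLICATE_KEY_CODE = 11000  # MongoDB server error code for a unique-index violation


def split_write_errors(details):
    """Split the writeErrors of a BulkWriteError.details dict into
    duplicate-key errors and all other write errors."""
    write_errors = details.get("writeErrors", [])
    duplicates = [err for err in write_errors if err.get("code") == DUPLICATE_KEY_CODE]
    others = [err for err in write_errors if err.get("code") != DUPLICATE_KEY_CODE]
    return duplicates, others


def raise_unless_only_duplicates(details):
    """Return the number of skipped duplicates, or raise if any
    non-duplicate write error is present (hypothetical policy)."""
    duplicates, others = split_write_errors(details)
    if others:
        first_msg = others[0].get("errmsg")
        raise RuntimeError(f"{len(others)} non-duplicate write error(s): {first_msg}")
    return len(duplicates)
```

Inside the except BulkWriteError as bwe: branch you would then call skipped = raise_unless_only_duplicates(bwe.details), so that only duplicate-key failures are silently skipped. This sketch looks at writeErrors only and ignores any writeConcernErrors in the details.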
SiddharthPant avatar Dec 08 '18 08:12 SiddharthPant

Is anybody working on this issue, or is it even in the backlog?

Prophetofcthulhu avatar Sep 20 '19 09:09 Prophetofcthulhu

Does anyone have a workaround for this? continue_on_error is now rejected as an unexpected write_concern argument.

fauzieuy avatar Oct 13 '22 03:10 fauzieuy