mongoengine
How do I skip duplicate documents in a bulk insert?
I'm trying to insert documents in bulk. I have created a unique index on my collection and want to skip duplicate documents during the bulk insert. This can be accomplished with the native MongoDB function:
db.collection.insert(
    <document or array of documents>,
    {
        ordered: false
    }
)
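For context, the effect of `ordered: false` can be sketched with a toy model in plain Python. This is only an illustration of the semantics, not how the server works internally: `simulate_insert` and its arguments are hypothetical names, and a set of keys stands in for the unique index. With ordered inserts, processing stops at the first failed document; with unordered inserts, the remaining documents are still attempted.

```python
def simulate_insert(existing_keys, new_keys, ordered=True):
    """Toy model of a bulk insert against a unique index (no database involved)."""
    stored = set(existing_keys)  # stands in for the unique index
    inserted, rejected = [], []
    for key in new_keys:
        if key in stored:
            rejected.append(key)
            if ordered:
                break  # ordered mode gives up at the first failure
            continue  # unordered mode moves on to the next document
        stored.add(key)
        inserted.append(key)
    return inserted, rejected


# The collection already holds '1' and '2'; we try to insert '1', '2', '3'.
print(simulate_insert({"1", "2"}, ["1", "2", "3"], ordered=True))
print(simulate_insert({"1", "2"}, ["1", "2", "3"], ordered=False))
```

In the ordered case nothing new is stored, because the very first document is a duplicate. In the unordered case the two duplicates are rejected and '3' is still inserted, which is the behaviour I'm after.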
How can I achieve this in mongoengine?
Unfortunately, before we can support the ordered kwarg, we'll have to migrate to PyMongo 3.0+'s collection.insert_one and collection.insert_many methods. Right now we're still using the deprecated collection.insert, which doesn't support it.
In the meantime, you can use write_concern={'continue_on_error': True}. Note, however, that this won't be supported in future releases and is a hack around a poor implementation of the write_concern kwarg. You'll also have to wrap your insert in a try-except, catching NotUniqueError:
In [28]: from mongoengine import *
In [29]: connect('testdb')
Out[29]: MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())
In [30]: class Doc(Document):
    ...:     txt = StringField(unique=True)
    ...:
In [31]: Doc.drop_collection()
In [32]: Doc.objects.insert([Doc(txt='1'), Doc(txt='2')])
Out[32]: [<Doc: Doc object>, <Doc: Doc object>]
In [33]: try:
    ...:     Doc.objects.insert([Doc(txt='1'), Doc(txt='2'), Doc(txt='3')], write_concern={'continue_on_error': True})
    ...: except NotUniqueError:
    ...:     pass
    ...:
In [34]: Doc.objects.count()
Out[34]: 3
Thanks :-)
Is write_concern={'continue_on_error': True} no longer supported?
No, it's not. But I think it makes sense to re-open this so that support for ordered in the bulk insert method (which uses insert_many behind the scenes) can be added someday.
Is there any workaround for this in the current mongoengine release?
For now I am using raw pymongo through mongoengine as a workaround. For a mongoengine Document class DocClass, you access the underlying pymongo collection and run the insert like this:
from pymongo.errors import BulkWriteError

try:
    # Convert the MongoEngine documents to a form pymongo can understand
    doc_list = [doc.to_mongo() for doc in me_doc_list]
    # ordered=False keeps inserting the remaining documents after a failure
    DocClass._get_collection().insert_many(doc_list, ordered=False)
except BulkWriteError as bwe:
    print("Batch inserted with some errors. Some duplicates may have been found and skipped.")
    print(f"Count is {DocClass.objects.count()}.")
except Exception as e:
    print({'error': str(e)})
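One caveat with the snippet above is that BulkWriteError is raised for any failed write, not only for duplicates. A minimal sketch of how the two could be told apart follows, assuming the error details have the shape of pymongo's `BulkWriteError.details` (a dict with a `writeErrors` list whose entries carry a `code`) and that duplicate-key violations use MongoDB error code 11000. The helper name `split_write_errors` and the sample dict are made up for illustration; the sample is hand-written, not real server output.

```python
DUPLICATE_KEY_CODE = 11000  # MongoDB's error code for a unique-index violation


def split_write_errors(details):
    """Separate duplicate-key errors from all other write errors.

    `details` is assumed to look like pymongo's BulkWriteError.details.
    """
    duplicates, others = [], []
    for err in details.get("writeErrors", []):
        if err.get("code") == DUPLICATE_KEY_CODE:
            duplicates.append(err)
        else:
            others.append(err)
    return duplicates, others


# Hand-written sample in the assumed shape (not captured from a real server).
sample = {
    "nInserted": 1,
    "writeErrors": [
        {"index": 0, "code": 11000, "errmsg": "E11000 duplicate key error"},
        {"index": 2, "code": 121, "errmsg": "Document failed validation"},
    ],
}
dups, others = split_write_errors(sample)
print(f"{len(dups)} duplicate(s) skipped, {len(others)} other error(s)")
```

Inside the `except BulkWriteError as bwe:` branch you could call `split_write_errors(bwe.details)` and re-raise when `others` is not empty, so that only genuine duplicates are silently skipped.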
Is anybody working on this issue, or is it even in the backlog?
Does anyone have a workaround for this, since continue_on_error is reported as an unexpected write_concern argument?