mongoengine
Mongoengine is very slow on large documents compared to native pymongo usage
(See also this StackOverflow question)
I have the following mongoengine model:
class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    data_dict_2 = DictField(required=True)
In some cases the document in the DB can be very large (around 5-10MB), and the data_dict fields contain complex nested documents (dict of lists of dicts, etc...).
I have encountered two (possibly related) issues:
- When I run a native pymongo find_one() query, it returns within a second. When I run MyModel.objects.first(), it takes 5-10 seconds.
- When I query a single large document from the DB and then access one of its fields, it takes 10-20 seconds just to do the following:
m = MyModel.objects.first()
val = m.data_dict_1.get(some_key)
The data in the object does not contain any references to any other objects, so it is not an issue of object dereferencing. I suspect it is related to some inefficiency in mongoengine's internal data representation, which affects both document object construction and field access.
Hi,
Have you profiled the execution with Profile/cProfile? A graph of it with objgraph should give us a better view of where the trouble is.
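For reference, a minimal cProfile session looks something like this; slow_query is a stand-in for whatever call is being measured (e.g. MyModel.objects.first()), since no database is assumed here:

```python
import cProfile
import io
import pstats

def slow_query():
    # Stand-in for the slow call being profiled (e.g. MyModel.objects.first()).
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
slow_query()
profiler.disable()

# Render the stats into a string, sorted by cumulative time, top 10 entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

Sorting by cumulative time is what surfaces hotspots like DictField.to_python, because the cost shows up in the callers that fan out into it.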
Hi,
Please see the following chart: http://i.stack.imgur.com/qAb0t.png (attached in the answer to my StackOverflow question) - it shows that the bottleneck is in "DictField.to_python" (being called 600,000 times).
Have you tried using MongoEngine with PyPy? Most of the overhead is gone thanks to PyPy's JIT. But you're right, we need a better benchmark suite.
The entire approach of eager conversion is potentially fundamentally flawed. Lazy conversion on first access would defer all conversion overhead to the point where the structure is actually accessed, eliminating it completely when no access is made. (Beyond that, such a situation indicates that use of .only() and related helpers is warranted.)
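A minimal sketch of what lazy conversion could look like; LazyDict and to_python are illustrative names, not MongoEngine API, and the counter exists only to show when conversion actually happens:

```python
class LazyDict:
    """Sketch of lazy conversion: raw wire values are converted to Python
    objects only on first access, and the result is memoized."""

    def __init__(self, raw, to_python):
        self._raw = raw            # unconverted data, as it came off the wire
        self._converted = {}       # cache of already-converted values
        self._to_python = to_python

    def __getitem__(self, key):
        if key not in self._converted:
            self._converted[key] = self._to_python(self._raw[key])
        return self._converted[key]

conversions = []

def to_python(value):
    conversions.append(value)      # record each conversion that actually runs
    return value

doc = LazyDict({"a": 1, "b": 2}, to_python)
assert conversions == []           # nothing converted yet: zero up-front cost
doc["a"]
doc["a"]
assert conversions == [1]          # converted once, on first access only
```

A document that is loaded but never inspected would pay no conversion cost at all under this scheme.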
Duplicate of #1137 ?
@amcgregor, your help is valuable.
Would you like to join @MongoEngine/mongoengine team and/or get write access to the repo?
Even if you don't intend to contribute code right now, at least you'd be able to act on the bugtracker, close/tag issues.
I can't speak on behalf of the team, but I think a new recruit would be appreciated.
@lafrech Unfortunately, I'm going to have to decline the offer. Ref: marrow/contentment#12… though I didn't expect Github to ping each referenced ticket like that. Apologies! O_O
@amcgregor No pb. Thanks for your inputs, anyway.
I've played a bit with a snippet, with some modifications (https://gist.github.com/apolkosnik/e94327e92dd2e0642e2b263efd87d1b1), then ran it against MongoEngine 0.8.8 and 0.11:
Please see the pictures...
On 0.8.8:
mongoengine with dict took 16.95s
On 0.11:
mongoengine with dict took 32.74s
It looks like some change from 0.8.8 to 0.9+ caused get() in the ComplexBaseField class to go on a dereference spree for dicts.
@wojcikstefan first of all, thank you for your contributions to MongoEngine.
We are using Mongoengine heavily in production and running into this issue. Is this something you are actively looking into?
@sauravshah I started investigating this issue and planned to release a fix for it.
If you cannot wait, the trouble is in the Document initialization, where a class is created and then instantiated.
By replacing self._data = SemiStrictDict.create(allowed_keys=self._fields_ordered)() with a simple self._data = {}, I get a 30% boost on my entire application (not a microbenchmark).
It is the same for StrictDict, but that is not as simple to fix (the StrictDict subclass should be generated in the metaclass defining the Document). However, I didn't see where it is really used.
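The cost pattern can be illustrated with a toy microbenchmark; make_semi_strict is a greatly simplified stand-in for SemiStrictDict.create (the real implementation differs), and the point is only that building a class per document is far more expensive than a dict literal:

```python
import timeit

def make_semi_strict(allowed_keys):
    # Simplified stand-in for SemiStrictDict.create(): a brand-new class
    # is built on every call, with __slots__ for the allowed keys.
    class SemiStrict:
        __slots__ = tuple(allowed_keys) + ("_extras",)
    return SemiStrict

keys = ("date", "data_dict_1", "data_dict_2")

# The pattern being replaced: create a class, then instantiate it, per document.
t_class = timeit.timeit(lambda: make_semi_strict(keys)(), number=20000)
# The proposed replacement: a plain dict literal.
t_dict = timeit.timeit(lambda: {}, number=20000)
```

With 20,000 documents, the per-instance class creation dominates, which is consistent with the 30% application-level speedup reported above.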
There are two other performance issues that could hit you:
#1446 (pymongo 3.4 doesn't have ensure_index, so a create_index request is actually sent to the MongoDB server before every save done by mongoengine). The solution is to handle index creation manually and disable it in mongoengine with meta = {..., 'auto_create_index': False}.
#298 Accessing a reference field causes the referenced document to be fetched from the database. This is really an issue if you only wanted to access its _id field, which was already known in the first place... I'm working on a fix for this, but it is a really complicated issue given that early dereferencing is at the core of mongoengine :(
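For reference, the workaround for #1446 looks something like this (MyModel and the index list are illustrative; the indexes themselves would then be created out of band, e.g. during deployment):

```python
from mongoengine import DateTimeField, DictField, Document

class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    meta = {
        "indexes": ["date"],
        # Stop mongoengine from sending create_index before every save;
        # indexes must now be created manually.
        "auto_create_index": False,
    }
```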
Thanks, @touilleMan - that helps a bit. I looked at closeio's fork of the project and they seem to have some good ideas.
Thank you for pointing to the other issues, they have already started hitting us in production :). I'm excited to know that you are on it, really looking forward to a performant mongoengine.
Please let us know if you figure out any more quick wins in the meantime! Also, let me know if I can help in any way.
@touilleMan were you able to fix these?
@sauravshah sorry, I had a branch ready but forgot to issue a PR. Here it is: #1630; can you have a look at it?
Considering #298, I've tried numerous methods to create lazy referencing, but it involves too much magic (typically, when there is inheritance within the referenced class, you can't know before dereferencing what the type of the returned instance will be).
So in the end, I will try to provide a new type of field, LazyReferenceField, which would return a Reference class instance, allowing access to pk or a call to fetch() to get back the actual document. But this means one would have to rework their code to make use of this feature :-(
@touilleMan #1630 looks good to me.
Reg. #298, is it possible to take the class to be referenced as a kwarg on ReferenceField and solve this issue? Calling .fetch would be too much rework in most cases (including ours). Also, how would you solve the referenced-class issue in .fetch?
is it possible to take the class to be referenced as a kwarg on ReferenceField and solve this issue?
Not sure what you mean...
We could add __getattr__/__setattr__ methods to the LazyReference, which would dereference the document when it is accessed or modified. This way you wouldn't have to change your code, except if you use isinstance, but this should greatly reduce the amount of things that need to be reworked ;-)
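A pure-Python sketch of that proxy idea, with no database involved; LazyReference, Doc, and fake_fetch are all illustrative names, and the fetches list just counts simulated round-trips:

```python
class LazyReference:
    """Sketch of the proposed proxy: holds only the primary key until a
    real attribute is touched, then dereferences once and forwards every
    later access to the fetched document."""

    def __init__(self, pk, fetch):
        self.pk = pk          # known up front; reading it never hits the DB
        self._fetch = fetch   # callable standing in for a database query
        self._doc = None

    def _resolve(self):
        if self._doc is None:
            self._doc = self._fetch(self.pk)   # one query, memoized
        return self._doc

    def __getattr__(self, name):
        # Only invoked for attributes *not* found on the proxy itself,
        # so pk/_doc/_fetch never trigger a dereference.
        return getattr(self._resolve(), name)

class Doc:
    def __init__(self, pk):
        self.pk = pk
        self.name = f"doc-{pk}"

fetches = []

def fake_fetch(pk):
    fetches.append(pk)        # count simulated DB round-trips
    return Doc(pk)

ref = LazyReference(42, fake_fetch)
assert ref.pk == 42 and fetches == []   # pk access: no query at all
assert ref.name == "doc-42"             # first real access: exactly one query
assert fetches == [42]                  # memoized afterwards
```

Note that isinstance(ref, Doc) is still False here, which is exactly the isinstance caveat mentioned above.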
Why can't we follow this approach?
class A(Document):
    b = ReferenceField(B)
When A is loaded, we already have B's id, so an instance of class B can be created with just the id (with a flag on the class to denote it has not been loaded yet). isinstance would work correctly in this case.
Once __getattr__/__setattr__ is called, a query to the DB could load the actual mongo document.
The trouble is that B may have child classes:
class B(Document):
    meta = {'allow_inheritance': True}

class Bchild(B):
    pass

class A(Document):
    b = ReferenceField(B)

b_child = Bchild().save()
a = A(b=b_child).save()
In this example you cannot know that a.b is a Bchild instance before dereferencing it.
Ah ok, I understand the problem now. This is not a big deal for us (and most projects, I would assume).
For backward compatibility, is it possible to add a kwarg to LazyReference (maybe ignore_inheritance) and make isinstance work when that kwarg is present? isinstance is being used all over the place in django-mongoengine, so it would be great to not dereference on it.
As an interesting idea, it can know what it references prior to dereferencing if the _cls reference is stored in the DBRef (concrete; technically allowed via **kwargs inclusion in the resulting SON), or if it is stored similarly to a CachedReferenceField that incorporates that value.
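A sketch of how a stored import-path reference could be resolved back to a concrete class; the 'module:Class' format, the _cls key, and the dbref_like mapping are all illustrative (a stdlib class stands in for the model), matching the idea described above:

```python
import importlib

def resolve_cls(path):
    """Resolve an import-path reference such as 'myapp.model:Foo' back to
    the class it names, without instantiating or fetching anything."""
    module_name, _, class_name = path.partition(":")
    return getattr(importlib.import_module(module_name), class_name)

# A DBRef-like mapping carrying the concrete class alongside the id: the
# correct type is known before any dereferencing query is made, so
# isinstance-style checks could work without a database round-trip.
dbref_like = {"$ref": "docs", "$id": 123, "_cls": "collections:OrderedDict"}
cls = resolve_cls(dbref_like["_cls"])
assert cls.__name__ == "OrderedDict"
```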
Does anyone know if there is a patch in the works for this issue?
@benjhastings #1690 is the solution, but it requires some changes to your code (switching from ReferenceField to LazyReferenceField).
How does that work if a DictField is used, though, as per https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-compared-to-native-pymongo-usage/35274930#35274930 ?
@benjhastings If your perf trouble comes from a too-big document... well, there is nothing that can save you right now :-( I guess DictField could be improved (or a RawDictField could be created) to do no deserialization at all on the data.
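The cost difference can be sketched with a counter; eager_to_python is a simplified stand-in for the recursive DictField.to_python fan-out, and raw_passthrough models the hypothetical RawDictField:

```python
calls = {"n": 0}

def eager_to_python(value):
    # Mimics eager conversion: visit every nested value up front,
    # whether or not it is ever read.
    calls["n"] += 1
    if isinstance(value, dict):
        return {k: eager_to_python(v) for k, v in value.items()}
    if isinstance(value, list):
        return [eager_to_python(v) for v in value]
    return value

def raw_passthrough(value):
    # Hypothetical RawDictField behaviour: hand the data back untouched.
    return value

# A small nested document; the real offenders have hundreds of thousands
# of nodes (hence the 600,000 to_python calls observed in the profile).
doc = {"k%d" % i: {"xs": [{"v": j} for j in range(10)]} for i in range(10)}

eager_to_python(doc)
eager_calls = calls["n"]    # every node visited once, up front
raw_passthrough(doc)        # zero per-node work
```

Even on this toy document the eager path touches every node (221 of them), while the raw path does no per-node work at all; the gap scales with document size.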
I have written an alternative Document container type which internally preserves the MongoDB-native value types rather than Python-typecast values, casts only on actual attribute access (get/set) to the desired field, and is directly usable with PyMongo base APIs as if it were a dictionary; no conversion on use. There is no eager bulk conversion of records' values as they stream in, which is a large part of the overhead, and similarly no eager dereferencing (additional independent read queries for each record loaded), in keeping with a hardline interpretation of "explicit is better than implicit". Relevant links:
- Document (MutableMapping proxy to an ordered dictionary)
- Container (underlying declarative base class)
- Reference (Field subclass)
Use of a Reference(Foo, concrete=True, cache=['_cls']) would store an import-path reference (e.g. "myapp.model:Foo") within the DBRef. (If Foo is a model that allows subclassing; typically by inheriting the Derived trait, which defines and automates the calculation of a _cls field import reference.)
...well...i just got annoyed enough with mongoengine enough to google what's what to find this...great.
should be on current versions of pymongo and mongoengine per pip install -U.
here's my output a la @apolkosnik:
dict:
embed:
console:
pymongo with dict took 0.06s
pymongo with embed took 0.06s
mongoengine with dict took 16.72s
mongoengine with embed took 0.74s
mongoengine with dict as_pymongo() took 0.06s
mongoengine with embed as_pymongo() took 0.06s
mongoengine aggregation with dict took 0.11s
mongoengine aggregation with embed took 0.11s
if DictField is the issue, then please for the love of all that is holy, let us know what to change it to or fix it. watching mongo and pymongo respond almost immediately and then waiting close to a minute for mongoengine to...do whatever it's doing...kind of a massive bottleneck. dig the rest of the package, but if this can't be resolved on the package side...
cricket...cricket...
oh look at that. pymodm. and to porting we go.
Just hit this bottleneck inserting a couple ~5MB documents. Pretty much a deal breaker, having an insert that takes less than a second with pymongo take over a minute with MongoEngine.
@nickfaughey I switched to pymodm. It took very little, if any, modification of my existing code and is lightning fast. And it's by MongoDB, so development is ongoing.