mongoengine
Mongoengine is very slow on large documents compared to native pymongo usage
(See also this StackOverflow question)
I have the following mongoengine model:
class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    data_dict_2 = DictField(required=True)
In some cases the document in the DB can be very large (around 5-10MB), and the data_dict fields contain complex nested documents (dict of lists of dicts, etc...).
I have encountered two (possibly related) issues:
- When I run a native pymongo find_one() query, it returns within a second. When I run MyModel.objects.first(), it takes 5-10 seconds.
- When I query a single large document from the DB and then access one of its fields, it takes 10-20 seconds just to do the following:
m = MyModel.objects.first()
val = m.data_dict_1.get(some_key)
The data in the object does not contain any references to any other objects, so it is not an issue of object dereferencing. I suspect it is related to some inefficiency in mongoengine's internal data representation, which affects both document object construction and field access.
Hi,
Have you profiled the execution with Profile/cProfile? A graph of it with objgraph should give us a better view of where the trouble is.
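For reference, a minimal cProfile session looks something like this; slow_query is a stand-in for whatever call is being measured (e.g. MyModel.objects.first()), since no database is assumed here:

```python
import cProfile
import io
import pstats

def slow_query():
    # Stand-in for the slow call being profiled (e.g. MyModel.objects.first()).
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
slow_query()
profiler.disable()

# Render the stats into a string, sorted by cumulative time, top 10 entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

Sorting by cumulative time is what surfaces hotspots like DictField.to_python, because the cost shows up in the callers that fan out into it.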
Hi,
Please see the following chart: http://i.stack.imgur.com/qAb0t.png (attached in the answer to my StackOverflow question) - it shows that the bottleneck is in "DictField.to_python" (being called 600,000 times).
Have you tried using MongoEngine with PyPy? Most of the overhead is gone thanks to PyPy's JIT. But you're right, we need a better benchmark suite.
The entire approach of eager conversion is potentially fundamentally flawed. Lazy conversion on first access would defer all conversion overhead to the point where the structure is actually accessed, eliminating it completely when no access is made. (Beyond that, such a situation indicates that use of .only() and related helpers is warranted.)
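A minimal sketch of what lazy conversion could look like; LazyDict and to_python are illustrative names, not MongoEngine API, and the counter exists only to show when conversion actually happens:

```python
class LazyDict:
    """Sketch of lazy conversion: raw wire values are converted to Python
    objects only on first access, and the result is memoized."""

    def __init__(self, raw, to_python):
        self._raw = raw            # unconverted data, as it came off the wire
        self._converted = {}       # cache of already-converted values
        self._to_python = to_python

    def __getitem__(self, key):
        if key not in self._converted:
            self._converted[key] = self._to_python(self._raw[key])
        return self._converted[key]

conversions = []

def to_python(value):
    conversions.append(value)      # record each conversion that actually runs
    return value

doc = LazyDict({"a": 1, "b": 2}, to_python)
assert conversions == []           # nothing converted yet: zero up-front cost
doc["a"]
doc["a"]
assert conversions == [1]          # converted once, on first access only
```

A document that is loaded but never inspected would pay no conversion cost at all under this scheme.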
Duplicate of #1137 ?
@amcgregor, your help is valuable.
Would you like to join @MongoEngine/mongoengine team and/or get write access to the repo?
Even if you don't intend to contribute code right now, at least you'd be able to act on the bugtracker, close/tag issues.
I can't speak on behalf of the team, but I think a new recruit would be appreciated.
@lafrech Unfortunately, I'm going to have to decline the offer. Ref: marrow/contentment#12… though I didn't expect Github to ping each referenced ticket like that. Apologies! O_O
@amcgregor No pb. Thanks for your inputs, anyway.
I've played a bit with a snippet, with some modifications (https://gist.github.com/apolkosnik/e94327e92dd2e0642e2b263efd87d1b1), then ran it against MongoEngine 0.8.8 and 0.11:
Please see the pictures...
On 0.8.8:
mongoengine with dict took 16.95s
On 0.11:
mongoengine with dict took 32.74s
It looks like some change from 0.8.8 to 0.9+ caused get() in the ComplexBaseField class to go on a dereference spree for dicts.
@wojcikstefan first of all, thank you for your contributions to MongoEngine.
We are using Mongoengine heavily in production and running into this issue. Is this something you are actively looking into?
@sauravshah I started investigating this issue and planned to release a fix for it.
If you cannot wait, the trouble is in the Document initialization, where a class is created and then instantiated.
By replacing self._data = SemiStrictDict.create(allowed_keys=self._fields_ordered)() with a simple self._data = {}, I get a 30% boost on my entire application (not a microbenchmark).
It is the same for StrictDict, but that is not as simple to fix (the StrictDict subclass should be generated in the metaclass defining the Document). However, I didn't see where it is really used.
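The cost pattern can be illustrated with a toy microbenchmark; make_semi_strict is a greatly simplified stand-in for SemiStrictDict.create (the real implementation differs), and the point is only that building a class per document is far more expensive than a dict literal:

```python
import timeit

def make_semi_strict(allowed_keys):
    # Simplified stand-in for SemiStrictDict.create(): a brand-new class
    # is built on every call, with __slots__ for the allowed keys.
    class SemiStrict:
        __slots__ = tuple(allowed_keys) + ("_extras",)
    return SemiStrict

keys = ("date", "data_dict_1", "data_dict_2")

# The pattern being replaced: create a class, then instantiate it, per document.
t_class = timeit.timeit(lambda: make_semi_strict(keys)(), number=20000)
# The proposed replacement: a plain dict literal.
t_dict = timeit.timeit(lambda: {}, number=20000)
```

With 20,000 documents, the per-instance class creation dominates, which is consistent with the 30% application-level speedup reported above.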
There are two other performance issues that could hit you:
#1446 (pymongo 3.4 doesn't have ensure_index, so a create_index request is actually sent to the MongoDB server before every save done by mongoengine). The solution is to handle index creation manually and disable it in mongoengine with meta = {..., 'auto_create_index': False}.
#298 Accessing a reference field causes the referenced document to be fetched from the database. This is really an issue if you only wanted to access its _id field, which was already known in the first place... I'm working on a fix for this, but it is a really complicated issue given that early dereferencing is at the core of mongoengine :(
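For reference, the workaround for #1446 looks something like this (MyModel and the index list are illustrative; the indexes themselves would then be created out of band, e.g. during deployment):

```python
from mongoengine import DateTimeField, DictField, Document

class MyModel(Document):
    date = DateTimeField(required=True)
    data_dict_1 = DictField(required=False)
    meta = {
        "indexes": ["date"],
        # Stop mongoengine from sending create_index before every save;
        # indexes must now be created manually.
        "auto_create_index": False,
    }
```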
Thanks, @touilleMan - that helps a bit. I looked at closeio's fork of the project and they seem to have some good ideas.
Thank you for pointing to the other issues, they have already started hitting us in production :). I'm excited to know that you are on it, really looking forward to a performant mongoengine.
Please let us know if you figure out any more quick wins in the meantime! Also, let me know if I can help in any way.
@touilleMan were you able to fix these?
@sauravshah sorry, I had a branch ready but forgot to issue a PR. Here it is: #1630; can you have a look at it?
Considering #298, I've tried numerous methods to create lazy referencing, but it involves too much magic (typically, when there is inheritance within the referenced class, you can't know before dereferencing what the type of the returned instance will be).
So in the end, I will try to provide a new type of field, LazyReferenceField, which would return a Reference class instance, allowing access to pk or a call to fetch() to get back the actual document. But this means one would have to rework their code to make use of this feature :-(
@touilleMan #1630 looks good to me.
Reg. #298, is it possible to take the class to be referenced as a kwarg on ReferenceField and solve this issue? Calling .fetch would be too much rework in most cases (including ours). Also, how would you solve the referenced-class issue in .fetch?
is it possible to take the class to be referenced as a kwarg on ReferenceField and solve this issue?
Not sure what you mean...
We could add __getattr__/__setattr__ methods to the LazyReference, which would dereference the document when it is accessed or modified. This way you wouldn't have to change your code, except if you use isinstance, but this should greatly reduce the amount of things that need to be reworked ;-)
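A pure-Python sketch of that proxy idea, with no database involved; LazyReference, Doc, and fake_fetch are all illustrative names, and the fetches list just counts simulated round-trips:

```python
class LazyReference:
    """Sketch of the proposed proxy: holds only the primary key until a
    real attribute is touched, then dereferences once and forwards every
    later access to the fetched document."""

    def __init__(self, pk, fetch):
        self.pk = pk          # known up front; reading it never hits the DB
        self._fetch = fetch   # callable standing in for a database query
        self._doc = None

    def _resolve(self):
        if self._doc is None:
            self._doc = self._fetch(self.pk)   # one query, memoized
        return self._doc

    def __getattr__(self, name):
        # Only invoked for attributes *not* found on the proxy itself,
        # so pk/_doc/_fetch never trigger a dereference.
        return getattr(self._resolve(), name)

class Doc:
    def __init__(self, pk):
        self.pk = pk
        self.name = f"doc-{pk}"

fetches = []

def fake_fetch(pk):
    fetches.append(pk)        # count simulated DB round-trips
    return Doc(pk)

ref = LazyReference(42, fake_fetch)
assert ref.pk == 42 and fetches == []   # pk access: no query at all
assert ref.name == "doc-42"             # first real access: exactly one query
assert fetches == [42]                  # memoized afterwards
```

Note that isinstance(ref, Doc) is still False here, which is exactly the isinstance caveat mentioned above.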
Why can't we follow this approach?
class A(Document):
    b = ReferenceField(B)
When A is loaded, we already have B's id, so an instance of class B can be created with just the id (with a flag on the class to denote it has not been loaded yet). isinstance would work correctly in this case.
Once __getattr__/__setattr__ is called, a query to the DB could load the actual mongo document.
The trouble is that B may have child classes:
class B(Document):
    meta = {'allow_inheritance': True}

class Bchild(B):
    pass

class A(Document):
    b = ReferenceField(B)

b_child = Bchild().save()
a = A(b=b_child).save()
In this example you cannot know that a.b is a Bchild instance before dereferencing it.
Ah ok, I understand the problem now. This is not a big deal for us (and most projects, I would assume).
For backward compatibility, is it possible to add a kwarg to LazyReference (maybe ignore_inheritance) and make isinstance work when that kwarg is present? isinstance is being used all over the place in django-mongoengine, so it would be great to not dereference on it.
As an interesting idea, it can know what it references prior to dereferencing if the _cls reference is stored in the DBRef (concrete; technically allowed via **kwargs inclusion in the resulting SON), or if it is stored similarly to a CachedReferenceField that incorporates that value.
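A sketch of how a stored import-path reference could be resolved back to a concrete class; the 'module:Class' format, the _cls key, and the dbref_like mapping are all illustrative (a stdlib class stands in for the model), matching the idea described above:

```python
import importlib

def resolve_cls(path):
    """Resolve an import-path reference such as 'myapp.model:Foo' back to
    the class it names, without instantiating or fetching anything."""
    module_name, _, class_name = path.partition(":")
    return getattr(importlib.import_module(module_name), class_name)

# A DBRef-like mapping carrying the concrete class alongside the id: the
# correct type is known before any dereferencing query is made, so
# isinstance-style checks could work without a database round-trip.
dbref_like = {"$ref": "docs", "$id": 123, "_cls": "collections:OrderedDict"}
cls = resolve_cls(dbref_like["_cls"])
assert cls.__name__ == "OrderedDict"
```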
Does anyone know if there is a patch in the works for this issue?
@benjhastings #1690 is the solution, but it requires some changes to your code (switching from ReferenceField to LazyReferenceField).
How does that work if a DictField is used, though, as per https://stackoverflow.com/questions/35257305/mongoengine-is-very-slow-on-large-documents-compared-to-native-pymongo-usage/35274930#35274930 ?
@benjhastings If your perf trouble comes from a too-big document... well, there is nothing that can save you right now :-( I guess DictField could be improved (or a RawDictField could be created) to do no deserialization at all on the data.
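The cost difference can be sketched with a counter; eager_to_python is a simplified stand-in for the recursive DictField.to_python fan-out, and raw_passthrough models the hypothetical RawDictField:

```python
calls = {"n": 0}

def eager_to_python(value):
    # Mimics eager conversion: visit every nested value up front,
    # whether or not it is ever read.
    calls["n"] += 1
    if isinstance(value, dict):
        return {k: eager_to_python(v) for k, v in value.items()}
    if isinstance(value, list):
        return [eager_to_python(v) for v in value]
    return value

def raw_passthrough(value):
    # Hypothetical RawDictField behaviour: hand the data back untouched.
    return value

# A small nested document; the real offenders have hundreds of thousands
# of nodes (hence the 600,000 to_python calls observed in the profile).
doc = {"k%d" % i: {"xs": [{"v": j} for j in range(10)]} for i in range(10)}

eager_to_python(doc)
eager_calls = calls["n"]    # every node visited once, up front
raw_passthrough(doc)        # zero per-node work
```

Even on this toy document the eager path touches every node (221 of them), while the raw path does no per-node work at all; the gap scales with document size.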
I have written an alternative Document container type which internally preserves the MongoDB-native value types rather than Python-typecast values, casts only on actual attribute access (get/set) to the desired field, and is directly usable with PyMongo base APIs as if it were a dictionary; no conversion on use. There is no eager bulk conversion of records' values as they stream in, which is a large part of the overhead, and similarly no eager dereferencing (additional independent read queries for each record loaded), in keeping with a hardline interpretation of "explicit is better than implicit". Relevant links:
- Document (MutableMapping proxy to an ordered dictionary)
- Container (underlying declarative base class)
- Reference (Field subclass)
Use of a Reference(Foo, concrete=True, cache=['_cls']) would store an import-path reference (e.g. "myapp.model:Foo") within the DBRef. (If Foo is a model that allows subclassing; typically by inheriting the Derived trait, which defines and automates the calculation of a _cls field import reference.)
...well...i just got annoyed enough with mongoengine enough to google what's what to find this...great.
should be on current versions of pymongo and mongoengine per pip install -U.
here's my output a la @apolkosnik:
dict:
embed:
console:
pymongo with dict took 0.06s
pymongo with embed took 0.06s
mongoengine with dict took 16.72s
mongoengine with embed took 0.74s
mongoengine with dict as_pymongo() took 0.06s
mongoengine with embed as_pymongo() took 0.06s
mongoengine aggregation with dict took 0.11s
mongoengine aggregation with embed took 0.11s
if DictField is the issue, then please for the love of all that is holy, let us know what to change it to or fix it. watching mongo and pymongo respond almost immediately and then waiting close to a minute for mongoengine to...do whatever it's doing...kind of a massive bottleneck. dig the rest of the package, but if this can't be resolved on the package side...
cricket...cricket...
oh look at that. pymodm. and to porting we go.
Just hit this bottleneck inserting a couple ~5MB documents. Pretty much a deal breaker, having an insert that takes less than a second with pymongo take over a minute with MongoEngine.
@nickfaughey I switched to pymodm. It took very little, if any, modification of my existing code and is lightning fast. And it's by MongoDB, so development is ongoing.