mongoengine icon indicating copy to clipboard operation
mongoengine copied to clipboard

Performance issue in 0.21+ with count()

Open amir20 opened this issue 4 years ago • 15 comments

I am currently on 0.20.0 for mongoengine.

I have a route /healthcheck which does something like Foo.object.count() to make sure mongo connection is valid.

When I try upgrading mongoengine to 0.21.0 my healthcheck started timing out. As a result, my app never looked healthy so docker killed it. I waited a couple months and noticed there was a new version.

With 0.22.x, I am still having the same issue.

I don't know how to go about debugging this and it is hard to reproduce locally because I only have 1% of production data. It seems like it just hangs.

Has this been issue for anybody else?

amir20 avatar Feb 12 '21 17:02 amir20

Could you run the same query with mongo/pymongo? There was a change related with .count in latest releases due to a deprecation in pymongo and performance degradation were to expect unfortunately.

bagerard avatar Feb 12 '21 22:02 bagerard

Should I upgrade mongoengine first? Or just run with pymongo which is currently frozen at pymongo==3.11.3?

amir20 avatar Feb 12 '21 22:02 amir20

Hmm so interesting...

With mongo, it works as expected and no error messages.

With pymongo, I see

<ipython-input-19-8a96c6b45f79>:1: DeprecationWarning: count is deprecated. Use estimated_document_count or count_documents instead. Please note that $where must be replaced by $expr, $near must be replaced by $geoWithin with $center, and $nearSphere must be replaced by $geoWithin with $centerSphere
  db.player.count()

I tried estimated_document_count() and it also times out after 30 seconds.

Is there a new way to get count() that I am not aware of?

amir20 avatar Feb 12 '21 23:02 amir20

I have been looking through documentation and can't figure out how to call estimated_document_count(). Is it possible with mongoengine?

amir20 avatar Feb 13 '21 03:02 amir20

I was able to get estimated document to work with Foo._get_collection().estimated_document_count().

I don't think Mongoengine has a way to query using estimated document count.

amir20 avatar Feb 14 '21 23:02 amir20

It would be valuable to add support for estimated_count. Since its an estimated count, it could not be the default behavior of .count() (which in current version, relies on count_documents)

bagerard avatar Feb 15 '21 08:02 bagerard

My $0.02 is:

  1. For queries like, Foo.objects.count() should use estimated_document_count()
  2. For queries like, Foo.objects(some_value=...).count() should use document_count()

I think that makes most sense.

amir20 avatar Feb 16 '21 20:02 amir20

Agreed with @amir20

Add support for new pymongo cursor operations document_count() and estimated_document_count().

count() should have deprication warning and use estimated_document_count() so that the functionality doens't change suddenly like it it did for us and everyone when upgrading. Impact is pretty bad if you have big collections.

vainu-jussi avatar May 06 '21 10:05 vainu-jussi

So... I don't really get it, should we now directly use pyMongo's db.collection.count_documents(query) instead of mongoengine Collection.objects(query).count()?

Our performance has gone pretty bad since that update and we have so many count() in our code. Thanks in advance.

EDIT: ok I got it now, mongoengine now uses pyMongo's count_documents(), and it's too slow... Maybe we will downgrade the version of mongoengine or directly use pyMongo's deprecated count()...

ugoQ avatar Jun 03 '21 13:06 ugoQ

Feel free to help and push a MR, amir's suggestion sounds good

bagerard avatar Jun 03 '21 20:06 bagerard

We have modified the way we count things. Main problem was when there was no query (filter) at all. So now we just switch from count() to estimated_documents_count() when it's necessary (we dynamically populate our queries so we never use the notation Foo.objects.count()) and it seems there is no performance problems anymore

ugoQ avatar Jun 04 '21 12:06 ugoQ

Alright, I've started working on this. It's actually pretty simple to change the current count() behavior (which relies on count_documents only) and make it use count_documents/estimated_count_documents based on whether or not there is a query.

There is additional rules that are required as the 2 methods don't support the same parameters (e.g estimated_count_documents supports hint, but not collation unlike count_documents) but we could get something working with simple tweaks.

Out of curiosity, I looked at how Mongoose was dealing with this and they took another direction, they are actually exposing the 3 methods on their Model API:

  • count : here
  • count_documents: here
  • estimated_count_documents: here

I believe this is the most flexible solution and as it provides the ability to users to use or or another.

Keeping that in mind, what I have in mind is:

  • have Model.objects.count() using count_documents or estimated_count_documents based on a few rules of thumb. This will fit most use cases, but not necessarily all of them
  • later (and based on user feedback), add count_documents() and estimated_count_documents() so that users will have a way to force one or the other

bagerard avatar Jun 05 '21 21:06 bagerard

Any progress here? Just the getting support for the new count_documents() and estimated_count_documents() would be a great start. This would make it possible to fix it without reverting to old versions.

vainu-jussi avatar Aug 03 '21 13:08 vainu-jussi

There seems to MR waiting for review this since the summer at https://github.com/MongoEngine/mongoengine/pull/2529 from @bagerard

Any change of getting it through?

vainu-jussi avatar Dec 09 '21 08:12 vainu-jussi