Performance issue in 0.21+ with count()
I am currently on 0.20.0 for mongoengine.
I have a route /healthcheck which does something like Foo.object.count() to make sure mongo connection is valid.
When I try upgrading mongoengine to 0.21.0 my healthcheck started timing out. As a result, my app never looked healthy so docker killed it. I waited a couple months and noticed there was a new version.
With 0.22.x, I am still having the same issue.
I don't know how to go about debugging this and it is hard to reproduce locally because I only have 1% of production data. It seems like it just hangs.
Has this been issue for anybody else?
Could you run the same query with mongo/pymongo? There was a change related with .count in latest releases due to a deprecation in pymongo and performance degradation were to expect unfortunately.
Should I upgrade mongoengine first? Or just run with pymongo which is currently frozen at pymongo==3.11.3?
Hmm so interesting...
With mongo, it works as expected and no error messages.
With pymongo, I see
<ipython-input-19-8a96c6b45f79>:1: DeprecationWarning: count is deprecated. Use estimated_document_count or count_documents instead. Please note that $where must be replaced by $expr, $near must be replaced by $geoWithin with $center, and $nearSphere must be replaced by $geoWithin with $centerSphere
db.player.count()
I tried estimated_document_count() and it also times out after 30 seconds.
Is there a new way to get count() that I am not aware of?
I have been looking through documentation and can't figure out how to call estimated_document_count(). Is it possible with mongoengine?
I was able to get estimated document to work with Foo._get_collection().estimated_document_count().
I don't think Mongoengine has a way to query using estimated document count.
It would be valuable to add support for estimated_count. Since its an estimated count, it could not be the default behavior of .count() (which in current version, relies on count_documents)
My $0.02 is:
- For queries like,
Foo.objects.count()should useestimated_document_count() - For queries like,
Foo.objects(some_value=...).count()should usedocument_count()
I think that makes most sense.
Agreed with @amir20
Add support for new pymongo cursor operations document_count() and estimated_document_count().
count() should have deprication warning and use estimated_document_count() so that the functionality doens't change suddenly like it it did for us and everyone when upgrading. Impact is pretty bad if you have big collections.
So... I don't really get it, should we now directly use pyMongo's db.collection.count_documents(query) instead of mongoengine Collection.objects(query).count()?
Our performance has gone pretty bad since that update and we have so many count() in our code.
Thanks in advance.
EDIT: ok I got it now, mongoengine now uses pyMongo's count_documents(), and it's too slow... Maybe we will downgrade the version of mongoengine or directly use pyMongo's deprecated count()...
Feel free to help and push a MR, amir's suggestion sounds good
We have modified the way we count things. Main problem was when there was no query (filter) at all.
So now we just switch from count() to estimated_documents_count() when it's necessary (we dynamically populate our queries so we never use the notation Foo.objects.count()) and it seems there is no performance problems anymore
Alright, I've started working on this. It's actually pretty simple to change the current count() behavior (which relies on count_documents only) and make it use count_documents/estimated_count_documents based on whether or not there is a query.
There is additional rules that are required as the 2 methods don't support the same parameters (e.g estimated_count_documents supports hint, but not collation unlike count_documents) but we could get something working with simple tweaks.
Out of curiosity, I looked at how Mongoose was dealing with this and they took another direction, they are actually exposing the 3 methods on their Model API:
I believe this is the most flexible solution and as it provides the ability to users to use or or another.
Keeping that in mind, what I have in mind is:
- have
Model.objects.count()usingcount_documentsorestimated_count_documentsbased on a few rules of thumb. This will fit most use cases, but not necessarily all of them - later (and based on user feedback), add
count_documents()andestimated_count_documents()so that users will have a way to force one or the other
Any progress here? Just the getting support for the new count_documents() and estimated_count_documents() would be a great start. This would make it possible to fix it without reverting to old versions.
There seems to MR waiting for review this since the summer at https://github.com/MongoEngine/mongoengine/pull/2529 from @bagerard
Any change of getting it through?