django-watson

Support async updating of search index

Open valentijnscholten opened this issue 4 years ago • 1 comment

I'm using watson in a Django app. One of its most important features is importing files and turning them into database rows, i.e. Django ORM model instances.

Using bulk_create with Django is problematic, especially with MySQL, because the IDs of the created objects are not known afterwards. So I am thinking about other ways to make the import faster, and one would be to make the watson search index updates asynchronous. A related issue is that some model instances are updated (saved) multiple times within one transaction, which triggers multiple watson updates for the same instance.

My thoughts so far:

  • Make the post_save signal optional and let the Django app itself update the index in whatever way suits it best, e.g. from a Celery task the app already uses. This would need a documented, supported way to update the index for one or more model instances. It would allow deduplication of updates and could be asynchronous. Something similar could be achieved by wrapping the code in the skip_index_update decorator.

  • Then I found the (undocumented?) SearchContextMiddleware, which already seems to deduplicate model updates within the same request and to batch the index updates together at the end of the request. This achieves deduplication, but it is not asynchronous.
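To make the deduplication idea in the second bullet concrete, here is a minimal, framework-free sketch. It is not watson's code: `IndexBatch`, `mark_dirty`, `flush`, and `batched_updates` are hypothetical names, and the "index" is just a list that records what was written. It only illustrates how several saves of the same instance inside one batch can collapse into a single index update at the end.

```python
from contextlib import contextmanager


class IndexBatch:
    """Toy stand-in for a per-request search context (hypothetical, not watson's API)."""

    def __init__(self):
        self._dirty = {}    # (model_name, pk) -> None; a dict keeps insertion order
        self.flushed = []   # records every index write, for inspection

    def mark_dirty(self, model_name, pk):
        # Called on every save; repeated saves of the same row collapse to one key.
        self._dirty[(model_name, pk)] = None

    def flush(self):
        # One index write per distinct row, however often it was saved.
        self.flushed.extend(self._dirty)
        self._dirty.clear()


@contextmanager
def batched_updates(batch):
    try:
        yield batch
    except Exception:
        batch._dirty.clear()  # discard pending updates if the block fails
        raise
    else:
        batch.flush()


batch = IndexBatch()
with batched_updates(batch) as b:
    b.mark_dirty("item", 1)
    b.mark_dirty("item", 2)
    b.mark_dirty("item", 1)  # second save of the same row

print(batch.flushed)
```

Three saves produce two index writes, because the pending set is keyed by model and primary key rather than by save event.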

What possible solutions could be implemented?

Could django-watson support this scenario directly? Or would it make more sense for a Django app to subclass the middleware and wrap the search_context_manager.end() call in a Celery task?
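A rough sketch of what the "hand the batch to a task" option could look like, with the standard library's `ThreadPoolExecutor` standing in for a Celery task so the example runs without a broker. `reindex` and `end_and_dispatch` are hypothetical helpers and the dictionaries are toy stand-ins for the database and the search index; nothing here is watson's API.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "database" and "search index"; a real app would use the ORM and watson.
ROWS = {1: "alpha", 2: "beta"}
INDEX = {}


def reindex(pks):
    """Background job (stand-in for a Celery task): rebuild index entries by pk."""
    for pk in pks:
        INDEX[pk] = ROWS[pk].upper()
    return len(pks)


def end_and_dispatch(executor, saved_pks):
    """Deduplicate the pks saved during a request and hand them to the worker.

    Only primary keys cross the boundary, so the job payload stays small and
    serialisable, and the worker re-reads the current state of each row.
    """
    unique_pks = sorted(set(saved_pks))
    return executor.submit(reindex, unique_pks)


with ThreadPoolExecutor(max_workers=1) as executor:
    future = end_and_dispatch(executor, [1, 2, 1, 1])
    indexed = future.result()  # a web request would not normally wait on this

print(indexed, INDEX)
```

In a Django app the dispatch would usually be deferred until the surrounding transaction has committed (for example with `transaction.on_commit`), otherwise the worker may look up rows that are not visible to it yet.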

Just thinking out loud here and maybe helping others trying to achieve the same.

valentijnscholten avatar Mar 15 '20 14:03 valentijnscholten

It's an interesting idea. However, async updates feel a bit niche, and there are so many possible frameworks to choose from that built-in support would probably be little-used.

I wonder if there's much performance advantage to performing async index updates. Given it's all in the same DB, batching it all in the same transaction using SearchContextMiddleware is probably going to be close to optimal for most cases. If you need async updates, it's probably better to save the primary models AND the watson models together in the background task.
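As an illustration of that last suggestion, here is a small sketch in which one background job writes both the primary rows and their search entries inside a single transaction. It uses `sqlite3` and a thread pool purely as stand-ins for the app's database and task queue; the `item` and `search_entry` tables and the `import_file` function are hypothetical and unrelated to watson's actual schema.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def import_file(db_path, rows):
    """Background job: write rows and their search entries in one transaction."""
    conn = sqlite3.connect(db_path)  # opened inside the worker thread
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS item (id INTEGER PRIMARY KEY, title TEXT)")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS search_entry (item_id INTEGER PRIMARY KEY, content TEXT)"
        )
        latest = dict(rows)  # repeated saves of the same id: the last one wins
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT OR REPLACE INTO item VALUES (?, ?)", list(latest.items())
            )
            conn.executemany(
                "INSERT OR REPLACE INTO search_entry VALUES (?, ?)",
                [(pk, title.lower()) for pk, title in latest.items()],
            )
        return conn.execute("SELECT COUNT(*) FROM search_entry").fetchone()[0]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.sqlite3")
    with ThreadPoolExecutor(max_workers=1) as executor:
        job = executor.submit(import_file, db_path, [(1, "Foo"), (2, "Bar"), (1, "Foo v2")])
        entry_count = job.result()

print(entry_count)
```

Because the rows and the index entries commit together, a reader never sees an imported row without its search entry, and the request that queued the import does not pay for either write.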

etianen avatar Mar 23 '20 10:03 etianen