Improve usage of bulk with Document

Open dsanders11 opened this issue May 17 '19 • 7 comments

I know this has been brought up before (#843), but I think it could use a revisit. The suggested approach of using to_dict is becoming less adequate as more features are added to Document.update, Document.delete, and Document.save. Using bulk with to_dict requires re-implementing things like optimistic concurrency, since that's a feature layered on top of the result from to_dict.

Perhaps an API could be created that pulls the logic out of Document.update, Document.delete, and Document.save so that it can be reused for bulk?

Just brainstorming, but perhaps something like a new to_action method?

from elasticsearch_dsl import Document

class Article(Document):
    def to_action(self, action, using=None, index=None, **kwargs):
        """
        Create action for document. For keyword arguments for each action,
        see the ``update``, ``delete``, and ``save`` methods.

        :arg action: Action to create, one of 'index', 'create', 'delete' or 'update'
        :arg index: elasticsearch index to use, if the ``Document`` is
            associated with an index this can be omitted.
        :arg using: connection alias to use, defaults to ``'default'``
        """

helpers.bulk(es, (article.to_action('index') for article in articles))

Another option could be to have different methods for each action, which would allow the different arguments for each action to be more clearly broken out.
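For example (to_index_action and to_update_action are hypothetical names, each mirroring the keyword arguments of the corresponding method):

helpers.bulk(es, (article.to_index_action() for article in articles))
helpers.bulk(es, (article.to_update_action(retry_on_conflict=3) for article in articles))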

Yet another option would be to create a kind of ActionBuffer class which could be provided to the individual Document.update, Document.delete, and Document.save methods and then retrieved with ActionBuffer.get_actions(). However, now that these methods return values, this style of API wouldn't work as well since the return value would have to be something like 'pending'.
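As a rough sketch of that idea (ActionBuffer and the action_buffer keyword are hypothetical, and es/articles are as in the example above):

from elasticsearch import helpers

class ActionBuffer:
    """Hypothetical buffer that collects actions instead of executing them."""
    def __init__(self):
        self._actions = []

    def add(self, action):
        self._actions.append(action)

    def get_actions(self):
        return self._actions

buffer = ActionBuffer()
for article in articles:
    # hypothetical kwarg, not supported by the current save(): it would
    # append an 'index' action to the buffer and return 'pending' instead
    # of talking to Elasticsearch
    article.save(action_buffer=buffer)

helpers.bulk(es, buffer.get_actions())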

@HonzaKral, any thoughts on an API to make it easier to use Document with bulk?

dsanders11 avatar May 17 '19 19:05 dsanders11

Thanks for raising this issue, @dsanders11!

I was recently wondering about the same thing - how to make bulk usage easier - but didn't go as far as you in the design. My idea was to create a classmethod bulk that would properly index all Document instances passed to it using the bulk helper.

I am not sure which is better (or if we can do both) - I like the simplicity of the .bulk(docs, using=None, index=None) approach as it is much easier to use and requires less knowledge about the internals. Your solution is in turn far more flexible, but doesn't differ much from what is already possible by calling to_dict(True) (delete would only require dropping _source from the dict; update is of course harder).
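A minimal sketch of that classmethod (a sketch only, assuming the internal helpers _get_connection and _default_index behave as in current elasticsearch-dsl):

from elasticsearch.helpers import bulk as bulk_helper
from elasticsearch_dsl import Document

class BulkDocument(Document):
    @classmethod
    def bulk(cls, docs, using=None, index=None, **kwargs):
        # serialize every instance and hand the whole batch to the bulk helper
        actions = (doc.to_dict(include_meta=True) for doc in docs)
        return bulk_helper(
            cls._get_connection(using),
            actions,
            index=cls._default_index(index),
            **kwargs,
        )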

What do you think?

btw I slightly edited the code in your example to make it a generator expression instead of a list comprehension, to avoid loading all documents into memory. Just in case someone copy-pastes the example, which is really good and useful, as-is.

honzakral avatar May 20 '19 18:05 honzakral

btw I slightly edited the code in your example to make it a generator expression instead of a list comprehension, to avoid loading all documents into memory.

👍

I am not sure which is better (or if we can do both) - I like the simplicity of the .bulk(docs, using=None, index=None) approach as it is much easier to use and requires less knowledge about the internals.

This is true, although I'd name it something more like bulk_index since just bulk is a bit ambiguous if it's only doing indexing.

Your solution is in turn far more flexible, but doesn't differ much from what is already possible by calling to_dict(True)

It does differ in some important ways, IMO. First, it would automatically get support for any new functionality such as optimistic concurrency, without users needing to write that code themselves. Second, it requires less internal knowledge about elasticsearch-dsl-py. To currently replicate the behavior of Document.save using just Document.to_dict(True) you need to also handle the meta fields, and now optimistic concurrency as well.
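For illustration, replicating just the save() behavior by hand looks roughly like this today (a sketch, not exhaustive):

def index_action(doc):
    # to_dict(include_meta=True) supplies _index/_id/_source, but the
    # optimistic-concurrency meta has to be layered on by the caller
    action = doc.to_dict(include_meta=True)
    if getattr(doc.meta, "seq_no", None) is not None:
        action["if_seq_no"] = doc.meta.seq_no
        action["if_primary_term"] = doc.meta.primary_term
    return action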

My use-case is a library which allows for easily indexing Django models. It's a spin-off of the unmaintained django-elasticsearch-dsl. I'm trying to decouple the library as much as possible from elasticsearch-dsl-py, as the tight coupling in the original project led to its maintenance issues - you can't easily use new versions of elasticsearch-dsl-py as that project chose to use inheritance, making it fragile when changes to elasticsearch-dsl-py happen. My spin-off project opts to wrap instead, creating a DjangoModelDocument class which internally builds a Document and wraps the important methods, providing less coupling since it only touches the external API of Document.

As such I'm wary of needing to use Document.to_dict(True) and write the code handling meta fields, optimistic concurrency, etc. It would make a library that needs little to no maintenance require more maintenance to keep that code up-to-date, when it's essentially duplicating code that already exists in elasticsearch-dsl-py.

dsanders11 avatar May 20 '19 21:05 dsanders11

To currently replicate the behavior of Document.save using just Document.to_dict(True) you need to also handle the meta fields, and now optimistic concurrency as well.

Actually, to_dict(True) should include the optimistic concurrency control - that is a very good point, as that is its purpose. Thanks for bringing that up; it completely slipped my mind when adding the new code to save(). Would that help?

you can't easily use new versions of elasticsearch-dsl-py as that project chose to use inheritance, making it fragile when changes to elasticsearch-dsl-py happen.

There was a big refactoring that happened because of the doc type removal and the move to indices; it should definitely be more stable now, as there are no outstanding issues.

honzakral avatar May 20 '19 22:05 honzakral

Actually, to_dict(True) should include the optimistic concurrency control - that is a very good point, as that is its purpose

If more stuff got folded into to_dict(include_meta=True) then I think that would address most of my concern. Then Document.save would be more of a thin wrapper around to_dict. Merging optimistic concurrency control into to_dict(include_meta=True) may be a little tricky though, since es.index expects the meta info to be outside of body. Perhaps to_dict could be extended (so as not to break the current API) to have to_dict(include_meta=True, separate_meta=True) which would have it return body, meta instead of them combined into a single dict. Then internally Document.save would use separate_meta=True, and for use of the bulk API it would still be a single dict.
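Hypothetically (neither separate_meta nor the tuple return exists today), that could look like:

# proposed API, not an existing one
body, meta = article.to_dict(include_meta=True, separate_meta=True)
es.index(body=body, **meta)  # save() path: meta as top-level arguments

action = article.to_dict(include_meta=True)  # bulk path: one combined dict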

dsanders11 avatar May 20 '19 22:05 dsanders11

Not completely related, but I think it would save people some research if there were a bulk helper in elasticsearch-dsl itself to handle Document instances, just so we don't need to import the low-level library just for this. Maybe something like this in a helpers file would be useful for more people?

from elasticsearch.helpers import bulk as _bulk

def bulk(client, actions, stats_only=False, *args, **kwargs):
    include_meta = kwargs.pop("include_meta", False)
    skip_empty = kwargs.pop("skip_empty", True)
    # serialize the Document instances into bulk actions before delegating
    actions = (i.to_dict(include_meta=include_meta, skip_empty=skip_empty) for i in actions)
    return _bulk(client, actions, stats_only, *args, **kwargs)
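Usage would then stay at the Document level, something like:

from elasticsearch import Elasticsearch

es = Elasticsearch()
# articles is any iterable of Document instances
bulk(es, articles, include_meta=True)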

Thanks.

dpasqualin avatar Dec 10 '19 13:12 dpasqualin

I'm sorry to bump this again, but what is the current best way to add many documents to an index? I can do it as in #403, but I'm unsure if that's very good, since someone mentioned some sort of helpers that might be missing. If there isn't a better way, what do I need to watch out for when doing this? Right now I'm just looping over all the data I want to add and saving each document separately, but this ends up being pretty slow (a good 1-2s to upload around 14kb), and I can't easily set refresh to true, as that would slow things down even more, meaning I might need to add time.sleep() calls all over, and that's just no good. Any help would be appreciated!
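For reference, the bulk pattern discussed earlier in the thread looks roughly like this sketch (docs being any iterable of Document instances):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
# one bulk request instead of a save() round-trip per document;
# refresh=True makes the documents searchable immediately
helpers.bulk(es, (doc.to_dict(include_meta=True) for doc in docs), refresh=True)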

Cobular avatar Sep 10 '20 21:09 Cobular

Another issue when dealing with bulk actions is when the documents to bulk are not associated with an index, or are associated with a wildcard index. We cannot inject the index as in save or update.

We should be able to override the index in the to_dict method. I had to do something like this:

import elasticsearch.helpers

def gendata():
    for doc in documents:
        doc.meta.index = write_index_alias
        yield doc.to_dict(include_meta=True)

elasticsearch.helpers.bulk(get_elasticsearch_client(), gendata())

instead of something like:

def gendata():
    for doc in documents:
        yield doc.to_dict(index=write_index_alias, include_meta=True)

elasticsearch.helpers.bulk(get_elasticsearch_client(), gendata())

omoumniabdou avatar Mar 31 '22 13:03 omoumniabdou