percolate query : add 'document_index' option

Open fbaligand opened this issue 6 years ago • 9 comments

Since version 6.0, an index can have only one type. For the percolate use case, that's quite annoying, because the 'query' mapping and the 'document' mapping are different (they are clearly not the same thing).

That's why it would be great to add an option 'document_index' so that we can search queries in one index, and indicate that the submitted document matches another index mapping.

With this option, we could execute such a query :

GET /queries-index/_search
{
    "query" : {
        "percolate" : {
            "field" : "query",
            "document" : {
                "message" : "my document content"
            },
            "document_index": "documents-index"
        }
    }
}

fbaligand avatar May 14 '18 14:05 fbaligand

Pinging @elastic/es-search-aggs

elasticmachine avatar May 14 '18 17:05 elasticmachine

Your proposal cannot be implemented easily, because we need to have mappings available locally which is only possible for the queried index. That said we understand that it's a bit annoying to have to put mappings of documents and queries in the same mappings and will try to brainstorm about what can be done.

jpountz avatar May 18 '18 13:05 jpountz

Nice to see you will try to brainstorm on it !

In my view, it has another advantage: When you reference an existing doc (with its id), you can reference a doc in another index. And like that, you can store queries in one index, and documents in another one. That would be powerful and useful, given that index size and settings tuning (shard count, routing, ...) could be quite different between the two.

fbaligand avatar May 18 '18 16:05 fbaligand

We have been thinking about this. To avoid requiring the document fields and the query fields to sit in the same mapping, we think it would be better to define the document fields inside the percolator field type. That makes it clearer that these fields are needed only by the percolator, to index the documents being percolated at query time.

PUT /my-query-index
{
  "mappings": {
    "_doc" : {
      "properties": {
        "query" : {
          "type": "percolator",
          "document_properties": {
            "body": {
              "type": "text"
            },
            "subject": {
              "type": "text"
            },
            "sent_at": {
              "type": "date"
            }
          }
        },
        "username": {
          "type": "keyword"
        }
      }
    }
  }
}

In this case body, subject and sent_at are the document fields and username is an extra query field. These document fields can only be used by the percolator; regular documents cannot use them for indexing. I think this makes it clearer which fields are document fields and what they can be used for. @fbaligand What do you think?

> When you reference an existing doc (with its id), you can reference a doc in another index. And like that, you can store queries in one index, and documents in another one.

The index, type and id that are specified are used internally to create a get request that fetches the document to be percolated. The fields of that document are still required in the query mapping, alongside the percolator field that stores the percolator query.
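
As a rough illustration of the behaviour described above, here is a minimal Python sketch of the request bodies involved. The index name, the document id and the `message` field are made up for the example, and the bodies use the typeless form of later Elasticsearch versions (in 6.x the mapping is nested under the type name, as elsewhere in this thread).

```python
# Hypothetical names throughout: "documents-index", the id "1" and the
# "message" field are illustrative only.

def queries_index_mapping():
    """Mapping the queries index needs today: the percolator field and the
    fields of the percolated document sit side by side."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},
                # Still required here, even though the document itself
                # is stored in another index.
                "message": {"type": "text"},
            }
        }
    }


def percolate_existing_document(doc_index, doc_id, field="query"):
    """Search body that percolates a document fetched from another index.

    "index" and "id" only tell the percolator where to get the document;
    they do not pull in that index's mapping.
    """
    return {
        "query": {
            "percolate": {
                "field": field,
                "index": doc_index,
                "id": doc_id,
            }
        }
    }
```

The search body would be sent to the queries index, whose mapping is the one used to interpret the fetched document.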

martijnvg avatar May 31 '18 05:05 martijnvg

Hi @martijnvg,

Well, first, thank you very much for having thought about the design to provide such a feature. Then, it is an interesting way to separate "query" fields from "percolated document" fields.

That said, I see some potential limitations to this approach :

  • if I use a percolate query to categorize a not yet indexed document, and then I index the enriched document (with category info) in another index, then I have to duplicate "percolated document" mapping into "the-enriched-documents-index" and "document_properties" into "queries" index. Same duplication problem for custom analyzers. I find it very useful to index documents and queries in separate indices, because they do not have the same allocation needs (few queries, lots of documents), which often implies a different shard count and potentially different routing. BTW, this is a concern I have at my company.
  • Then, I'm worried about what this "new internal mapping" implies. Does it imply copy/pasting code from a classical index mapping? Does it have all the features a classical index mapping has, or is it limited? In the future, if a new feature is implemented in mappings, would it be available for this "new percolate internal mapping"? What about custom analyzers? Could a "document_property" use a custom analyzer defined globally in the "queries" index?

If I understand correctly, the "percolated document" mapping is required at "queries" indexing time, right? Rather than defining a "document_properties" option inside the "percolator" field, what do you think about a "document_index" option? It would allow referencing an external mapping defined in another index.

I am just trying to consider all the use cases, the potential limitations, and the alternatives. Maybe my thoughts are wrong or not relevant :) Tell me :)

fbaligand avatar Jun 02 '18 16:06 fbaligand

Thanks for responding @fbaligand.

The goal of this idea is to reduce the confusion about which field mappings belong to the query document (the percolator field and other metadata fields about the query) and which field mappings belong to the document being percolated (both the fields inside this document and the fields used inside the percolator query, in the match, range and other queries). This, I think, will improve the general understanding of the percolator.

> if I use a percolate query to categorize a not yet indexed document, and then I index the enriched document (with category info) in another index, then I have to duplicate "percolated document" mapping into "the-enriched-documents-index" and "document_properties" into "queries" index. Same duplication problem for custom analyzers.

The duplication that you have to do today remains with this new approach of defining the mappings for the documents that are being percolated.
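
Until something like this exists, one way to keep that duplication manageable is to generate both index bodies from a single definition on the client side. The following is only a sketch under that assumption: the field names are borrowed from the example mapping earlier in this thread, the function names are invented, and the bodies use the typeless mapping form of later Elasticsearch versions.

```python
import copy

# Single source of truth for the fields of the percolated documents.
DOCUMENT_PROPERTIES = {
    "body": {"type": "text"},
    "subject": {"type": "text"},
    "sent_at": {"type": "date"},
}


def documents_index_body(extra_properties=None):
    """Create-index body for the index that stores the (enriched) documents."""
    properties = copy.deepcopy(DOCUMENT_PROPERTIES)
    properties.update(extra_properties or {})
    return {"mappings": {"properties": properties}}


def queries_index_body(percolator_field="query", query_metadata=None):
    """Create-index body for the queries index as required today: the same
    document fields, plus the percolator field and any query metadata."""
    properties = copy.deepcopy(DOCUMENT_PROPERTIES)
    properties.update(query_metadata or {})
    properties[percolator_field] = {"type": "percolator"}
    return {"mappings": {"properties": properties}}
```

Custom analyzers would need the same treatment, since their settings also have to be present in both indices.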

> Then, I'm worried about what this "new internal mapping" implies. Does it imply copy/pasting code from a classical index mapping? Does it have all the features a classical index mapping has, or is it limited?

No, it is not limited. All mappings that can be defined for regular documents can also be defined under the document_properties inside the percolator mapping type.

> What about custom analyzers? Could a "document_property" use a custom analyzer defined globally in the "queries" index?

Yes, field types defined in document_properties can also use custom analyzers defined in the queries index.

> If I understand correctly, the "percolated document" mapping is required at "queries" indexing time, right?

Yes, the fields used inside the percolator queries need to exist at the time the percolator queries are indexed, but this is also required today.

> Rather than defining a "document_properties" option inside the "percolator" field, what do you think about a "document_index" option?

Ok, so instead of defining the document_properties yourself and keeping them up to date, the percolator field mapper would use the mapping and analysis settings from the defined document_index. I think that is a good idea, and it builds on top of the document_properties idea I shared previously. However, that is currently a bit harder to implement, because at the field mapper level inside ES there is no reference to other indices or their mappings. I think this is something that can be implemented at some point in the future, but document_properties should be implemented first, and then document_index can be built on top of that.

Also, I think document_index should be named differently, as it could be confused with the current index option: index just fetches a document to percolate from a different index, whereas document_index would just fetch the mapping / analysis settings.
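
To make the layering concrete, here is a purely illustrative Python sketch over plain dictionaries. Neither document_properties nor document_index exists in Elasticsearch, and this is not how the field mapper would be implemented; the function name and the shape of `index_mappings` are assumptions made for the example.

```python
import copy


def resolve_document_properties(percolator_field, index_mappings):
    """Return the document field mappings a percolator field would use.

    percolator_field: the (hypothetical) mapping of the percolator field.
    index_mappings:   hypothetical lookup of index name -> {"properties": ...},
                      standing in for access to other indices' mappings.
    """
    # Explicitly declared document fields take precedence.
    if "document_properties" in percolator_field:
        return copy.deepcopy(percolator_field["document_properties"])

    # Otherwise fall back to the mapping of the referenced index, if any.
    source_index = percolator_field.get("document_index")
    if source_index is None:
        return {}
    return copy.deepcopy(index_mappings[source_index]["properties"])
```

In this sketch document_index is just another way of producing the same document_properties structure, which is why the latter would need to come first.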

martijnvg avatar Jun 04 '18 08:06 martijnvg

Has the proposed "document_properties" property been rejected, or has a different solution been implemented in the meantime? (It was a very good idea IMHO.)

tarzasai avatar Sep 15 '22 15:09 tarzasai

I believe we made no progress on this, right @martijnvg ?

javanna avatar Sep 19 '22 19:09 javanna

> I believe we made no progress on this, right @martijnvg ?

Yes, no progress has been made.

martijnvg avatar Sep 20 '22 05:09 martijnvg