
_mtermvectors returns different response in consecutive calls

Open kaplansin opened this issue 2 years ago • 8 comments

The _mtermvectors request returns different responses on consecutive calls: sometimes the doc_freq value is missing and the term statistics differ. This causes a fuzzy search to be performed instead of an exact search.

$ curl -XPOST 'http://127.0.0.1:9200/_mtermvectors?pretty=true' -d '{"docs":[{"_index":"<elastic_index_name>","_type":"_doc","term_statistics":true,"fields":["spelling","spelling.whitespace"],"doc":{"spelling":"4195218"}}]}' -H "Content-Type: application/json"

Example response with missing doc_freq:

{
  "docs" : [
    {
      "_index" : "<elastic_index_name>",
      "_type" : "_doc",
      "_version" : 0,
      "found" : true,
      "took" : 0,
      "term_vectors" : {
        "spelling" : {
          "field_statistics" : {
            "sum_doc_freq" : 139292,
            "doc_count" : 14320,
            "sum_ttf" : 157372
          },
          "terms" : {
            "4195218" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 7
                }
              ]
            }
          }
        },
        "spelling.whitespace" : {
          "field_statistics" : {
            "sum_doc_freq" : 139312,
            "doc_count" : 14320,
            "sum_ttf" : 157372
          },
          "terms" : {
            "4195218" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 7
                }
              ]
            }
          }
        }
      }
    }
  ]
}

Example Response with doc_freq

{
  "docs" : [
    {
      "_index" : "elastic_index_name",
      "_type" : "_doc",
      "_version" : 0,
      "found" : true,
      "took" : 0,
      "term_vectors" : {
        "spelling" : {
          "field_statistics" : {
            "sum_doc_freq" : 140387,
            "doc_count" : 14419,
            "sum_ttf" : 158358
          },
          "terms" : {
            "4195218" : {
              "doc_freq" : 1,
              "ttf" : 1,
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 7
                }
              ]
            }
          }
        },
        "spelling.whitespace" : {
          "field_statistics" : {
            "sum_doc_freq" : 140424,
            "doc_count" : 14419,
            "sum_ttf" : 158358
          },
          "terms" : {
            "4195218" : {
              "doc_freq" : 1,
              "ttf" : 1,
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 7
                }
              ]
            }
          }
        }
      }
    }
  ]
}

Preconditions

Elasticsearch version: 7.10.2 with 3 nodes

Magento version: Enterprise

ElasticSuite version: 2.10.10

Environment: Production

Third party modules :

Steps to reproduce

Expected result

  1. Response should be same

Actual result

  1. [Screenshot, logs]

kaplansin avatar Jul 28 '22 09:07 kaplansin

@kaplansin how many shards do you have?

The term vectors API only computes statistics on a single shard, which is why the results can differ from one call to the next.

See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-termvectors.html#docs-termvectors-api-behavior

That's particularly visible when searching for "unique" strings like SKUs: a given SKU will exist in one shard, but not in another.
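The effect can be illustrated with a toy model. This is only a sketch: the shard contents and the `term_statistics` helper are made up for illustration and are not Elasticsearch code. Only the SKU comes from the report above.

```python
# Toy model: each shard keeps its own term -> document-frequency table.
# The shard contents below are invented for illustration.
shards = [
    {"4195218": 1, "shirt": 40},  # shard 0 holds the document with this SKU
    {"shirt": 35},                # shard 1 has never indexed the SKU
    {"shirt": 38},                # shard 2 has never indexed the SKU
]


def term_statistics(term, shard_id):
    """Return the statistics a single shard can report for a term."""
    # term_freq comes from the submitted document itself, so it is always known.
    stats = {"term_freq": 1}
    doc_freq = shards[shard_id].get(term)
    if doc_freq is not None:
        # Only a shard that has indexed the term can report doc_freq.
        stats["doc_freq"] = doc_freq
    return stats


# The same request answered by three different shards.
for shard_id in range(len(shards)):
    print(shard_id, term_statistics("4195218", shard_id))
```

Only shard 0 reports `doc_freq` for the SKU, which mirrors the two kinds of responses shown in the report.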

That being said, you probably don't need more than one shard, unless your index size is several GB.

Let me know how many shards you actually have and we can continue this discussion. In any case, that's a very good catch: not many people have dug into the termVectors shenanigans of Elasticsuite and come out with a proper understanding of them.

romainruaud avatar Jul 28 '22 10:07 romainruaud

Hi @romainruaud thanks for the quick response,

curl http://localhost:9200/_cluster/health?pretty

{
  "cluster_name" : "elasticsearch_stg",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 30,
  "active_shards" : 72,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 21,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 77.41935483870968
}

And our Magento settings:

Screenshot from 2022-07-28 13-18-01

And here are our index sizes: Screenshot from 2022-07-28 13-20-49

kaplansin avatar Jul 28 '22 10:07 kaplansin

OK, 6 GB could fit well in one shard.

I guess you are on Adobe Commerce Cloud so you probably cannot change this yourself.

Ask the Adobe team to use these parameters:

Number of shards: 1
Number of replicas: 2

You have 3 nodes, so you only need 2 replicas:

  • node 1 will have the shard
  • node 2 will have a replica (and it will be used for reading)
  • node 3 will also have a replica (and it will also be used for reading).
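The arithmetic behind this layout can be sketched as follows, assuming the usual Elasticsearch rule that two copies of the same shard are never allocated to the same node. The `unassigned_copies` function is a hypothetical helper written for this example, not part of any library.

```python
def unassigned_copies(nodes, primaries, replicas):
    """Count shard copies that cannot be placed on the cluster.

    Assumes each node can hold at most one copy (primary or replica)
    of any given shard.
    """
    copies_per_shard = 1 + replicas  # the primary plus its replicas
    surplus_per_shard = max(0, copies_per_shard - nodes)
    return primaries * surplus_per_shard


# 3 nodes, 1 shard, 2 replicas: every copy finds a node.
print(unassigned_copies(nodes=3, primaries=1, replicas=2))  # 0

# 3 nodes, 1 shard, 3 replicas: one replica has nowhere to go.
print(unassigned_copies(nodes=3, primaries=1, replicas=3))  # 1
```

With 3 nodes, 2 replicas is the largest value that leaves nothing unassigned.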

Let me know if this improves your situation. We have faced this many times, and reducing to one shard should be OK.

Regards

romainruaud avatar Jul 28 '22 11:07 romainruaud

I was about to reply the same thing as @romainruaud. It should also help with your many "unassigned shards" and the yellow status of your cluster: you were asking for more replicas (and hence shards) than your cluster could handle without having duplicates on the same node.
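As a rough cross-check, the figures from the cluster health output posted earlier in this thread can be put together like this. The `summarize` helper is hypothetical. It ignores initializing and relocating shards, which were both 0 in that output.

```python
# Values copied from the _cluster/health output posted in this thread.
health = {
    "active_primary_shards": 30,
    "active_shards": 72,
    "unassigned_shards": 21,
}


def summarize(health):
    """Derive a few totals from a cluster health response."""
    total = health["active_shards"] + health["unassigned_shards"]
    active_replicas = health["active_shards"] - health["active_primary_shards"]
    return {
        "total_shard_copies": total,
        "active_replicas": active_replicas,
        "active_percent": 100.0 * health["active_shards"] / total,
    }


print(summarize(health))
```

Here 72 of 93 shard copies are active, which is consistent with the reported `active_shards_percent_as_number` of about 77.4. A yellow status means all primaries are assigned but some replicas are not.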

rbayet avatar Jul 28 '22 11:07 rbayet

What about using the preference parameter? We are trying to execute our query on a single node while selecting all shards, but the result is the same. It seems _mtermvectors is ignoring our preference query parameter.

$ curl -XPOST 'http://127.0.0.1:9200/_mtermvectors?pretty=true&preference=_shards:0,1,2|_only_node:i-0bab91ad6fd375394' -d '{"docs": [{"_index": "<elastic_index_name>","_type": "_doc","term_statistics": true,"fields": ["spelling","spelling.whitespace"],"doc": {"spelling": "4554185"}}]}' -H "Content-Type: application/json"

kaplansin avatar Jul 28 '22 14:07 kaplansin

I have never used the preference parameter; maybe your syntax is not correct?

In any case, you should at least change your number of replicas as described in @rbayet's answer (your cluster is yellow). I still suggest that you also switch the number of shards to 1.

Regards

romainruaud avatar Jul 28 '22 16:07 romainruaud

Hi @romainruaud, as described here, we can specify shards or nodes with the preference parameter: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/docs-multi-termvectors.html

But it is not working as expected; it always uses a random node/shard.

kaplansin avatar Jul 29 '22 07:07 kaplansin

Ok, as I said, we did not test this before, so maybe it's not working as intended internally with Elasticsearch.

In any case, this would not be doable "as is" with the current code base, it would require injecting the targeted shards in the _mtermvectors request.
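For reference, the term vectors documentation linked earlier in this thread states that, for artificial documents, the shard is chosen randomly by default and that `routing` can be used to hit a particular shard. The sketch below only builds such a request without sending it. The `build_mtermvectors_request` helper, the index name and the routing value are hypothetical. Whether a fixed routing value gives stable statistics was not tested in this thread, and this is not how Elasticsuite currently builds its requests.

```python
import json
from urllib.parse import urlencode


def build_mtermvectors_request(host, index, text, routing):
    """Build the URL and JSON body of an _mtermvectors call (nothing is sent)."""
    # "routing" is a documented query parameter of the multi term vectors API.
    params = urlencode({"pretty": "true", "routing": routing})
    url = f"{host}/_mtermvectors?{params}"
    body = {
        "docs": [
            {
                "_index": index,
                "_type": "_doc",
                "term_statistics": True,
                "fields": ["spelling", "spelling.whitespace"],
                "doc": {"spelling": text},
            }
        ]
    }
    return url, json.dumps(body)


# Hypothetical index name and routing value, for illustration only.
url, body = build_mtermvectors_request(
    "http://127.0.0.1:9200", "my_index", "4554185", "fixed-key"
)
print(url)
print(body)
```

Reusing the same routing value on every call should, in principle, send the request to the same shard each time, but the statistics would still describe only that shard.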

So for now, our suggestion remains the same: switch to 1 shard / 2 replicas.

Thank you

romainruaud avatar Jul 29 '22 11:07 romainruaud

This issue was waiting update from the author for too long. Without any update, we are unfortunately not sure how to resolve this issue. We are therefore reluctantly going to close this bug for now. Please don't hesitate to comment on the bug if you have any more information for us; we will reopen it right away! Thanks for your contribution.

github-actions[bot] avatar Aug 12 '22 13:08 github-actions[bot]