elasticsuite
_mtermvectors returns different response in consecutive calls
The _mtermvectors request returns different responses on consecutive calls: in some responses the doc_freq value is missing and the field statistics differ. As a result, ElasticSuite runs a fuzzy search instead of an exact search.
$ curl -XPOST 'http://127.0.0.1:9200/_mtermvectors?pretty=true' -d '{"docs":[{"_index":"<elastic_index_name>","_type":"_doc","term_statistics":true,"fields":["spelling","spelling.whitespace"],"doc":{"spelling":"4195218"}}]}' -H "Content-Type: application/json"
Example response with missing doc_freq:
{
"docs" : [
{
"_index" : "<elastic_index_name>",
"_type" : "_doc",
"_version" : 0,
"found" : true,
"took" : 0,
"term_vectors" : {
"spelling" : {
"field_statistics" : {
"sum_doc_freq" : 139292,
"doc_count" : 14320,
"sum_ttf" : 157372
},
"terms" : {
"4195218" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 7
}
]
}
}
},
"spelling.whitespace" : {
"field_statistics" : {
"sum_doc_freq" : 139312,
"doc_count" : 14320,
"sum_ttf" : 157372
},
"terms" : {
"4195218" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 7
}
]
}
}
}
}
}
]
}
Example response with doc_freq:
{
"docs" : [
{
"_index" : "elastic_index_name",
"_type" : "_doc",
"_version" : 0,
"found" : true,
"took" : 0,
"term_vectors" : {
"spelling" : {
"field_statistics" : {
"sum_doc_freq" : 140387,
"doc_count" : 14419,
"sum_ttf" : 158358
},
"terms" : {
"4195218" : {
"doc_freq" : 1,
"ttf" : 1,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 7
}
]
}
}
},
"spelling.whitespace" : {
"field_statistics" : {
"sum_doc_freq" : 140424,
"doc_count" : 14419,
"sum_ttf" : 158358
},
"terms" : {
"4195218" : {
"doc_freq" : 1,
"ttf" : 1,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 7
}
]
}
}
}
}
}
]
}
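To compare the two responses above programmatically, a small check along these lines could help. This is only a sketch: the function name `terms_missing_doc_freq` is made up for illustration, and it assumes the response has already been parsed into a Python dict with the same shape as the JSON shown above.

```python
from typing import Any, Dict, List, Tuple


def terms_missing_doc_freq(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (field, term) pairs whose term entry has no doc_freq key."""
    missing: List[Tuple[str, str]] = []
    for doc in response.get("docs", []):
        for field, vector in doc.get("term_vectors", {}).items():
            for term, stats in vector.get("terms", {}).items():
                if "doc_freq" not in stats:
                    missing.append((field, term))
    return missing


# Trimmed-down versions of the two responses from this report.
without_stats = {"docs": [{"term_vectors": {
    "spelling": {"terms": {"4195218": {"term_freq": 1}}},
    "spelling.whitespace": {"terms": {"4195218": {"term_freq": 1}}},
}}]}
with_stats = {"docs": [{"term_vectors": {
    "spelling": {"terms": {"4195218": {"doc_freq": 1, "ttf": 1, "term_freq": 1}}},
}}]}

missing_first = terms_missing_doc_freq(without_stats)
missing_second = terms_missing_doc_freq(with_stats)
```

Running the same request several times and logging the result of this check would show how often the term statistics come back empty.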
Preconditions
Elasticsearch version: 7.10.2 with 3 nodes
Magento version: Enterprise
ElasticSuite version: 2.10.10
Environment: Production
Third party modules:
Steps to reproduce
Expected result
- Response should be same
Actual result
- [Screenshot, logs]
@kaplansin how many shards do you have?
The term vectors API only computes term statistics on a single shard, so the results can differ from one call to the next.
see https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-termvectors.html#docs-termvectors-api-behavior
That's particularly visible when searching for "unique" strings like SKUs: a given SKU will exist in one shard, but not in another.
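The effect can be illustrated with a toy model. This is not how Elasticsearch routes documents internally (the `crc32` hash, the `product-N` ids and the helper names here are assumptions made for the sketch); it only shows that a term present in a single document has a document frequency of 1 on exactly one shard and 0 on all the others.

```python
import zlib
from typing import Dict, List


def shard_for(doc_id: str, num_shards: int) -> int:
    # Toy routing: a stand-in for Elasticsearch's own routing hash.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def per_shard_doc_freq(docs: Dict[str, str], term: str, num_shards: int) -> List[int]:
    """Count, shard by shard, how many documents contain the term."""
    freq = [0] * num_shards
    for doc_id, sku in docs.items():
        if sku == term:
            freq[shard_for(doc_id, num_shards)] += 1
    return freq


# 1000 hypothetical products, each with a unique SKU.
docs = {f"product-{i}": str(4195000 + i) for i in range(1000)}

three_shards = per_shard_doc_freq(docs, "4195218", num_shards=3)
one_shard = per_shard_doc_freq(docs, "4195218", num_shards=1)
```

With three shards, a request that lands on one of the two shards without the SKU sees no statistics for it, while with a single shard every request sees the same count.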
That being said, you probably don't need more than one shard, unless your index size is several GB.
Let me know how many shards you actually have and we'll be able to pursue this discussion. In any case, that's a very good catch: not many people have dug into the termVectors shenanigans of Elasticsuite and come out with a proper understanding of them.
Hi @romainruaud thanks for the quick response,
curl http://localhost:9200/_cluster/health?pretty
{
"cluster_name" : "elasticsearch_stg",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 30,
"active_shards" : 72,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 21,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 77.41935483870968
}
and our magento setting
And here our indexes' size
OK, 6 GB could fit well in one shard.
I guess you are on Adobe Commerce Cloud, so you probably cannot change this yourself.
Ask the Adobe team to use these parameters:
Number of shards: 1
Number of replicas: 2
You have 3 nodes, so you only need 2 replicas:
- node 1 will have the shard
- node 2 will have a replica (and it will be used for reading)
- node 3 will also have a replica (and it will also be used for reading).
Let me know if this improves your issue. This is something we have faced many times, and reducing to one shard should be OK.
Regards
I was about to reply the same thing as @romainruaud. It should help with the issue of your many "unassigned shards" and the yellow status of your cluster: you were asking for more replicas (and hence shards) than your cluster could handle without putting duplicates on the same node.
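The arithmetic behind this can be sketched as follows. The helper names are made up for illustration, and the model is simplified: it relies only on the fact that Elasticsearch does not place two copies of the same shard on one node, and it ignores initializing shards (which are 0 in the health output above).

```python
def max_assignable_replicas(data_nodes: int) -> int:
    # One node holds the primary; each replica needs a different node.
    return max(data_nodes - 1, 0)


def unassigned_copies(primaries: int, replicas: int, data_nodes: int) -> int:
    """Replica copies that cannot be placed on any node."""
    assignable = min(replicas, max_assignable_replicas(data_nodes))
    return primaries * (replicas - assignable)


def active_percent(active_shards: int, unassigned_shards: int) -> float:
    return 100 * active_shards / (active_shards + unassigned_shards)


# Suggested setup: 1 shard, 2 replicas, 3 data nodes.
suggested = unassigned_copies(primaries=1, replicas=2, data_nodes=3)

# Asking for one replica more than the nodes can hold leaves it unassigned.
too_many = unassigned_copies(primaries=1, replicas=3, data_nodes=3)

# Figures from the cluster health output: 72 active, 21 unassigned.
reported = active_percent(72, 21)
```

The last value reproduces the `active_shards_percent_as_number` reported by the cluster (about 77.4), which is consistent with the 21 unassigned shards being copies that have nowhere to go.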
What about using the preference parameter? We are trying to execute our query on a single node while selecting all shards, but the result is the same. It seems _mtermvectors ignores our preference query parameter.
$curl -XPOST 'http://127.0.0.1:9200/_mtermvectors?pretty=true&preference=_shards:0,1,2|_only_node:i-0bab91ad6fd375394' -d '{"docs": [{"_index": "<elastic_index_name>","_type": "_doc","term_statistics": true,"fields": ["spelling","spelling.whitespace"],"doc": {"spelling": "4554185"}}]}' -H "Content-Type: application/json"
I have never used the preference parameter; maybe your syntax is not correct?
In any case, you should at least change your number of replicas as per @rbayet's answer (your cluster is yellow). I still suggest you also switch the number of shards to 1.
Regards
Hi @romainruaud, as described here, we can specify shards or nodes with the preference parameter: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/docs-multi-termvectors.html
But it is not working as expected; it always uses a random node/shard.
Ok, as I said, we did not test this before, so maybe it's not working as intended internally with Elasticsearch.
In any case, this would not be doable "as is" with the current code base, it would require injecting the targeted shards in the _mtermvectors request.
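For reference, here is a rough sketch of what injecting a shard target into the request body could look like. It is not Elasticsuite code: the function name is hypothetical, and the per-document `routing` key is an assumption based on the term vectors documentation (which suggests routing as the way to hit a particular shard) that should be checked against your Elasticsearch version. Pinning a shard would only make the answer stable; the statistics would still come from that one shard.

```python
import json
from typing import Any, Dict, List, Optional


def build_mtermvectors_body(
    index: str,
    text: str,
    fields: List[str],
    routing: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an _mtermvectors body for one artificial document."""
    doc: Dict[str, Any] = {
        "_index": index,
        "term_statistics": True,
        "fields": list(fields),
        "doc": {"spelling": text},
    }
    if routing is not None:
        # Assumption: a per-document "routing" value selects the shard.
        doc["routing"] = routing
    return {"docs": [doc]}


body = build_mtermvectors_body(
    "<elastic_index_name>",
    "4554185",
    ["spelling", "spelling.whitespace"],
    routing="0",
)
payload = json.dumps(body)
```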
So for now, our suggestion remains the same: switch to 1 shard / 2 replicas.
Thank you
This issue was waiting update from the author for too long. Without any update, we are unfortunately not sure how to resolve this issue. We are therefore reluctantly going to close this bug for now. Please don't hesitate to comment on the bug if you have any more information for us; we will reopen it right away! Thanks for your contribution.