elasticsearch
elasticsearch copied to clipboard
Using reindex api with slices in creating child documents have different routing id from parent id
Elasticsearch version (bin/elasticsearch --version
): 6.1.3
JVM version (java -version
): 1.8-56
OS version (uname -a
if on a Unix-like system): Ubuntu 16 LTS
Description of the problem including expected versus actual behavior:
When creating child documents using reindex api with slicing and painless scripting, I noticed that value for routing id was different from the parent id. When I re-tried without slicing I noticed expected behavior. To isolate the issue, I created single node single shard index.
Steps to reproduce:
- I first indexed all the parent docuements. I made following call using reindex API
POST _reindex?slices=9&requests_per_second=-1&wait_for_completion=true
{
"source": {
"index": "index-cutover-v2",
"type": "files",
"query": {"terms": {"filesets": [31712]}}
},
"script": {
"source": "if (ctx._type == 'files') {ctx._source.file_flag_join = params.file_flag_join; ctx._id=ctx._source.id+'file'; ctx._type='doc' } else {ctx.op = 'noop'}",
"lang": "painless",
"params": {
"file_flag_join" : {
"name": "file"
}
}
},
"dest": {
"index": "index-cutover-v5"
}
}
- Once the reindexing was done I used _refresh API to make data available for searching.
- After step 2, I indexed child documents using following api call
POST _reindex?slices=9&requests_per_second=-1&wait_for_completion=true
{
"source": {
"index": "index-cutover-v2",
"type": "flags",
"query": {"terms": {"filesets": [31712]}}
},
"script": {
"source": "if (ctx._type == 'flags') {ctx._type='doc';ctx._id=ctx._source.id+'flag';ctx._routing=ctx._source.file+'file';ctx._source.file_flag_join = params.file_flag_join;ctx._source.file_flag_join.parent = ctx._routing;} else {ctx.op = 'noop'}",
"lang": "painless",
"params": {
"file_flag_join" : {
"name": "flag"
}
}
},
"dest": {
"index": "index-cutover-v5"
}
}
"
- After step 3 was finished I reran _refresh API.
- I tried to pull single flag document and I received following.
{
"_index": "index-cutover-v5",
"_type": "doc",
"_id": "1186231450flag",
"_score": 2,
"_routing": "110486171file",
"_source": {
"severity": 4,
"reason": "",
"file_flag_join": {
"parent": "110486274file",
"name": "flag"
},
"file": 110486171,
"offset": 10111689,
"filesets": [
29825,
31712
],
"id": 1186231450
}
}
If you notice, _routing and file_flag_join.parent have different values.
If I remove slicing and try again I dont see this behavior and I see consistency in the routing and parent values.
The issue is caused by the inner map file_flag_join
in the params
of the script.
If this inner map is copied in the context document and then modified to add new fields, this also modifies the original params
. It works when slicing is off because you always override the new field with its new value but when multiple slices execute the same script they modify this inner map concurrently which leads to a race condition.
The solution is to clone the map in the source
like this:
ctx._source.file_flag_join = new HashMap(params.file_flag_join)
We could force a deep copy of the params
before execution or make the inner map immutable to avoid this trappy behavior, @nik9000 WDYT ?
I think it'd be breaking to make those maps immutable but it is pretty tempting. It feels cleaner than making a copy every time though.
@nik9000 what should we do here? this feels like serious bug?
Either of the things I suggested should work to be honest. It might be better to make a script context for reindex and see if that solves the problem. It might. And if it doesn't we can build on it from there.
Pinging @elastic/es-distributed
@henningandersen and I discussed this issue during a triage session and we think this issue should be moved to the script label as the issue seems to be related to how the script is set up using a non thread-safe data structure in the script (maybe this should be forbidden?)
Pinging @elastic/es-core-infra (Team:Core/Infra)
This is an on-going issue with params, we should provide a way for users to access params in a safe way. Since we have metadata()
and field(<path>)
as replacements for ctx
, perhaps a fields-like params API would be worthwhile.