apm-server icon indicating copy to clipboard operation
apm-server copied to clipboard

"failed to parse field [error.grouping_name] of type [keyword] in document" after upgrading to 8.15

Open carsonip opened this issue 1 year ago • 4 comments

Full error:

{
  "error": {
    "root_cause": [
      {
        "type": "document_parsing_exception",
        "reason": "[1:2013] failed to parse field [error.grouping_name] of type [keyword] in document with id 'OcTp4ZEBkEnqNYvJrysL'. Preview of field's value: '14 UNAVAILABLE: '"
      }
    ],
    "type": "document_parsing_exception",
    "reason": "[1:2013] failed to parse field [error.grouping_name] of type [keyword] in document with id 'OcTp4ZEBkEnqNYvJrysL'. Preview of field's value: '14 UNAVAILABLE: '",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Cannot index data directly into a field with a [script] parameter"
    }
  },
  "status": 400
}

APM Server version (apm-server version): Upgrade from pre-8.15 to 8.15+

Description of the problem including expected versus actual behavior: indexing errors affects apm-server upgraded from pre-8.15 to 8.15+, where reroute processor is used in logs-apm.error@custom.

Steps to reproduce:

Here's a minimal reproducible example without actually upgrading

PUT _index_template/logs-foo
{
  "data_stream": {},
  "index_patterns": ["logs-foo-*"], 
  "composed_of": ["logs-foo@mappings"],
  "ignore_missing_component_templates": ["logs-foo@mappings"],
  "template": {
    "settings": {
      "index.default_pipeline": "logs-foo@old-pipeline"
    }
  }, 
  "priority": 250
}

PUT _component_template/logs-foo@mappings
{
 "template": {
   "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "foo": {
          "type": "keyword"
        }
      }
    }
 } 
}

PUT _ingest/pipeline/logs-foo@old-pipeline
{
  "processors": [
    {
      "set": {
        "field": "foo",
        "value": "old"
      }
    },
    {
      "pipeline": {
        "name": "logs-foo@pipeline"
      }
    }
  ]
}

PUT _ingest/pipeline/logs-foo@pipeline
{
  "processors": [
  ]
}

POST logs-foo-default/_doc
{
  "@timestamp": "1970-01-01T00:00:00.000Z"
}

GET logs-foo-*/_search
{
  "fields": ["foo"]
}

PUT _index_template/logs-foo
{
  "data_stream": {},
  "index_patterns": ["logs-foo-*"], 
  "composed_of": ["logs-foo@mappings"],
  "ignore_missing_component_templates": ["logs-foo@mappings"],
  "template": {
    "settings": {
      "index.default_pipeline": "logs-foo@pipeline"
    }
  }, 
  "priority": 250
}

PUT _component_template/logs-foo@mappings
{
 "template": {
   "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "foo": {
          "type": "keyword",
          "script": {
            "source": "emit(\"new\")",
            "lang": "painless"
          }
        }
      }
    }
 } 
}


PUT _ingest/pipeline/logs-foo@pipeline
{
  "processors": [
    {"reroute": {
      "namespace": ["mynamespace"]
    }}
  ]
}

POST logs-foo-default/_doc
{
  "@timestamp": "1970-01-01T00:00:00.000Z"
}

Error from this example:

{
  "error": {
    "root_cause": [
      {
        "type": "document_parsing_exception",
        "reason": "[1:120] failed to parse field [foo] of type [keyword] in document with id 'kWE245EB4mZ7zYVt-HD2'. Preview of field's value: 'old'"
      }
    ],
    "type": "document_parsing_exception",
    "reason": "[1:120] failed to parse field [foo] of type [keyword] in document with id 'kWE245EB4mZ7zYVt-HD2'. Preview of field's value: 'old'",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Cannot index data directly into a field with a [script] parameter"
    }
  },
  "status": 400
}

carsonip avatar Sep 11 '24 22:09 carsonip

The reason behind the bug, e.g. upgrading from 8.13.2 to 8.15.0

  • apm-server always (unless apm-server.data_streams.namespace is set in apm-server.yml) issues the bulk request to logs-apm.error-default. It will be subject to ingest pipeline of current backing index of DS logs-apm.error-default.
  • 8.13.2 uses ingest pipeline to set error.grouping_name, while 8.15.0 sets it within a script in the mapping.
  • 8.13.2 uses ingest pipeline logs-apm.error-8.13.2, while 8.15.0 uses ingest pipeline logs-apm.error@default-pipeline. Both ingest pipeline references logs-apm.error@custom where the reroute processor is placed.
  • Let's say reroute processor creates logs-apm.error-some_custom_namespace
  • after 8.15 upgrade, logs-apm.error-default is not rolled over while logs-apm.error-some_custom_namespace is.

The state of the system after the above:

  • Current backing index of logs-apm.error-default is from 8.13.2, which sets error.grouping_name, then calls logs-apm.error@custom which performs reroute.
  • The event is reroute to logs-apm.error-some_custom_namespace, where backing index has the mapping from 8.15.0, where it sets error.grouping_name using a script in mapping (not ingest pipeline).
  • It is forbidden to use a script in mapping for error.grouping_name, when the _source already contains error.grouping_name, from the 8.13.2 ingest pipeline. Therefore, indexing fails with reason "Cannot index data directly into a field with a [script] parameter".

carsonip avatar Sep 11 '24 22:09 carsonip

after 8.15 upgrade, logs-apm.error-default is not rolled over while logs-apm.error-some_custom_namespace.

That seems to be the crux of the issue. I suspect the issue is that we're relying on lazy rollover, which I suppose does not trigger when rerouting. I'll check on this.

axw avatar Sep 12 '24 01:09 axw

Confirmed above hypothesis:

  1. Create an index template which sets a default ingest pipeline with reroute
  2. Create a data stream matching the index template
  3. Send a document to the data stream; it will be rerouted
  4. Create another index template with higher priority with the same index pattern, with no default ingest pipeline
  5. Rollover the source data stream with "lazy=true"
  6. Send a document to the data stream; it will still be rerouted
  7. Rollover the source data stream with "lazy=false"
  8. Send a document to the data stream; it will not be rerouted

axw avatar Sep 12 '24 02:09 axw

https://github.com/elastic/elasticsearch/issues/112781

axw avatar Sep 12 '24 02:09 axw

With https://github.com/elastic/elasticsearch/issues/112781 being fixed, is there anything else that needs to be implemented directly in the APM Server or does this just require some testing?

simitt avatar Dec 02 '24 07:12 simitt

We should test it, but I don't think there's anything more to implement for this.

axw avatar Dec 02 '24 08:12 axw

moving this from it105 to it106

raultorrecilla avatar Jan 10 '25 23:01 raultorrecilla

Given that 8.15 is now not supported anymore, what should still be tested is that this kind of error doesn't happen anymore once a deployment is upgraded to > 8.15. I suggest to run a quick manual test with (1) recreating the problem shared in the description (2) upgrade to 8.17.last, ingest data and confirm that the error is not occuring anymore

For step (1), likely 8.15.0 or 8.15.1 would be the best versions for reproducing the issue as several issues were fixed in the last minor versions of 8.15. (reach out if you need help reproducing).

simitt avatar Feb 20 '25 17:02 simitt

v8.15.0

I do get the same error as @carsonip outlined:

{
  "error": {
    "root_cause": [
      {
        "type": "document_parsing_exception",
        "reason": "[1:120] failed to parse field [foo] of type [keyword] in document with id 'NLh7JZUB20ThadfQMTAd'. Preview of field's value: 'old'"
      }
    ],
    "type": "document_parsing_exception",
    "reason": "[1:120] failed to parse field [foo] of type [keyword] in document with id 'NLh7JZUB20ThadfQMTAd'. Preview of field's value: 'old'",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Cannot index data directly into a field with a [script] parameter"
    }
  },
  "status": 400
}

v8.15.2

There is no available v8.15.1 on GCP. I also check the region that contains the SNAPSHOTS. Give the same issue

{
  "error": {
    "root_cause": [
      {
        "type": "document_parsing_exception",
        "reason": "[1:120] failed to parse field [foo] of type [keyword] in document with id '_n-AJZUB4qKmXkI0gAWF'. Preview of field's value: 'old'"
      }
    ],
    "type": "document_parsing_exception",
    "reason": "[1:120] failed to parse field [foo] of type [keyword] in document with id '_n-AJZUB4qKmXkI0gAWF'. Preview of field's value: 'old'",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Cannot index data directly into a field with a [script] parameter"
    }
  },
  "status": 400
}

v8.17.2

Issue seems to be resolved

{
  "_index": ".ds-logs-foo-default-2025.02.20-000001",
  "_id": "yqKBJZUBAXHl19DUaCaE",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

rubvs avatar Feb 20 '25 22:02 rubvs

I'll go ahead and close this.

rubvs avatar Feb 20 '25 22:02 rubvs

@rubvs just to confirm - you tested with upgrading a deployment from 8.15 to 8.17.2, not creating a new deployment on 8.17.2, correct?

simitt avatar Feb 21 '25 07:02 simitt

Thanks for double checking with me @simitt. Yes, let me outline my steps. I double checked it by also testing Andrew's method:

Test 1: Carson Approach

  1. Start with v8.15.0
  2. Run the steps outlined in this PR desc
  3. Verify that the error does occur when trying to reroute:
PUT _ingest/pipeline/logs-foo@pipeline
{
  "processors": [
    {"reroute": {
      "namespace": ["mynamespace"]
    }}
  ]
}

POST logs-foo-default/_doc
{
  "@timestamp": "1970-01-01T00:00:00.000Z"
}
  1. Manually upgrade to 8.17.2 via Kibana
  2. Re-add the doc and ensure it is indeed rerouted to the correct namespace.
# 1. Here, we are adding a doc to "default" namespace

POST logs-foo-default/_doc
{
  "@timestamp": "1970-01-01T00:00:00.000Z"
}

# 2. But the doc is successfully reroute to "mynamespace" namespace

{
  "_index": ".ds-logs-foo-mynamespace-2025.02.22-000001",
  "_id": "54rTL5UBhd3e6_WlhwSD",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 2
}
  1. That is to say, upgrading from 8.15.0 to 8.17.2 solved the issue
  2. Repeat starting with 8.15.2 & 8.16.4
  3. In 8.16.4, the issue is fixed WITHOUT needing an upgrade.

Test 2: Andrew Approach

  1. Start with v8.15.0
  2. Run all steps outlined my Andrew's approach in PR
  3. Verify that Step 6 does in fact STILL reroute to the namespace (foo in this case).
    • ".ds-demo-dataset-foo-2025.02.22-000002"
  4. Manually upgrade to 8.17.2 via Kibana
  5. Rerun Step 6 to verify the doc is NOT rerouted.
    • ".ds-demo-dataset-default-2025.02.22-000002"
  6. That is to say, upgrading from 8.15.0 to 8.17.2 solved the issue
  7. Repeat starting with 8.15.2 & 8.16.4
  8. In 8.16.4, the issue is fixed WITHOUT needing an upgrade.

rubvs avatar Feb 22 '25 22:02 rubvs

Thanks for confirmation and providing the testing details!

simitt avatar Feb 24 '25 08:02 simitt

Hi, I'm working on potentially adding a test case for this in functional tests: https://github.com/elastic/apm-server/issues/14061. May I know what would be the reproducible steps of the original error? Simply create reroute processor in logs-apm.error@custom and upgrade, then check ingest?

ericywl avatar Mar 20 '25 06:03 ericywl

@ericywl pretty much, yes. High level, the problem was that in the impacted versions, there was no automatic rollover triggered on update due to the lazy rollover. With lazy rollover usually the rollover happens on first write request, but this wasn't triggered with the reroute processor, therefore no rollover happened and new settings/mappings weren't picked up.

simitt avatar Mar 20 '25 07:03 simitt