
[BUG] Dots Discovered Key Names

Open Conklin-Spencer-bah opened this issue 1 year ago • 4 comments

Describe the bug

Keys that contain a dot (".") cannot be processed.

When ingesting logs via FluentBit -> S3 -> SQS -> Data Prepper / OSIS -> OpenSearch, any key that contains a dot (".") causes an error on ingestion; see the error from OSIS below. I believe this is because the Kubernetes metadata in labels contains dots.

2024-09-24T14:08:46.611 [s3-log-pipeline-sink-worker-2-thread-2] WARN  org.opensearch.dataprepper.plugins.sink.opensearch.BulkRetryStrategy - operation = Index, status = 400, error = can't merge a non object mapping [kubernetes.labels.app] with an object mapping

The JSON blob looks like this:

    "labels": {
      "app": "fooservice",
      "app.kubernetes.io/component": "foo",
      "app.kubernetes.io/instance": "foo-in-cluster",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "fooservice",
      "app.kubernetes.io/version": "somelonghash",

If these labels aren't in the log, ingestion succeeds. One challenge is that the labels vary from service to service, so predicting what they will be is difficult. It would be preferable to have a way to say: if a key contains a "." (or some other character), substitute it with "_" or whatever the user chooses.

It is possible that this can already be done and I am simply unaware of how to do so.

To Reproduce

Attempt to process and ingest a log file into OpenSearch with Data Prepper, where the log has keys that contain dots (".").

Such as:

    "labels": {
      "app": "fooservice",
      "app.kubernetes.io/component": "foo",
      "app.kubernetes.io/instance": "foo-in-cluster",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "fooservice",
      "app.kubernetes.io/version": "somelonghash",

Expected behavior

A key in double quotes is processed as a single key, even when it contains dots.

Environment (please complete the following information):

  • AWS Managed OpenSearch Ingestion Service

Additional context

The issue below seems related, and a fix for it was merged, but it is unclear how that resolves this issue.

https://github.com/opensearch-project/data-prepper/issues/450

Conklin-Spencer-bah avatar Sep 24 '24 17:09 Conklin-Spencer-bah

Thanks for reporting this issue. This is actually a conflict between different field types in OpenSearch, and the document is rejected during indexing because of it. The issue arises because OpenSearch interprets dots "." in field names as paths into nested JSON objects. Let me take your sample data and reduce it a little to illustrate the issue.

Let's say we want to index just the following document in OpenSearch:

{
  "labels": {
    "app": "fooservice",
    "app.kubernetes.io/component": "foo"
  }
}

OpenSearch expands the key app.kubernetes.io/component and gets a conflict:

{
  "labels": {
    "app": "fooservice",
    "app": {                            // Error, is "app" a string or an object?
      "kubernetes": {
        "io/component": "foo"
      }
    }
  }
}
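To make the conflict concrete, here is a minimal Python sketch of the expansion described above. It is a simplified model for illustration only, not OpenSearch's actual mapping code; the function names `expand_dotted` and `_insert` are made up for this example.

```python
def _insert(node, parts, value):
    """Place value at the nested path given by parts, raising on a type conflict."""
    for part in parts[:-1]:
        child = node.setdefault(part, {})
        if not isinstance(child, dict):
            # A scalar already lives where an object is now required.
            raise ValueError(f"conflicting definitions for field '{part}'")
        node = child
    leaf = parts[-1]
    existing = node.get(leaf)
    if isinstance(existing, dict) and isinstance(value, dict):
        # Both sides are objects, so merge them key by key.
        for key, sub_value in value.items():
            _insert(existing, [key], sub_value)
    elif leaf in node:
        raise ValueError(f"conflicting definitions for field '{leaf}'")
    else:
        node[leaf] = value


def expand_dotted(doc):
    """Expand keys such as 'a.b' into {'a': {'b': ...}} (simplified model)."""
    out = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = expand_dotted(value)
        _insert(out, key.split("."), value)
    return out


sample = {"labels": {"app": "fooservice", "app.kubernetes.io/component": "foo"}}
try:
    expand_dotted(sample)
except ValueError as err:
    print("rejected:", err)
```

In this model, the sample document is rejected because `app` would have to be a string and an object at the same time, which mirrors the `can't merge a non object mapping ... with an object mapping` error in the report.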

This issue happens a lot when logging K8s labels or annotations. It would also occur if Fluent Bit wrote to OpenSearch directly, so it is not a bug in Data Prepper per se. You can work around it by replacing the dots "." with underscores "_" using a small Lua script in Fluent Bit. We have developed this snippet for our own use cases. Such a transformation is usually called dedotting, in case you want to search for it.
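The idea behind dedotting can be sketched in a few lines of Python. This is not the Lua snippet mentioned above and not a Data Prepper feature; `dedot` is a hypothetical helper that only illustrates the transformation, and it assumes all keys are strings.

```python
def dedot(value, replacement="_"):
    """Return a copy of value with '.' replaced in every dictionary key."""
    if isinstance(value, dict):
        return {
            key.replace(".", replacement): dedot(item, replacement)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [dedot(item, replacement) for item in value]
    return value  # scalars are returned unchanged


labels = {
    "app": "fooservice",
    "app.kubernetes.io/component": "foo",
    "app.kubernetes.io/instance": "foo-in-cluster",
}
print(dedot(labels))
```

After the rename, `app` and `app_kubernetes_io/component` are unrelated flat fields, so no string-versus-object conflict can arise. One caveat of this sketch: two keys that differ only in "." versus the replacement character would collapse into one.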

Data Prepper faces a similar issue with OpenTelemetry attributes. There, its processors dedot the attribute names by replacing certain dots "." with "@". In that case, the dedotting is hard-coded into the OpenTelemetry processors of Data Prepper. I am not experienced enough with the generic Data Prepper processors to give an example using those. The main problem, to me, is that you would not want to list every field name that should be dedotted in the pipeline configuration. In your example, it could be applied to all fields under labels, but it might be different for others.

Note that any dedotting procedure widens the gap between deployment and observability, because the names in OpenSearch no longer match the names used in the deployment. Unfortunately, there is no easy way around this. The unfolding of dotted names is a major feature of OpenSearch.

KarstenSchnitter avatar Sep 25 '24 21:09 KarstenSchnitter

Thanks for the lead. For whatever reason, doing this fixed it? All the labels and the timestamp still show up in OpenSearch, so it is somewhat puzzling.

  - delete_entries:
        with_keys: ["/kubernetes/labels/app", "ts"]

Conklin-Spencer-bah avatar Oct 02 '24 20:10 Conklin-Spencer-bah

@KarstenSchnitter, thank you for the detailed comment. Do you think having a dedot processor would help here? That could be a useful feature for situations like this, which are somewhat common.

@Conklin-Spencer-bah, I think deleting /kubernetes/labels/app works because you are deleting the string value. With it gone, I expect OpenSearch is creating documents with a structure similar to the following:

{
  "kubernetes": {
    "labels" : {
      "app" : {
        "kubernetes" : {
          "io/component" : "foo",
          "io/instance" : "foo-in-cluster",
          "io/managed-by" : "Helm",
          ...
        }
      }
    }
  }
}

This is also why you needed to delete app. OpenSearch had decided that app is an object, but one app value is a string.
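The effect of that deletion can be illustrated with a small Python sketch. Both helpers are hypothetical and written only for this example: `delete_entry` imitates removing a key by a "/"-separated path, and `conflicting_keys` lists keys that hold a plain value while also being the dotted prefix of another key. Neither is Data Prepper or OpenSearch code.

```python
def delete_entry(doc, pointer):
    """Remove the value at a '/'-separated path, if it exists (hypothetical helper)."""
    *parents, leaf = pointer.strip("/").split("/")
    node = doc
    for part in parents:
        node = node.get(part)
        if not isinstance(node, dict):
            return  # path does not exist, nothing to delete
    node.pop(leaf, None)


def conflicting_keys(fields):
    """Keys holding a non-object value that are also a dotted prefix of another key."""
    return [
        key
        for key, value in fields.items()
        if not isinstance(value, dict)
        and any(other.startswith(key + ".") for other in fields)
    ]


event = {
    "kubernetes": {
        "labels": {
            "app": "fooservice",
            "app.kubernetes.io/component": "foo",
            "app.kubernetes.io/instance": "foo-in-cluster",
        }
    }
}
labels = event["kubernetes"]["labels"]
print(conflicting_keys(labels))   # the plain "app" label clashes with the dotted keys
delete_entry(event, "/kubernetes/labels/app")
print(conflicting_keys(labels))   # nothing clashes once the string value is gone
```

Once the string-valued `app` is removed, every remaining label agrees that `app` is an object, which is why the documents are accepted. The cost is that the original `app` label is no longer indexed.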

dlvenable avatar Oct 03 '24 15:10 dlvenable

Somewhat relatedly, we are working on dynamic key renaming in #4849. The approach in there is to support renaming keys by pattern. Still, dedotting seems a common enough pattern to possibly warrant its own processor.

dlvenable avatar Oct 03 '24 15:10 dlvenable