sql icon indicating copy to clipboard operation
sql copied to clipboard

[BUG] PPL cannot properly handle fields with mixed keyword and text data types across multiple indices

Open penghuo opened this issue 2 months ago • 4 comments

What is the bug? PPL cannot properly handle fields with mixed keyword and text data types across multiple indices. Currently, when PPL merges field types, it randomly selects one. For example, if the msg field is mapped as both text and keyword across indices, PPL may select keyword. As a result, when PPL attempts to fetch the msg value from doc_values on each shard, it will return null for documents where msg is defined as a text field, since text fields do not have doc_values.

How can one reproduce the bug?

  • Prepare dataset, create 2 indices, msg field type is Text in index1, and keyword in index2.
###
PUT {{baseUrl}}/log00001
Content-Type: application/x-ndjson

{
  "mappings": {
    "properties": {
      "msg": {"type": "text"}
    }
  }
}

###
PUT {{baseUrl}}/log00002
Content-Type: application/x-ndjson

{
  "mappings": {
    "properties": {
      "msg": {"type": "keyword"}
    }
  }
}

###
POST {{baseUrl}}/log00001/_bulk
Content-Type: application/x-ndjson

{"index": { "_id": 1 }}
{"msg": "status=200", "index": 1}

###
POST {{baseUrl}}/log00002/_bulk
Content-Type: application/x-ndjson

{"index": { "_id": 1 }}
{"msg": "status=200", "index": 2} 
  • Run query, we expected get 2 results. but only get 1 results back
### Expected
POST {{baseUrl}}/_plugins/_ppl/
Content-Type: application/x-ndjson

{
  "query": "source=log*"
}

  "datarows": [
    [
      "status=200",
      1
    ],
    [
      "status=200",
      2
    ]
  ]

### Unexpected
POST {{baseUrl}}/_plugins/_ppl/
Content-Type: application/x-ndjson

{
  "query": "source=log* | rex field=msg 'status=(?<statusCode>\\d+)' | where statusCode=200"
}

  "datarows": [
    [
      "status=200",
      2,
      "200"
    ]
  ]

What is the expected behavior? Return 2 results

What is your host/environment?

  • OS: [e.g. iOS]
  • Version [e.g. 22]: Main
  • Plugins

penghuo avatar Oct 24 '25 15:10 penghuo

Additional Analysis and Information

The issue arises when executing PPL queries against wildcard indices with conflicting field mappings:

  POST {{baseUrl}}/_plugins/_ppl/
  Content-Type: application/x-ndjson

  {
    "query": "source=log* | rex field=msg 'status=(?<statusCode>\\d+)' | where statusCode=200"
  }

Expected Result: Multiple matching documents across indicesActual Result: Limited results due to silenced shard failures:

  {
    "datarows": [
      [
        "status=200",
        2,
        "200"
      ]
    ]
  }

By default, OpenSearch silences partial shard failures and returns available results. To expose the underlying errors, disable partial results:

  PUT {{baseUrl}}/_cluster/settings
  Content-Type: application/x-ndjson

  {
    "transient": {
      "search.default_allow_partial_results": "false"
    }
  }

Root Cause Analysis

  1. Non-Deterministic Mapping Merge

When querying wildcard indices, OpenSearch merges field mappings from all matching indices. The critical issue: when field types conflict, the selection is non-deterministic due to random ordering of the index mapping list. Code Link

  • Impact: A field may be treated as int or keyword randomly, depending on which index mapping is processed last
  1. Calcite Script Engine Optimization Conflict

The Calcite script engine optimizes text field performance by treating them as keywords when multifields are present, accessing doc_values for better performance. Code Link

  • Problem: When the merged mapping indicates keyword type but an index actually has text-only fields, doc_values access fails.

    Concrete Example

  Index: logs-1
  "body": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
  Index: logs-2
  "body": {
    "type": "text",
    "norms": false
  }

The Problem: Even though both indices define body as text, the presence of multifields in logs-1 causes intermittent failures. During mapping merge:

  1. If logs-1 mapping is selected → Script engine assumes keyword type available
  2. Script engine attempts doc_values access for performance
  3. When processing logs-2 documents → Failure occurs (no keyword field/doc_values)

Action Items apart from bug fix

  • Run Integration Tests (ITS) with search.default_allow_partial_results=false to uncover any existing partial shard failure scenarios across the codebase

vamsimanohar avatar Oct 24 '25 17:10 vamsimanohar

I will look into it.

xinyual avatar Oct 27 '25 06:10 xinyual

@penghuo have any conclusion on this? do we need more discussions?

LantaoJin avatar Oct 27 '25 07:10 LantaoJin

  • Short-term solution: Implement a fast-fail mechanism when incompatible field index types are detected (e.g., keyword vs. text).
  • Long-term solution:
    • Option 1: Enable PPL to execute the query plan on each shard independently. This allows PPL to tolerate incompatible field index types across indices. For example, in an expression like REX(field=msg, 'status=(?<statusCode>\d+)'), PPL can generate a docValues fetch operation on shards where the field type is keyword, and a _source fetch operation on shards where the field type is text.

penghuo avatar Nov 05 '25 17:11 penghuo