sql icon indicating copy to clipboard operation
sql copied to clipboard

[BUG] MAX/MIN functions does not work properly with array values

Open normanj-bitquill opened this issue 1 year ago • 6 comments

What is the bug? If MAX or MIN are applied to a field with array values, then the MAX or MIN of any element in any of the array values is returned.

Consider an index with the following data:

{"x": 1, "y": [1, 2]}
{"x": 2, "y": [3, 4]}
{"x": 3, "y": [1, 5]}
{"x": 4, "y": [1, 2]}
{"x": 5, "y": [2, 3]}

and this query:

SELECT MAX(y) FROM test3;
5

or this query:

SELECT MIN(y) FROM test3;
1

For MAX, the expectd result is [3, 4] and for MIN, the expected result is [1, 2]. The comparison should be performed element by element and preserve the entire array value.

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Create a new index and load the data above
  2. Run the query above

What is the expected behavior? Use array comparisons to compare the array values and return the MAX or MIN array.

What is your host/environment?

  • OS: Mac OS X
  • Version: 3.0 code base
  • Plugins: SQL plugin

Do you have any screenshots? N/A

Do you have any additional context? Issue #1300 had a change recently merged in that allows array values to be used in query evaluation and in the result set.

normanj-bitquill avatar Oct 29 '24 17:10 normanj-bitquill

This appears to be due to how the OpenSearch engine calculates the min and max for arrays values.

POST test3/_search
{
  "size": 0,
  "aggs": {
    "y_min": {
      "min": {
        "field": "y"
      }
    }
  }
}
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "y_min": {
      "value": 1
    }
  }
}
POST test3/_search
{
  "size": 0,
  "aggs": {
    "y_max": {
      "max": {
        "field": "y"
      }
    }
  }
}
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "y_max": {
      "value": 5
    }
  }
}

normanj-bitquill avatar Oct 30 '24 17:10 normanj-bitquill

For reference, this is how PostgreSQL behaves:

SELECT * FROM with_array;
 x |   y
---+-------
 1 | {1,2}
 2 | {3,4}
 3 | {1,5}
 4 | {1,2}
(4 rows)
SELECT MIN(y) FROM with_array;
  min
-------
 {1,2}
(1 row)
SELECT MAX(y) FROM with_array;
  max
-------
 {3,4}
(1 row)

normanj-bitquill avatar Oct 30 '24 17:10 normanj-bitquill

Thanks for rasing issue. OpenSearch/Lucene does not support ARRAY data type. Lucene allows adding multiple values for the same field name. for example. when calcualte min/max of field y, OpenSeach read all the value of y, and calcuate metrics, the result is min(y)=1, max(y)=5.

field,value
y,1
y,2
y,3
y,4
y,1
y,5
y,1
y,2
y,2
y,3

penghuo avatar Oct 30 '24 18:10 penghuo

Related issue in opensearch-core. https://github.com/opensearch-project/OpenSearch/issues/16420

penghuo avatar Oct 30 '24 18:10 penghuo

One solution is if array_field is used in aggregation, we should do post-processing, instead of rewrite as DSL aggregation query.

penghuo avatar Oct 30 '24 18:10 penghuo

@penghuo I'm not sure the SQL plugin can easily know that the field is an array field. By default, the mapping will only contain information about the element types.

GET my_index/mapping
{
  "properties": {
    "id": {
      "type": "long"
    },
    "array_field": {
      "type": "long"
    }
  }
}

If the user adds something to the mapping, we could know that it is an array. https://trino.io/docs/current/connector/elasticsearch.html#array-types

Is there anything similar currently in OpenSearch?

normanj-bitquill avatar Oct 30 '24 18:10 normanj-bitquill

@normanj-bitquill agree, https://github.com/opensearch-project/OpenSearch/issues/16420 discuss similar approach.

penghuo avatar Oct 31 '24 22:10 penghuo

[Catch All Triage - 1, 2, 3, 4, 5]

andrross avatar Nov 18 '24 17:11 andrross

@normanj-bitquill @acarbonetto I've seen the approach suggested here.

IMO we can utilize mapping._meta#field.name.type:array location for the user to explicitly hint the specific mapping field is expected to be treated and an array.

indexing: ingestion a mismatched value will not cause an error (if this field only contains a single value) but the query engine (PPL/SQL) should treat this value as a single list.

IMO in the first iteration we can have the plugin be aware of the mapping._meta#field.name.type:array and actually do the aggregation in the client's scope (not pushing down).

YANG-DB avatar Dec 16 '24 19:12 YANG-DB