elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

Match query on date property with format 'uuuu' (year) acts as a range query for years below 1971

Open yrodiere opened this issue 3 years ago • 1 comments

Elasticsearch Version

8.4.2

Installed Plugins

No response

Java Version

bundled

OS Version

Linux yrodiere.redhat 5.19.8-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Sep 8 19:02:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

Since Elasticsearch 8.4.2, running a match query on a field of type date with format uuuu will give strange results for years below 1971, similar to running a range query with the range [<your year>, 1970].

Yes, that's very weird.

This bug was not present in Elasticsearch 8.4.1.

Steps to Reproduce

# Just to make it easier to re-run the test
curl -XDELETE localhost:9200/mytest/\?pretty 1>&2 2>/dev/null || true
# Property of type `date` with format `uuuu`
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest/\?pretty -d'{"mappings":{"properties":{"myyear":{"type":"date","format":"uuuu"}}}}'
# Index year 1971
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest/_doc/1\?pretty\&refresh=true -d'{"myyear":"1971"}'
# Index two different years below 1971
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest/_doc/2\?pretty\&refresh=true -d'{"myyear":"1970"}'
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest/_doc/3\?pretty\&refresh=true -d'{"myyear":"1969"}'
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest/_doc/4\?pretty\&refresh=true -d'{"myyear":"1500"}'
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest/_doc/6\?pretty\&refresh=true -d'{"myyear":"0050"}'
curl -XPUT -H "Content-Type: application/json" localhost:9200/mytest/_doc/7\?pretty\&refresh=true -d'{"myyear":"-42"}'
# Search for year 1970: only 1970 matches
curl -XPOST -H "Content-Type: application/json" localhost:9200/mytest/_search\?pretty -d'{"query":{"match":{"myyear":{"query":"1970"}}}}'
# Search for year 1969: both 1969 and 1970 match (???) but not 1971
curl -XPOST -H "Content-Type: application/json" localhost:9200/mytest/_search\?pretty -d'{"query":{"match":{"myyear":{"query":"1970"}}}}'
# Search for year 1500: 1500, 1969 and 1970 match (???) but not 1971
curl -XPOST -H "Content-Type: application/json" localhost:9200/mytest/_search\?pretty -d'{"query":{"match":{"myyear":{"query":"1500"}}}}'
# Search for year 0001: 0050, 1500, 1969 and 1970 match (???) but not 1971
curl -XPOST -H "Content-Type: application/json" localhost:9200/mytest/_search\?pretty -d'{"query":{"match":{"myyear":{"query":"0001"}}}}'

Logs (if relevant)

(Incorrect) result of the last two requests in the script above with Elasticsearch 8.4.2:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mytest",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "myyear" : "1970"
        }
      },
      {
        "_index" : "mytest",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "myyear" : "1969"
        }
      },
      {
        "_index" : "mytest",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "myyear" : "1500"
        }
      }
    ]
  }
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mytest",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "myyear" : "1970"
        }
      },
      {
        "_index" : "mytest",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "myyear" : "1969"
        }
      },
      {
        "_index" : "mytest",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "myyear" : "1500"
        }
      },
      {
        "_index" : "mytest",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "myyear" : "0050"
        }
      }
    ]
  }
}

(Correct) result of the last two requests in the script above with Elasticsearch 8.4.1:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "mytest",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "myyear" : "1500"
        }
      }
    ]
  }
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

yrodiere avatar Sep 21 '22 12:09 yrodiere

So, actually, this "range" thing probably affects all years, because years above 1971 will never get a match... That makes me think Elasticsearch tries to match e.g. range [1971, 1970] for year 1971, which never matches anything.

yrodiere avatar Sep 21 '22 12:09 yrodiere

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine avatar Sep 21 '22 18:09 elasticsearchmachine

Seems like all types of queries against date fields are broken.

"year": {
    "type": "date",
    "format": "yyyy",
    "ignore_malformed": true
},

If I have a document indexed with year set to 1900 then all queries with year set to a value less than or equal to 1900 will produce a match. This did not happen in previous versions of elastic.

GET foobar/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "year": "1000"
                    }
                }
            ]
        }
    }
}

Here's the output from the explain API:

{
  "_index": "foobar",
  "_id": "1",
  "matched": true,
  "explanation": {
    "value": 0,
    "description": "ConstantScore(year:[-30610224000000 TO 86399999])^0.0",
    "details": []
  }
}

So instead of giving me documents where the year is 1000 it's giving me documents where the year is between 1000 and 1972.

freost avatar Sep 26 '22 13:09 freost

Can add some observations:

GET somestuff/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "start_year": {
                            "gte": "1900"
                        }
                    }
                }
            ]
        }
    }
}

is working but when I change to gt it only gives results for years higher then 1970. Another observation is that

GET somestuff/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "end_year": {
                            "lte": "1900"
                        }
                    }
                }
            ]
        }
    }
}

does not work at al and gives results including 1970 but lt is filtering as intended. The 1970 part is suspiciously similar to the unix epoch?

eirikboy avatar Sep 27 '22 08:09 eirikboy

Thanks for reporting this! Can you post the output of the query with ?explain=true, which should give us an idea of how the query is being parsed to lucene query objects.

romseygeek avatar Sep 27 '22 09:09 romseygeek

_explanation for gte that is working

"_explanation": {
    "value": 0,
    "description": "ConstantScore(source_start_year:[-2208988800000 TO 9223372036854775807])^0.0",
    "details": []
}

which gives

datetime.datetime.fromtimestamp(-2208988800000/1000)
datetime.datetime(1900, 1, 1, 1, 0)

_explanation for gt that fails

"_explanation": {
    "value": 0,
    "description": "ConstantScore(source_start_year:[86400000 TO 9223372036854775807])^0.0",
    "details": []
}

which gives

datetime.datetime.fromtimestamp(86400000/1000)
datetime.datetime(1970, 1, 2, 1, 0)

_explanation for lte that fails

"_explanation": {
    "value": 0,
    "description": "ConstantScore(source_end_year:[-9223372036854775808 TO 86399999])^0.0",
    "details": []
}

which gives

datetime.datetime.fromtimestamp(86399999/1000)
datetime.datetime(1970, 1, 2, 0, 59, 59, 999000)

_explanation for lt that is working

"_explanation": {
    "value": 0,
    "description": "ConstantScore(source_end_year:[-9223372036854775808 TO -2208988800001])^0.0",
    "details": []
}

which gives

datetime.datetime.fromtimestamp(-2208988800001/1000)
datetime.datetime(1900, 1, 1, 0, 59, 59, 999000)

eirikboy avatar Sep 27 '22 10:09 eirikboy

I have updated my original comment with the explain output.

freost avatar Sep 27 '22 10:09 freost

So yyyy is broken but YYYYworks. Noticed the difference in this pr https://github.com/elastic/elasticsearch/commit/330c97e6d0a369a632779a5f358d3329adfbbdfd#diff-27cb6b4759805385f2b88731b2071c644258c9842be25536d2c7b31314f1348d

to replicate issue:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date",
        "format": "yyyy"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{ "date": "1920" }

GET my-index-000001/_search?explain=true
{
  "query": {
    "bool": {
      "filter": [
        {"term": {
          "date": "1920"
        }}
      ]
    }
  }
}

"_explanation": {
    "value": 0,
    "description": "ConstantScore(date:[-1577923200000 TO 86399999])^0.0",
    "details": []
}

and the one that works as expected

PUT my-index-000002
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date",
        "format": "YYYY"
      }
    }
  }
}

PUT my-index-000002/_doc/1
{ "date": "1920" }

GET my-index-000002/_search?explain=true
{
  "query": {
    "bool": {
      "filter": [
        {"term": {
          "date": "1920"
        }}
      ]
    }
  }
}

"_explanation": {
    "value": 0,
    "description": "ConstantScore(date:[-1578182400000 TO -1578096000001])^0.0",
    "details": []
}

eirikboy avatar Sep 27 '22 14:09 eirikboy

@pgomulka do you this this might be a side effect of #89693

romseygeek avatar Sep 27 '22 15:09 romseygeek

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine avatar Sep 27 '22 16:09 elasticsearchmachine

Will this https://github.com/elastic/elasticsearch/pull/90458 be backported to 8,4? :)

eirikboy avatar Sep 29 '22 11:09 eirikboy

Not sure about 8.4 but most likely 8.5 will get this

pgomulka avatar Sep 29 '22 11:09 pgomulka

Is there an ETA on this fix? It's a critical part of our solution since we’re searching in historic documents without complete dates.

eirikboy avatar Sep 29 '22 11:09 eirikboy

we cannot share ETA for this fix. I can assure you that we are actively working on a fix to get merged asap (subscribe to that PR changes) sadly the only workaround would be to use exact dates and don't rely on fields being defaulted so sth like

GET somestuff/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "start_year": {
                            "gte": "1900-01-01T00:00:00.000"

                        }
                    }
                }
            ]
        }
    }
}

let me know if that helps

pgomulka avatar Sep 29 '22 12:09 pgomulka