tantivy
tantivy copied to clipboard
Add support for `interval` field for Date histogram
As described in https://github.com/quickwit-oss/quickwit/issues/3250, we need to support the OpenSearch API too...
To reach that for the date histogram, we need to support the interval
field in addition to the fixed_interval
field (there is a calendar_interval
field but we don't support it yet).
We discussed this issue with @PSeitz, we have 2 ways of doing it:
- having separate structs to deserialize the JSON formatted date histogram aggregation
- adding a new field
interval
on theDateHistogramAggregationReq
and adding some validation/preprocessing so that we:
- raise an error if both
interval
andfixed_interval
orinterval
andcalendar_interval
are present - if
interval
is present, use it to populatefixed_interval
as we only supportfixed_interval
for now.
@PSeitz Can you take care of this?
We can also discuss long term plans to ditch the JSON representation of aggregations in tantivy, and just have agregation definition "objects" in there. We would then have a JSON -> Aggregation definition object translation layer in quickwit instead. However we need to support this for airmail.
Elastic search 7.2:
[7.2] Deprecated in 7.2.
interval
field is deprecated Historically both calendar and fixed intervals were configured in a single interval field, which led to confusing semantics. Specifying 1d would be assumed as a calendar-aware time, whereas 2d would be interpreted as fixed time. To get "one day" of fixed time, the user would need to specify the next smaller unit (in this case, 24h).
Mapping interval
to fixed_interval
is only an option for values that are fixed interval in es, otherwise that would break elastic search compatibility.
Elastic search and OpenSearch should not be mixed in the same API endpoint, if they are not compatible, which seems to be the the case for date histogram. So I think we need an explicit OpenSearch endpoint in quickwit.
We don't need separate structs, it could be just a union which then gets validated for elastic search or open search compatibility.
As of now, I don't think we gain something by ditching the JSON representation of aggregations in tantivy, but it would cause a lot of code duplication.
oh yes sorry for my lack of accuracy, @PSeitz you indeed mentioned the solution of having a dedicated endpoint for OpenSearch.
Mapping interval to fixed_interval is only an option for values that are fixed interval in es, otherwise that would break elastic search compatibility.
Agreed. But we only need to handle fixed intervals in interval
for airmail
. We could support only that.
Elastic search and OpenSearch should not be mixed in the same API endpoint, if they are not compatible, which seems to be the the case for date histogram. So I think we need an explicit OpenSearch endpoint in quickwit.
Looking at the history of Elasticsearch, I don't see this as an incompatibility but as a support for older elasticsearch version. And when I look at this opensearch issue I would expect them to remove the interval support in the future (though it can take some time).
I think it's overkill to distinguish OpenSearch and Elasticsearch API just for this interval
field. That being said, I acknowledge that it would be good to check all endpoints needed for airmail and check if we have other issues like this one. I'm going to do that tonight.
@PSeitz Just to confirm that I did not find other queries with an API difference between Elasticsearch en OpenSearch for the airmail
project.
I had a look at the following queries: query_string
, exists
, term
, percentiles
, multi_match
.
@PSeitz do you need more info on your side?
Looking at the OpenSearch docs, it's unclear if they handle calendar aware intervals or not.
So should we implement the interval
halfway, only with values higher than 1?
For airmail
, we need to support: \d+s
, \d+m
, \d+h
.
I think we can only support what we support currently for the fixed_interval
. On OpenSearch
, you can have a look at the code here.
1m
is date-aware in elastic search, this is incompatible with the fixed interval handling we have. Most users are familiar with elastic search and they would all run into the same API issue (probably unknowingly for some time).
2m
etc. is fixed interval on the interval
field.
you mean 1d
was date-aware in elasticsearch? So to avoid issues, we could only support ms
, s
, m
, h
and avoid the problematic d
.
1m
, 1h
, 1d
, 1w
, 1M
, 1q
, 1y
. They are all date aware. All other values are not, e.g. 24h
, 2w
etc.
So we could implement the interval halfway, only for values higher than 1. It's a weird API, and will probably cause some user issues, but at least not returning wrong results.
I think it would be better if airmail could call the new API.
Ok, I have just understood how this was broken before... I thought the issue was only on 1d
when reading the comment you quoted previously.
Moreover, I saw yesterday in the OpenSearch code that it should be already possible to use the fixed_interval
parameter. It's just not documented. This old spec is crazier than I thought... Let's try to push back this on the client side.