categorize_text takes a long time on small data sets with low cardinality
I am running a `categorize_text` aggregation on a relatively small data set, and the response time is about 18s for 185k docs and three aggregations, so each aggregation processes about 30k docs/s. I briefly spoke to @droberts195 about this and we are not sure if this is expected. Each aggregation (after removing/adding) seems to take up a third of the response time.

In this case, the aggregations run over `process.executable.text`, `user.name.text` and `host.os.name.text`, which are `match_only_text` fields. Some observations:
- These are low cardinality fields - 1 or 2 values per field
- The aggregation returns no results
- Running the aggregation over the keyword siblings does return results, and takes up the same amount of time.
My questions:
- is 30k docs per second per shard expected?
- can we speed it up (significantly)?
- is not returning any results expected?
- is performance affected by the cardinality of the data set, or only the document count (documents with values)? what other factors come into play?
- how do we determine whether results are statistically significant?
Context here is that I'm trying to figure out if we can make the log rate analysis API real-time (i.e. instead of a double-digit seconds response). The goal here is to get statistically significant results within <=2.5s from the API, and I'm trying to figure out what options we have to get there, e.g. use a sampler agg and re-poll with a greater probability if the results are not significant (see the sketch below).
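For illustration, one way to express that idea is a `random_sampler` aggregation wrapping the `categorize_text` agg, re-running with a higher `probability` if the sampled categories look too sparse. This is only a sketch; the index name, field and probability values are placeholders:

```
GET /filebeat-*/_search?size=0
{
  "aggs": {
    "sampled": {
      "random_sampler": {
        "probability": 0.01,
        "seed": 42
      },
      "aggs": {
        "message_categories": {
          "categorize_text": {
            "field": "message"
          }
        }
      }
    }
  }
}
```

If the returned categories don't look statistically meaningful, the same request could be re-issued with e.g. `"probability": 0.1`.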
Pinging @elastic/ml-core (Team:ML)
@edsavage please could you investigate this one and see if there's some silly duplicated processing happening or processing that's supposed to be one-off singleton initialisation but isn't.
cc @arisonl
I've attempted to reproduce this locally (on my mac) in order to profile the aggregation (using async-profiler integrated with IntelliJ IDEA), and while I'm not seeing a response time anywhere near as long as 3s, the profiler does show up some areas that might be useful to investigate further.

The dataset I used was 472898 documents from one of the `filebeat-nginx` indices contained in the `filebeat` snapshot. I ran a query containing multiple `categorize_text` aggregations on low cardinality fields:
```
GET /filebeat-nginx-elasticco-2017.02.01/_search?filter_path=aggregations
{
  "aggs": {
    "categories1": {
      "categorize_text": {
        "field": "nginx.access.url"
      }
    },
    "categories2": {
      "categorize_text": {
        "field": "nginx.access.user_agent.os_name"
      }
    },
    "categories3": {
      "categorize_text": {
        "field": "nginx.access.user_agent.name"
      }
    }
  }
}
```
Viewing the CPU Samples profiler output:

[CPU Samples profiler screenshot]

shows that the `incrementToken` stack is taking up ~66% of the time of its parent `computeCategory`. Similarly for the Memory Allocations:

[Memory Allocations profiler screenshot]
While these observations haven't provided conclusive evidence of a smoking gun, they perhaps do indicate areas of the code worthy of closer inspection.
@edsavage do you have benchmarks, so I can have an understanding of how fast it is supposed to be?
@dgieselaar Sorry, no I haven't come across any benchmarks for the `categorize_text` aggregation.
Can I clarify a few points that may help me better understand the issue?
- What version of ES are you testing against?
- Is the cluster in cloud, or are you running locally?
- What are the specs of the cluster? Number of nodes, CPUs, memory allocation etc.
- Details of the data set you're using? I've found an index `.ds-filebeat-8.2.0-2022.06.07-000082` which I think is the one you've been using, but it contains 704246 docs, not 185k.
- The job configuration? I've been using variations on:
```
POST /.ds-filebeat-8.2.0-2022.06.07-000082/_async_search?size=0
{
  "aggs": {
    "categories.message": {
      "categorize_text": {
        "field": "message"
      }
    },
    "categories.process_executable_text": {
      "categorize_text": {
        "field": "process.executable"
      }
    },
    "categories.user_name_text": {
      "categorize_text": {
        "field": "user.name"
      }
    },
    "categories.host_os_name_text": {
      "categorize_text": {
        "field": "host.os.name"
      }
    }
  }
}
```
While I'm still digging into your questions regarding speed and performance, I believe I do have an answer to:

> is not returning any results expected?

In the case of `match_only_text` fields, yes this is expected as fields of this type only have limited support for aggregations, see https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#match-only-text-field-type
@edsavage the cluster unfortunately has gone down since. It was a cloud cluster with data from Eden. Note that I did not use async search, and I used a range query on `@timerange`.
re:

> In the case of match_only_text fields, yes this is expected as fields of this type only have limited support for aggregations, see - elastic.co/guide/en/elasticsearch/reference/current/text.html#match-only-text-field-type

Does that mean we should never run this on `match_only_text` fields? cc @walterra as I think the log rate analysis API does include `match_only_text` fields.
Good morning @dgieselaar, thanks for the extra info.

> Does that mean we should never run this on `match_only_text` fields? cc @walterra as I think the log rate analysis API does include `match_only_text` fields.

Yes, you should avoid running the `categorize_text` agg on `match_only_text` fields. This is because fields of this type are not tokenized, which is essential for the agg to work.
A correction is needed to my statement regarding `match_only_text` fields, @dgieselaar.

They are suitable for running the `categorize_text` agg on. However, they must be the primary field type when using multi-field types. For example, the mapping
"name": {
"type": "match_only_text",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 1024
}
}
}
would be suitable for running the `categorize_text` agg on the field `name`, but not in this case on `name.raw`. The reason is that searching for the literal primary field `name` in the `_source` doc is possible, but not `name.raw`.
And this leads to an explanation for the slowness of the aggregation, even when no results are returned. The aggregation performs a search in the full `_source` map for every document in the index, for every field name referred to in the aggregation config, which results in an effectively constant time overhead.
Without a fundamental change to the aggregation itself the best advice I can give is to reiterate the advice from https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-categorize-text-aggregation.html, i.e.
Re-analyzing large result sets will require a lot of time and memory. This aggregation should be used in conjunction with [Async search](https://www.elastic.co/guide/en/elasticsearch/reference/current/async-search.html). Additionally, you may consider using the aggregation as a child of either the [sampler](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-sampler-aggregation.html) or [diversified sampler](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-diversified-sampler-aggregation.html) aggregation. This will typically improve speed and memory use.
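Applied to this case, that advice could look something like the following sketch (the index name, field and `shard_size` are placeholders):

```
POST /my-logs-index/_async_search?size=0
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 1000
      },
      "aggs": {
        "message_categories": {
          "categorize_text": {
            "field": "message"
          }
        }
      }
    }
  }
}
```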
I hope that goes some way to explain the behaviour you've been experiencing. Please do let me know if I can be of any more help.
I had one more thought about this.
Does synthetic source change anything? If `categorize_text` unconditionally gets field values from `_source`, does that still work if synthetic source is enabled? If not, then it would probably be possible to have the aggregation intelligently look in doc values fields as an alternative, which might mean looking in a multi-field's doc value.
Isn’t there a plan to switch logs indices over to synthetic source on the grounds that the benefits outweigh the costs? But since being able to categorize logs is pretty important it’s worth checking the effect on categorization.
> Does synthetic source change anything?

Thanks for the advice, kind stranger.
The code in question is
https://github.com/elastic/elasticsearch/blob/eea8d1d65986e95a251a57c1c0f192aca160caca/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/CategorizeTextAggregator.java#L167
This exhibits the behaviour that it can retrieve the source for a "parent" multi-field, e.g. `host.os.name` (specified in `sourceFilter`), but returns null for a child multi-field, e.g. `host.os.name.text`. The question is then how `getSource` behaves when synthetic source is enabled, which is something I'll look into now.
Checking the effect of synthetic source on the `categorize_text` aggregation.
I reindexed `.ds-filebeat-8.2.0-2022.06.07-000082` as `synthetic-filebeat-8.2.0-2022.06.07-000082`, adding

```
"_source": {
  "mode": "synthetic"
}
```

to the mappings (a handful of fields required massaging to be compatible with the `synthetic` mode, in particular fields that were configured as `nested`).
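The reindexing steps would look roughly like this (a sketch; the destination index would also need the field mappings copied over from the source index, with the adjustments mentioned above):

```
PUT /synthetic-filebeat-8.2.0-2022.06.07-000082
{
  "mappings": {
    "_source": {
      "mode": "synthetic"
    }
  }
}

POST /_reindex
{
  "source": {
    "index": ".ds-filebeat-8.2.0-2022.06.07-000082"
  },
  "dest": {
    "index": "synthetic-filebeat-8.2.0-2022.06.07-000082"
  }
}
```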
There were no changes in behaviour related to the `categorize_text` agg. The aggregation continues to return valid results for primary multi-field values and, as before, fails to return results for nested multi-field values.
Many thanks for your investigations so far! I finally found the time to look into why we're choosing these fields for log rate analysis in the first place. We had some code in place that checks whether a field is mapped both as `keyword` and `text`, and in that case it should only pick `keyword`. It turns out there was a bug in that code that still added the text fields as suitable for analysis where it shouldn't. I have a PR up that should fix this: https://github.com/elastic/kibana/pull/179699
Some brief testing shows that an uncached analysis of the `pgbench` example dataset comes down from previously ~8s to ~1-2s once we no longer run these unnecessary queries!
The bug was that we had a check that, for example, correctly identified `myField` as the text variant when both `myField` and `myField.keyword` exist and kept `myField.keyword`, but we missed the check the other way around to pick `myField` instead of `myField.text`.
@walterra nice!
@dgieselaar , are you happy enough with the findings detailed here? Can this now be closed?
This SDH might be related to this issue: https://github.com/elastic/sdh-ml/issues/616

In the example they shared they are analysing roughly just 30k docs, but the `categorize_text` queries take 2+ seconds each. @edsavage Is there anything we can ask the customer to send us so you'd be able to investigate more closely?
> This SDH might be related to this issue: elastic/sdh-ml#616
> In the example they shared they are analysing roughly just 30k docs, but the `categorize_text` queries take 2+ seconds each. @edsavage Is there anything we can ask the customer to send us so you'd be able to investigate more closely?
From what I've seen of `categorize_text` behaviour (when investigating this issue), its performance is closely linked to the overall size of the `_source` doc and how many fields it has. Ideally, it'd be great to have access to their data set, but a few representative docs might do at a pinch.