categorize_text takes a long time on small data sets with low cardinality
I am running a `categorize_text` aggregation on a relatively small data set, and the response time is about 18s for 185k docs and three aggregations, so each aggregation processes about 30k docs/s. I briefly spoke to @droberts195 about this and we are not sure if this is expected. Each aggregation (after removing/adding) seems to take up a third of the response time.

In this case, the aggregations run over `process.executable.text`, `user.name.text` and `host.os.name.text`, which are `match_only_text` fields. Some observations:
- These are low cardinality fields - 1 or 2 values per field
- The aggregation returns no results
- Running the aggregation over the keyword siblings does return results, and takes up the same amount of time.
My questions:
- is 30k docs per second per shard expected?
- can we speed it up (significantly)?
- is not returning any results expected?
- is performance affected by the cardinality of the data set, or only the document count (documents with values)? what other factors come into play?
- how do we determine whether results are statistically significant?
Context here is that I'm trying to figure out if we can make the log rate analysis API real-time (i.e. instead of a double-digit seconds response). The goal here is to get statistically significant results within <=2.5s from the API, and I'm trying to figure out what options we have to get there, e.g. use a sampler agg and re-poll with a greater probability if the results are not significant (see the sketch below).
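For illustration, one way to express that idea is a `random_sampler` aggregation wrapping the `categorize_text` agg, re-running with a higher `probability` if the sampled categories look too sparse. This is only a sketch; the index name, field and probability values are placeholders:

```
GET /filebeat-*/_search?size=0
{
  "aggs": {
    "sampled": {
      "random_sampler": {
        "probability": 0.01,
        "seed": 42
      },
      "aggs": {
        "message_categories": {
          "categorize_text": {
            "field": "message"
          }
        }
      }
    }
  }
}
```

If the returned categories don't look statistically meaningful, the same request could be re-issued with e.g. `"probability": 0.1`.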
Pinging @elastic/ml-core (Team:ML)
@edsavage please could you investigate this one and see if there's some silly duplicated processing happening or processing that's supposed to be one-off singleton initialisation but isn't.
cc @arisonl
I've attempted to reproduce this locally (on my mac) in order to profile the aggregation (using async-profiler integrated with IntelliJ IDEA), and while I'm not seeing a response time anywhere near as long as 3s, the profiler does show up some areas that might be useful to investigate further.

The dataset I used was 472898 documents from one of the `filebeat-nginx` indices contained in the `filebeat` snapshot. I ran a query containing multiple `categorize_text` aggregations on low cardinality fields:
```
GET /filebeat-nginx-elasticco-2017.02.01/_search?filter_path=aggregations
{
  "aggs": {
    "categories1": {
      "categorize_text": {
        "field": "nginx.access.url"
      }
    },
    "categories2": {
      "categorize_text": {
        "field": "nginx.access.user_agent.os_name"
      }
    },
    "categories3": {
      "categorize_text": {
        "field": "nginx.access.user_agent.name"
      }
    }
  }
}
```
Viewing the CPU Samples profiler output:

[CPU Samples profiler screenshot]

shows that the `incrementToken` stack is taking up ~66% of the time of its parent `computeCategory`. Similarly for the Memory Allocations:

[Memory Allocations profiler screenshot]
While these observations haven't provided conclusive evidence of a smoking gun, they perhaps do indicate areas of the code worthy of closer inspection.
@edsavage do you have benchmarks, so I can have an understanding of how fast it is supposed to be?
@dgieselaar Sorry, no I haven't come across any benchmarks for the `categorize_text` aggregation.
Can I clarify a few points that may help me better understand the issue?
- What version of ES are you testing against?
- Is the cluster in cloud, or are you running locally?
- What are the specs of the cluster? Number of nodes, CPUs, memory allocation etc.
- Details of the data set you're using? I've found an index `.ds-filebeat-8.2.0-2022.06.07-000082` which I think is the one you've been using, but it contains 704246 docs, not 185k.
- The job configuration? I've been using variations on:
```
POST /.ds-filebeat-8.2.0-2022.06.07-000082/_async_search?size=0
{
  "aggs": {
    "categories.message": {
      "categorize_text": {
        "field": "message"
      }
    },
    "categories.process_executable_text": {
      "categorize_text": {
        "field": "process.executable"
      }
    },
    "categories.user_name_text": {
      "categorize_text": {
        "field": "user.name"
      }
    },
    "categories.host_os_name_text": {
      "categorize_text": {
        "field": "host.os.name"
      }
    }
  }
}
```
While I'm still digging into your questions regarding speed and performance, I believe I do have an answer to:

> is not returning any results expected?

In the case of `match_only_text` fields, yes this is expected as fields of this type only have limited support for aggregations, see https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#match-only-text-field-type
@edsavage the cluster unfortunately has gone down since. It was a cloud cluster with data from Eden. Note that I did not use async search, and I used a range query on `@timerange`.
re:

> In the case of match_only_text fields, yes this is expected as fields of this type only have limited support for aggregations, see - elastic.co/guide/en/elasticsearch/reference/current/text.html#match-only-text-field-type

Does that mean we should never run this on `match_only_text` fields? cc @walterra as I think the log rate analysis API does include `match_only_text` fields.
Good morning @dgieselaar, thanks for the extra info.

> Does that mean we should never run this on `match_only_text` fields? cc @walterra as I think the log rate analysis API does include `match_only_text` fields.

Yes, you should avoid running the `categorize_text` agg on `match_only_text` fields. This is because fields of this type are not tokenized, which is essential for the agg to work.
A correction is needed to my statement regarding `match_only_text` fields, @dgieselaar.

They are suitable for running the `categorize_text` agg on. However, they must be the primary field type when using multi-field types. For example, the mapping
"name": {
"type": "match_only_text",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 1024
}
}
}
would be suitable for running the `categorize_text` agg on the field `name`, but not in this case on `name.raw`. The reason is that searching for the literal primary field `name` in the `_source` doc is possible, but not `name.raw`.
And this leads to an explanation for the slowness of the aggregation, even when no results are returned. The aggregation performs a search in the full `_source` map for every document in the index, for every field name referred to in the aggregation config, which results in an effectively constant time overhead.
Without a fundamental change to the aggregation itself the best advice I can give is to reiterate the advice from https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-categorize-text-aggregation.html, i.e.
Re-analyzing large result sets will require a lot of time and memory. This aggregation should be used in conjunction with [Async search](https://www.elastic.co/guide/en/elasticsearch/reference/current/async-search.html). Additionally, you may consider using the aggregation as a child of either the [sampler](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-sampler-aggregation.html) or [diversified sampler](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-diversified-sampler-aggregation.html) aggregation. This will typically improve speed and memory use.
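Applied to this case, that advice could look something like the following sketch (the index name, field and `shard_size` are placeholders):

```
POST /my-logs-index/_async_search?size=0
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 1000
      },
      "aggs": {
        "message_categories": {
          "categorize_text": {
            "field": "message"
          }
        }
      }
    }
  }
}
```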
I hope that goes some way to explain the behaviour you've been experiencing. Please do let me know if I can be of any more help.
I had one more thought about this.
Does synthetic source change anything? If `categorize_text` unconditionally gets field values from `_source`, does that still work if synthetic source is enabled? If not, then it would probably be possible to have the aggregation intelligently look in doc values fields as an alternative, which might mean looking in a multi-field's doc value.
Isn’t there a plan to switch logs indices over to synthetic source on the grounds that the benefits outweigh the costs? But since being able to categorize logs is pretty important it’s worth checking the effect on categorization.
> Does synthetic source change anything?

Thanks for the advice, kind stranger.
The code in question is
https://github.com/elastic/elasticsearch/blob/eea8d1d65986e95a251a57c1c0f192aca160caca/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/CategorizeTextAggregator.java#L167
This exhibits the behaviour that it can retrieve the source for a "parent" multi-field, e.g. `host.os.name` (specified in `sourceFilter`), but returns null for a child multi-field, e.g. `host.os.name.text`. The question is then how `getSource` behaves when synthetic source is enabled, which is something I'll look into now.
Checking the effect of synthetic source on the `categorize_text` aggregation.
I reindexed `.ds-filebeat-8.2.0-2022.06.07-000082` as `synthetic-filebeat-8.2.0-2022.06.07-000082`, adding

```
"_source": {
  "mode": "synthetic"
}
```

to the mappings (a handful of fields required massaging to be compatible with the `synthetic` mode, in particular fields that were configured as `nested`).
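The reindexing steps would look roughly like this (a sketch; the destination index would also need the field mappings copied over from the source index, with the adjustments mentioned above):

```
PUT /synthetic-filebeat-8.2.0-2022.06.07-000082
{
  "mappings": {
    "_source": {
      "mode": "synthetic"
    }
  }
}

POST /_reindex
{
  "source": {
    "index": ".ds-filebeat-8.2.0-2022.06.07-000082"
  },
  "dest": {
    "index": "synthetic-filebeat-8.2.0-2022.06.07-000082"
  }
}
```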
There were no changes in behaviour related to the `categorize_text` agg. The aggregation continues to return valid results for primary multi-field values and, as before, fails to return results for nested multi-field values.
Many thanks for your investigations so far! I finally found the time to look into why we're choosing these fields for log rate analysis in the first place. We had some code in place that checks whether a field is mapped both as `keyword` and `text`, and in that case it should only pick `keyword`. It turns out there was a bug in that code that still added the text fields as suitable for analysis where it shouldn't. I have a PR up that should fix this: https://github.com/elastic/kibana/pull/179699
Some brief testing shows that an uncached analysis of the `pgbench` example dataset comes down from previously ~8s to ~1-2s once we no longer run these unnecessary queries!
The bug was that we had a check that, for example, correctly identified `myField` as the text variant when both `myField` and `myField.keyword` exist and kept `myField.keyword`, but we missed the check the other way around to pick `myField` instead of `myField.text`.
@walterra nice!
@dgieselaar , are you happy enough with the findings detailed here? Can this now be closed?
This SDH might be related to this issue: https://github.com/elastic/sdh-ml/issues/616

In the example they shared they are analysing roughly just 30k docs, but the `categorize_text` queries take 2+ seconds each. @edsavage Is there anything we can ask the customer to send us so you'd be able to investigate more closely?
> This SDH might be related to this issue: elastic/sdh-ml#616
> In the example they shared they are analysing roughly just 30k docs, but the `categorize_text` queries take 2+ seconds each. @edsavage Is there anything we can ask the customer to send us so you'd be able to investigate more closely?
From what I've seen of `categorize_text` behaviour (when investigating this issue), its performance is closely linked to the overall size of the `_source` doc and how many fields it has. Ideally, it'd be great to have access to their data set, but a few representative docs might do at a pinch.