[FEATURE] Allow Rollup jobs to handle huge index in chunks

Open · DucQuach opened this issue 1 year ago · 11 comments

Is your feature request related to a problem? Every day I have an index with ~135 GB of primary shards (automatically rolled over) that I need to roll up, but the jobs keep failing with "heap usage exceeded" and "elapsed time exceeded" because the data set is too big. The rollup job has 1 sum aggregation field and 6 dimensions, and each dimension has many distinct values (10k-200k, e.g. IP address, hostname). I have tried multiple values for page_size and the datetime interval, but none of them worked. Rollup jobs currently process the whole index in one go.

What solution would you like? I want a configuration that lets me divide the data into smaller chunks (e.g. by datetime or by size). For example, the rollup job would query the data in a 30-minute window, calculate sum/avg/min/max, and move on to the next 30-minute window.

What alternatives have you considered? None

Do you have any additional context? When I enable debug logs, this is the query the rollup job uses:

{
    "size": 0,
    "query": {
        "match_all": {
            "boost": 1.0
        }
    },
    "track_total_hits": -1,
    "aggregations": {
        "test": {
            "composite": {
                "size": 1,
                "sources": [
                    {
                        "datetime.date_histogram": {
                            "date_histogram": {
                                "field": "datetime",
                                "missing_bucket": true,
                                "order": "asc",
                                "fixed_interval": "30m",
                                "time_zone": "UTC"
                            }
                        }
                    },
                    {
                        "clientip.terms": {
                            "terms": {
                                "field": "clientip",
                                "missing_bucket": true,
                                "order": "asc"
                            }
                        }
                    }
                ]
            },
            "aggregations": {
                "transactionsize.sum": {
                    "sum": {
                        "field": "transactionsize"
                    }
                }
            }
        }
    }
}

DucQuach avatar Sep 05 '23 17:09 DucQuach
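
A minimal sketch of the requested chunking, assuming the query above: the same composite aggregation restricted to a single 30-minute window with a range filter in place of match_all, where a client would advance the window and repeat (the index name and timestamps are illustrative):

GET source-index-000001/_search
{
    "size": 0,
    "query": {
        "range": {
            "datetime": {
                "gte": "2023-09-05T00:00:00Z",
                "lt": "2023-09-05T00:30:00Z"
            }
        }
    },
    "aggregations": {
        "test": {
            "composite": {
                "size": 1000,
                "sources": [
                    {
                        "datetime.date_histogram": {
                            "date_histogram": {
                                "field": "datetime",
                                "fixed_interval": "30m",
                                "time_zone": "UTC"
                            }
                        }
                    },
                    {
                        "clientip.terms": {
                            "terms": { "field": "clientip" }
                        }
                    }
                ]
            },
            "aggregations": {
                "transactionsize.sum": {
                    "sum": { "field": "transactionsize" }
                }
            }
        }
    }
}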

Do you want to try continuous mode rollup and set the running schedule to 30m?

bowenlan-amzn avatar Sep 15 '23 06:09 bowenlan-amzn
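
A sketch of that suggestion against the rollup API, reusing the field names from the query above (the job ID, index names, page_size, and start_time are illustrative):

PUT _plugins/_rollup/jobs/example-continuous-rollup
{
    "rollup": {
        "enabled": true,
        "continuous": true,
        "schedule": {
            "interval": {
                "period": 30,
                "unit": "Minutes",
                "start_time": 1694764800
            }
        },
        "description": "Continuous rollup that runs every 30 minutes",
        "source_index": "source-index-*",
        "target_index": "example-rollup-target",
        "page_size": 1000,
        "delay": 0,
        "dimensions": [
            {
                "date_histogram": {
                    "source_field": "datetime",
                    "fixed_interval": "30m",
                    "timezone": "UTC"
                }
            },
            {
                "terms": {
                    "source_field": "clientip"
                }
            }
        ],
        "metrics": [
            {
                "source_field": "transactionsize",
                "metrics": [
                    { "sum": {} },
                    { "avg": {} },
                    { "min": {} },
                    { "max": {} }
                ]
            }
        ]
    }
}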

Do you want to try continuous mode rollup and set the running schedule to 30m?

I thought about that. I have a large amount of data each day, so that means I would have to roll over the rollup indices as well. Unfortunately, when you set up a rollup job, if you use a wildcard in the source index, you can't use a dynamic target (e.g. rollup_{{ctx.source_index}}). And because it's a continuous rollup job, you can't roll over the target index.

DucQuach avatar Sep 15 '23 11:09 DucQuach

I remember there's an example in the documentation that seems to help here. Please let us know whether it works.

bowenlan-amzn avatar Sep 19 '23 03:09 bowenlan-amzn

I remember there's an example in the documentation that seems to help here. Please let us know whether it works.

Yes, I have tried this, and that was the original problem: the index was too big by the time it rolled over and rolled up, causing the job to always fail (heap memory, execution time).

DucQuach avatar Sep 19 '23 11:09 DucQuach

Got it, the rollup created by the ISM action is fixed to be non-continuous: https://github.com/opensearch-project/index-management/blob/a2dd769f8ca62b45d89b8ba6e0d32770eeebf1ae/src/main/kotlin/org/opensearch/indexmanagement/rollup/model/ISMRollup.kt#L80

If this data class ISMRollup let users provide the continuous parameter, then I think we could use ISM to create a continuous rollup first and roll over later. This seems to be a way forward.

@DucQuach do you want to check first whether you'd like to help contribute this?

bowenlan-amzn avatar Sep 19 '23 16:09 bowenlan-amzn
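
For reference, a rough sketch of how an ISM rollup action is shaped today; the proposal above would amount to accepting a continuous flag inside the ism_rollup block, which is not a supported parameter at the time of writing (the policy ID, index names, and field names are illustrative, and exact fields may differ):

PUT _plugins/_ism/policies/example-rollup-policy
{
    "policy": {
        "description": "Roll up a managed index (sketch)",
        "default_state": "rollup",
        "states": [
            {
                "name": "rollup",
                "actions": [
                    {
                        "rollup": {
                            "ism_rollup": {
                                "description": "Roll up into a target index",
                                "target_index": "example-rollup-target",
                                "page_size": 1000,
                                "dimensions": [
                                    {
                                        "date_histogram": {
                                            "source_field": "datetime",
                                            "fixed_interval": "30m",
                                            "timezone": "UTC"
                                        }
                                    }
                                ],
                                "metrics": [
                                    {
                                        "source_field": "transactionsize",
                                        "metrics": [ { "sum": {} } ]
                                    }
                                ]
                            }
                        }
                    }
                ],
                "transitions": []
            }
        ]
    }
}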

Got it, the rollup created by the ISM action is fixed to be non-continuous:

https://github.com/opensearch-project/index-management/blob/a2dd769f8ca62b45d89b8ba6e0d32770eeebf1ae/src/main/kotlin/org/opensearch/indexmanagement/rollup/model/ISMRollup.kt#L80

If this data class ISMRollup let users provide the continuous parameter, then I think we could use ISM to create a continuous rollup first and roll over later. This seems to be a way forward.

@DucQuach do you want to check first whether you'd like to help contribute this?

No, sorry. I’m just a user.

DucQuach avatar Sep 19 '23 16:09 DucQuach

We will plan this, thanks!

bowenlan-amzn avatar Sep 19 '23 17:09 bowenlan-amzn

@DucQuach I checked the two error messages you mentioned, "heap usage exceeded" and "elapsed time exceeded"; these seem to come from the search backpressure feature. Would you like to try tuning this feature's settings, or disabling it, and see if that works for you? https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/search-backpressure/#search-backpressure-settings

The rollup query doesn't seem too heavy to me, either.

bowenlan-amzn avatar Sep 30 '23 16:09 bowenlan-amzn
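
On a self-managed cluster, that tuning might start with a cluster settings update like the sketch below; search_backpressure.mode accepts monitor_only, enforced, or disabled, and the per-threshold settings linked above can be adjusted separately:

PUT _cluster/settings
{
    "persistent": {
        "search_backpressure.mode": "monitor_only"
    }
}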

I'm using the AWS managed OpenSearch Service, so it can't be configured. What is the size of your index? Can you try with an index size >= 135 GB? I'm trying to roll up the index by 5 fields, and each field can have up to 20-50k unique values.

Another way I could resolve this is to use continuous rollup jobs with the same target index, but then there is no support for deleting old data in the rollup index. By comparison, Elasticsearch supports _delete_by_query on a rollup index (they have had this feature since 6.x): https://www.elastic.co/guide/en/elasticsearch/reference/current/rollup-delete-job.html Is it possible for OpenSearch to support this feature?

DucQuach avatar Sep 30 '23 17:09 DucQuach
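
The retention cleanup being asked for would look something like the request below, in the style Elasticsearch documents for its rollup indices; per this thread it is not supported against OpenSearch rollup target indexes today (the index name, field name, and 30-day cutoff are illustrative):

POST example-rollup-target/_delete_by_query
{
    "query": {
        "range": {
            "datetime": {
                "lt": "now-30d/d"
            }
        }
    }
}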

I see. Would you reach out to AWS Support and ask them to cut a ticket to the service team regarding this issue?

bowenlan-amzn avatar Oct 01 '23 17:10 bowenlan-amzn

Alright, thanks!

DucQuach avatar Oct 01 '23 18:10 DucQuach