
es7-persistence creates an unlimited number of indices

Open astelmashenko opened this issue 1 year ago • 6 comments

Describe the bug After a certain amount of time, a shard-count problem appeared. We checked ES and found around 2300 shards, with index names like conductor_task_log_20221004.

Investigating the code, I see:

// class ElasticSearchRestDAOV7
    private void createIndexesTemplates() {
        try {
            initIndexesTemplates();
            updateIndexesNames();
            Executors.newScheduledThreadPool(1).scheduleAtFixedRate(this::updateIndexesNames, 0, 1, TimeUnit.HOURS);
        } catch (Exception e) {
            logger.error("Error creating index templates!", e);
        }
    }
//...
    private void updateIndexesNames() {
        logIndexName = updateIndexName(LOG_DOC_TYPE);
        eventIndexName = updateIndexName(EVENT_DOC_TYPE);
        messageIndexName = updateIndexName(MSG_DOC_TYPE);
    }

    private String updateIndexName(String type) {
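        // the date-based suffix computed below changes over time, so this periodically creates a brand-new index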
        String indexName =
                this.indexPrefix + "_" + type + "_" + SIMPLE_DATE_FORMAT.format(new Date());
        try {
            addIndex(indexName);
            return indexName;
        } catch (IOException e) {
            logger.error("Failed to update log index name: {}", indexName, e);
            throw new NonTransientException(e.getMessage(), e);
        }
    }

Here updateIndexesNames creates a new index every week. Can someone explain why the index names are rotated for these patterns?

*_conductor_task_log*
*_conductor_message*
*_conductor_event*
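A minimal sketch of how that date-based suffix behaves. The yyyyMMWW (year + month + week-of-month) pattern is an assumption inferred from the observed index names, not confirmed against the source:

import java.text.SimpleDateFormat;
import java.util.Calendar;

public class IndexSuffixDemo {
    public static void main(String[] args) {
        // assumed pattern: year + month + week-of-month, e.g. 20250804
        SimpleDateFormat suffixFormat = new SimpleDateFormat("yyyyMMWW");

        Calendar cal = Calendar.getInstance();
        System.out.println("conductor_task_log_" + suffixFormat.format(cal.getTime()));

        // one week later the suffix has rolled over, so updateIndexName() would create a new index
        cal.add(Calendar.DAY_OF_MONTH, 7);
        System.out.println("conductor_task_log_" + suffixFormat.format(cal.getTime()));
    }
}

Since the suffix only changes about once a week, the hourly scheduler mostly re-creates the same index, but over months the per-type indices (and their shards) keep piling up.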

I'd like to change this behavior, because we are reaching the shard limit.

I also checked that the methods below are not used anywhere:

IndexDAO.getEventExecutions
IndexDAO.getMessages

so it is probably safe to stop indexing them:

conductor.app.eventMessageIndexingEnabled=false
conductor.app.eventExecutionIndexingEnabled=false

However, to stop creating new indices, we need to change this config:

conductor.elasticsearch.autoIndexManagementEnabled=false

and provision the indices for Conductor manually.
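As an illustration of that manual provisioning (not the project's official procedure): a sketch using the ES 7 low-level REST client that es7-persistence already depends on. The template name, shard/replica counts, and host are placeholders, and the real Conductor templates also define field mappings, which are omitted here:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ProvisionConductorIndices {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // legacy template API (ES 7.x): any index matching the pattern
            // gets a controlled shard/replica count instead of the defaults
            Request template = new Request("PUT", "/_template/conductor_task_log_template");
            template.setJsonEntity("{\"index_patterns\": [\"conductor_task_log*\"], "
                    + "\"settings\": {\"number_of_shards\": 1, \"number_of_replicas\": 1}}");
            client.performRequest(template);

            // optionally create an index up front; the exact name Conductor writes to depends on your setup
            client.performRequest(new Request("PUT", "/conductor_task_log"));
        }
    }
}

With autoIndexManagementEnabled=false, rotating or cleaning up these indices is then entirely up to the operator.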

Details

  • Conductor version: 3.18.0+
  • Persistence implementation: Postgres
  • Queue implementation: Postgres
  • Lock: Redis

To Reproduce Steps to reproduce the behavior: just use Elasticsearch for a while.

Expected behavior Data is cleaned up periodically.

astelmashenko avatar Aug 01 '24 12:08 astelmashenko

I'm also studying the housekeeping mechanism now. I think it's a good idea to set up Elasticsearch ILM on the index patterns; it's not implemented in Conductor yet, though, so you have to configure it manually in Elasticsearch.
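For anyone configuring that manually today, a minimal sketch of a delete-only ILM policy attached to the weekly task-log indices. This assumes ES 7.x with ILM available; the policy name, 30-day retention, and index pattern are placeholders, not anything Conductor ships:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ConductorIlmSetup {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // delete-only policy: drop an index 30 days after creation (placeholder retention)
            Request policy = new Request("PUT", "/_ilm/policy/conductor_task_log_cleanup");
            policy.setJsonEntity("{\"policy\": {\"phases\": {\"delete\": "
                    + "{\"min_age\": \"30d\", \"actions\": {\"delete\": {}}}}}}");
            client.performRequest(policy);

            // attach the policy to future weekly indices via a template on the pattern
            Request template = new Request("PUT", "/_template/conductor_task_log_ilm");
            template.setJsonEntity("{\"index_patterns\": [\"conductor_task_log_*\"], "
                    + "\"settings\": {\"index.lifecycle.name\": \"conductor_task_log_cleanup\"}}");
            client.performRequest(template);
        }
    }
}

Existing indices would still need the policy applied once by hand (or simply be deleted), since templates only affect newly created indices.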

hallo1144 avatar Aug 15 '24 02:08 hallo1144

👋 Hi @astelmashenko

We're currently reviewing open issues in the Conductor OSS backlog, and noticed that this issue hasn't been addressed.

To help us keep the backlog focused and actionable, we’d love your input:

  • Is this issue still relevant?
  • Has the problem been resolved in the latest version v3.21.12?
  • Do you have any additional context or updates to provide?

If we don’t hear back in the next 14 days, we’ll assume this issue is no longer active and will close it for housekeeping. Of course, if it's still a valid issue, just let us know and we’ll keep it open!

Thanks for contributing to Conductor OSS! We appreciate your support. 🙌

Jeff Bull

Developer Community Manager | Orkes

DM on Conductor Slack Email me!

jeffbulltech avatar Feb 27 '25 01:02 jeffbulltech

@jeffbulltech, yes, the issue is still relevant for any version; I checked, and the codebase of es7-persistence is still the same.

astelmashenko avatar Feb 27 '25 07:02 astelmashenko


Thanks for getting back to me @astelmashenko I'll make sure this issue remains open so it can be reviewed for an upcoming release.

jeffbulltech avatar Feb 27 '25 17:02 jeffbulltech

Yes, this issue is still relevant.

We are now also running into ES7 shard limits, where it even crashes our prod system...

4710 [main] ERROR com.netflix.conductor.es7.dao.index.ElasticSearchRestDAOV7 [] - Error creating index templates!

com.netflix.conductor.core.exception.NonTransientException: method [PUT], host [http://es:9200], URI [/conductor_task_log_20250804], status line [HTTP/1.1 400 Bad Request]

error={"root_cause":[{"type":"validation_exception","reason":"Validation Failed: 1: this action would add [10] shards, but this cluster currently has [997]/[1000] maximum normal shards open;"}],"type":"validation_exception","reason":"Validation Failed: 1: this action would add [10] shards, but this cluster currently has [997]/[1000] maximum normal shards open;"} status=400

hexxone avatar Aug 18 '25 12:08 hexxone

When I investigated this via ES:9200/_cluster/allocation/explain,

I could see the following reason: 'A copy of this shard is already allocated to this node: [conductor_event_20250804][1], [node XXXX], [P], [s STARTED], [a id=XXXX]'.

By default, Conductor has indexShardCount = 5 and indexReplicasCount = 1. However, when running ES as a single node, the replica copies can never be allocated, so I assume they end up as 'unassigned' shards that still count toward the limit, which fills up the node very quickly.

I know it's best practice to have multiple ES nodes, but ensuring that the workflow-execution search is "accurate" has not been a high priority for us, especially on testing/dev systems, which get "reset" very often anyway, so the problem doesn't occur there.

For now, I am therefore using these settings for a single-node ES (a sketch for cleaning up already-created indices follows the list):

  • conductor.elasticsearch.indexShardCount=2
  • conductor.elasticsearch.index.replicas.count=0
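For indices that were already created with replicas, the replica copies can also be dropped after the fact so the unassigned shards stop counting toward the limit. A sketch using the standard ES 7 settings API (host and wildcard are placeholders; this is not something Conductor does for you):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class DropConductorReplicas {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // on a single-node cluster replica shards can never be allocated,
            // so set number_of_replicas to 0 on every existing conductor index
            Request settings = new Request("PUT", "/conductor_*/_settings");
            settings.setJsonEntity("{\"index\": {\"number_of_replicas\": 0}}");
            client.performRequest(settings);
        }
    }
}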

It was hard to find these settings because they are not mentioned in the Conductor documentation:

  • https://conductor-oss.github.io/conductor/documentation/configuration/appconf.html#example-usage

Taken directly from here:

  • https://github.com/conductor-oss/conductor/blob/main/es7-persistence/src/main/java/com/netflix/conductor/es7/config/ElasticSearchProperties.java

hexxone avatar Aug 21 '25 11:08 hexxone