Conductor Redis Sentinel configuration not queueing work

ZergRushJoe opened this issue 1 year ago • 4 comments

Discussed in https://github.com/Netflix/conductor/discussions/3061

Originally posted by ZergRushJoe on June 22, 2022: I have been trying to move our Conductor instance over to Redis Sentinel. My current config is:

# Servers.
conductor.grpc-server.enabled=false

# Database persistence type.
conductor.db.type=redis_sentinel
conductor.redis.hosts=conductor-redis-node-0.conductor-redis-headless.conductor-playground.svc.cluster.local:26379:cluster:**********;conductor-redis-node-1.conductor-redis-headless.conductor-playground.svc.cluster.local:26379:cluster;conductor-redis-node-2.conductor-redis-headless.conductor-playground.svc.cluster.local:26379:cluster
conductor.redis.clusterName=mymaster


# Namespace for the keys stored in Dynomite/Redis
conductor.redis.workflowNamespacePrefix=conductor

# Namespace prefix for the dyno queues
conductor.redis.queueNamespacePrefix=conductor_queues

# Hikari pool sizes are -1 by default and prevent startup
spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.minimum-idle=2

# Elasticsearch instance indexing is enabled.
conductor.indexing.enabled=true

# Transport address to elasticsearch
conductor.elasticsearch.url=http://conductor-elasticsearch.conductor-playground.svc.cluster.local:9200

# Name of the elasticsearch index
conductor.elasticsearch.indexName=conductor

# Yellow means the main elasticsearch cluster node is up and running
conductor.elasticsearch.clusterHealthColor=yellow
The connection string is formatted as in the conductor.redis.hosts line above: semicolon-separated entries of host:port:rack, with an optional password as a fourth field.
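
For illustration, a hypothetical expansion of that layout (host names and password here are made up, not the cluster above; the fourth field, when present, is treated as the password, as in the masked first entry):

# Hypothetical example: three sentinel endpoints in rack "us-east-1a";
# only the first entry carries the optional password field.
conductor.redis.hosts=sentinel-0.example.local:26379:us-east-1a:s3cret;sentinel-1.example.local:26379:us-east-1a;sentinel-2.example.local:26379:us-east-1a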

I have been getting this INFO log from Conductor itself:

554563 [pool-26-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.
614067 [pool-21-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.
614567 [pool-27-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.
614567 [pool-25-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.
614567 [pool-26-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.

The instance does not seem to be queueing work. System tasks just remain in progress forever. Does anyone have any ideas on what I'm doing wrong?

This is on Conductor version 3.10.3.

Debug logs:

292333 [scheduled-task-pool-2] DEBUG com.netflix.conductor.core.reconciliation.WorkflowReconciler [] - Sweeper processed  from the decider queue
292341 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:HTTP, got 0 tasks
292341 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:SUB_WORKFLOW, got 0 tasks
292341 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:START_WORKFLOW, got 0 tasks
292391 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: HTTP with 1 slots acquired
292391 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: SUB_WORKFLOW with 1 slots acquired
292391 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: START_WORKFLOW with 1 slots acquired
292591 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:HTTP, got 0 tasks
292592 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:SUB_WORKFLOW, got 0 tasks
292592 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:START_WORKFLOW, got 0 tasks
292642 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: HTTP with 1 slots acquired
292642 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: SUB_WORKFLOW with 1 slots acquired
292642 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: START_WORKFLOW with 1 slots acquired
292842 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:SUB_WORKFLOW, got 0 tasks
292842 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:HTTP, got 0 tasks
292842 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:START_WORKFLOW, got 0 tasks
292892 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: SUB_WORKFLOW with 1 slots acquired
292893 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: HTTP with 1 slots acquired
292893 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: START_WORKFLOW with 1 slots acquired
293093 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:SUB_WORKFLOW, got 0 tasks
293093 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:START_WORKFLOW, got 0 tasks
293094 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:HTTP, got 0 tasks
293144 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: SUB_WORKFLOW with 1 slots acquired
293144 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: START_WORKFLOW with 1 slots acquired
293144 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: HTTP with 1 slots acquired
293344 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:SUB_WORKFLOW, got 0 tasks
293344 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:HTTP, got 0 tasks
293344 [pool-17-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue:START_WORKFLOW, got 0 tasks
293394 [pool-18-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: HTTP with 1 slots acquired
293394 [pool-16-thread-1] DEBUG com.netflix.conductor.core.execution.tasks.SystemTaskWorker [] - Polling queue: SUB_WORKFLOW with 1 slots acquired

ZergRushJoe • Aug 10 '22 11:08

I added Redis locking to see if that was the problem, but got a new error:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.fasterxml.jackson.module.afterburner.util.MyClassLoader (jar:file:/app/libs/conductor-server-3.11.0-SNAPSHOT-boot.jar!/BOOT-INF/lib/jackson-module-afterburner-2.13.3.jar!/) to method java.lang.ClassLoader.findLoadedClass(java.lang.String)
WARNING: Please consider reporting this to the maintainers of com.fasterxml.jackson.module.afterburner.util.MyClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
72045 [pool-21-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.
72515 [pool-25-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.
72515 [pool-26-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.
72515 [pool-27-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.
92957 [pool-29-thread-1] INFO  com.netflix.dyno.queues.redis.RedisDynoQueue [] - processUnacks() will NOT be atomic.

Settings:

    # Redis cluster settings for locking module
    conductor.redis-lock.serverType=sentinel
    # Comma-separated list of server nodes
    conductor.redis-lock.serverAddress={{ $redisLockConnectionString }}
    # Redis sentinel master name
    conductor.redis-lock.serverMasterName=mymaster
    conductor.redis-lock.namespace=conductor_locks
    conductor.redis-lock.serverPassword={{ $redisPassword }}
    # Namespace for the keys stored in Dynomite/Redis
    conductor.redis.workflowNamespacePrefix=conductor
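
For reference, a hypothetical expansion of the templated values above (the Redisson-backed lock module typically takes redis:// URIs; host names and password are made up):

    # Hypothetical expansion of the template values above
    conductor.redis-lock.serverAddress=redis://sentinel-0.example.local:26379,redis://sentinel-1.example.local:26379,redis://sentinel-2.example.local:26379
    conductor.redis-lock.serverPassword=s3cret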

ZergRushJoe • Aug 10 '22 18:08

@ZergRushJoe I don't see any issues with the setup, and your logs do not point to any errors either. The system task poller seems to be polling the queues actively but does not appear to dequeue any tasks. Would you be able to log in to your Redis instance and check whether the messages are being populated? Additionally, could you also post your workflow definition?
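
For anyone following that suggestion, a minimal redis-cli sketch for the check (assuming the usual dyno-queues layout, where QUEUE.* keys are sorted sets of pending message IDs and MESSAGE.* keys are hashes of payloads; the host and key names below are placeholders, so mirror whatever KEYS actually returns):

# Hypothetical host and key names; copy the exact names that KEYS returns.
redis-cli -h my-redis KEYS 'conductor_queues.*'
redis-cli -h my-redis ZCARD 'conductor_queues.test.QUEUE.HTTP.c'   # messages visible to pollers
redis-cli -h my-redis HLEN 'conductor_queues.test.MESSAGE.HTTP'    # message payloads stored

If the MESSAGE hash fills up while the QUEUE sorted set stays empty (or lives under a different shard suffix than the one being polled), messages are being stored but never surfaced to the pollers.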

apanicker-nflx • Aug 15 '22 22:08

I am experiencing the exact same behavior on redis_standalone. I checked the Redis keys and I can see them, but polling finds nothing:

10.197.33.28:6379> KEYS *

  1. "workflows.production.general.WORKFLOW_DEF.send-batch-messages"
  2. "workflows.production.general.CORR_ID_TO_WORKFLOWS.63a8dc39-d460-4534-88f2-168e51b9d8f7-0"
  3. "workflows.production.general.SCHEDULED_TASKS.838cdf35-ff3a-4eb6-b6c3-83915c1d8ab7"
  4. "workflows.production.general.WORKFLOW_DEF_TO_WORKFLOWS.send-batch-messages.20220821"
  5. "workflows.production.general.WORKFLOW.838cdf35-ff3a-4eb6-b6c3-83915c1d8ab7"
  6. "workflows.production.general.WORKFLOW_TO_TASKS.838cdf35-ff3a-4eb6-b6c3-83915c1d8ab7"
  7. "workflows.production.general.TASK_DEFS"
  8. "queues.production.general.QUEUE._deciderQueue.1"
  9. "workflows.production.general.IN_PROGRESS_TASKS.clear-blacklisted-recipients"
  10. "queues.production.general.MESSAGE.HTTP"
  11. "workflows.production.general.TASK.eb64a922-b38e-46a1-b125-bde02abd1eaf"
  12. "workflows.production.general.PENDING_WORKFLOWS.send-batch-messages"
  13. "queues.production.general.QUEUE.HTTP.1"
  14. "queues.production.general.MESSAGE._deciderQueue"
  15. "workflows.production.general.WORKFLOW_DEF_NAMES"
  16. "workflows.production.general.WORKFLOW_DEF.finalize-with-retries-and-publish"

arielzadi • Aug 21 '22 21:08

Same in my Redis cluster. Work gets into Redis, it just never dequeues anything.

ZergRushJoe • Aug 23 '22 00:08

This issue is stale because it has been open for 45 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.

github-actions[bot] • Oct 08 '22 00:10

This issue was closed because it had been stalled for 7 days with no activity.

github-actions[bot] • Oct 15 '22 00:10