How to improve the throughput of the conductor?
Details
Conductor version: 3.11.0
Persistence implementation: redis
Queue implementation: redis
Lock: Redis
Workflow definition:
Task definition:
Event handler definition:
To Reproduce
Steps to reproduce the behavior:
Conditions:
Conductor cluster: 3 instances, each with 8 cores and 16G.
Worker service: 1 cluster (2 instances, each with 8 cores and 8G).
In a load test, a single workflow performs well at 10 QPS and takes about 1 s to run. At 20 QPS, however, it takes more than 1 minute, and at 30 QPS it takes more than 2 minutes. Is there a recommended solution or design for this? Thanks in advance!
I found in the pressure test that the delay between the completion of one task and the start of the next task is particularly large. How can I optimize it?
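To narrow down where that delay comes from, it may help to split it into two parts for each task: the time until the server schedules the next task, and the time the task then waits in the queue before a worker picks it up. The sketch below is only an illustration. It assumes each task record in the workflow execution exposes `referenceTaskName`, `scheduledTime`, `startTime` and `endTime` as epoch milliseconds, that tasks run one after another, and the sample values are made up.

```python
from typing import Dict, List, Optional


def task_gaps(tasks: List[Dict]) -> List[Dict]:
    """Split the delay before each task into a scheduling part and a queue-wait part.

    Assumes sequential tasks with epoch-millisecond timestamps (an assumption,
    not something confirmed by the issue).
    """
    ordered = sorted(tasks, key=lambda t: t["scheduledTime"])
    gaps = []
    prev_end: Optional[int] = None
    for t in ordered:
        gaps.append({
            "task": t["referenceTaskName"],
            # Previous task finished -> this task scheduled (server side).
            "schedule_delay_ms": None if prev_end is None else t["scheduledTime"] - prev_end,
            # Task scheduled -> a worker started it (queue wait).
            "queue_wait_ms": t["startTime"] - t["scheduledTime"],
        })
        prev_end = t["endTime"]
    return gaps


# Hypothetical sample data, not taken from the load test above.
sample = [
    {"referenceTaskName": "step_1", "scheduledTime": 1000, "startTime": 1020, "endTime": 1200},
    {"referenceTaskName": "step_2", "scheduledTime": 1250, "startTime": 9250, "endTime": 9400},
]

for gap in task_gaps(sample):
    print(gap)
```

If `schedule_delay_ms` dominates, the server side (deciding, locking, sweeping) is the likelier place to look. If `queue_wait_ms` dominates, the workers are probably not polling fast enough or do not have enough threads.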
Expected behavior
How can latency be reduced under high throughput (for example, starting 100 workflows per second)?
Screenshots
The following is my configuration:
`spring.application.name=conductor
springdoc.api-docs.path=/api-docs

conductor.db.type=redis_standalone
conductor.redis.hosts=10.00.00.00:3380:us-east-1c:666666
conductor.redis.dataCenterRegion=us-east-1
conductor.redis.availabilityZone=us-east-1c
conductor.redis.maxConnectionsPerHost=1000
conductor.redis.taskDefCacheRefreshInterval = 1
#conductor.redis.eventExecutionPersistenceTTL
conductor.redis.workflowNamespacePrefix=conductor-stable
conductor.redis.queueNamespacePrefix=conductor-queue-stable
conductor.redis.queuesNonQuorumPort=22122
queues.dynomite.threads=20

conductor.indexing.enabled=true

conductor.elasticsearch.url=10.00.00.00:9260
conductor.elasticsearch.indexName=conductor-stable
conductor.elasticsearch.version=7
conductor.elasticsearch.asyncWorkerQueueSize=100
conductor.elasticsearch.asyncMaxPoolSize=12
conductor.elasticsearch.asyncBufferFlushTimeout=10
conductor.elasticsearch.username=es
conductor.elasticsearch.password=tsp1024
conductor.default-event-queue.type=conductor

conductor.app.workflowExecutionLockEnabled=true
conductor.workflow-execution-lock.type=redis
conductor.app.lockLeaseTime=60000
conductor.app.lockTimeToTry=50
conductor.redis-lock.serverType=single
conductor.redis-lock.serverAddress=redis://10.00.00.00:3380
conductor.redis-lock.serverPassword=abcabcabc88999
conductor.redis-lock.namespace=conductor_stable
logging.config=classpath:log4j2.xml

conductor.workflow-monitor.enabled=true
#Disable default
conductor.workflow-reconciler.enabled=false
conductor.workflow-repair-service.enabled=false

conductor.app.systemTaskWorkerPollInterval=1
conductor.app.systemTaskMaxPollCount=10
conductor.app.systemTaskWorkerThreadCount=10

conductor.app.maxTaskOutputPayloadSizeThreshold=102400
conductor.app.maxTaskInputPayloadSizeThreshold=102400
conductor.app.taskOutputPayloadSizeThreshold=102400
conductor.app.taskInputPayloadSizeThreshold=

conductor.app.sweeperThreadCount=10
conductor.sweep-frequency.millis=1`
> I found in the pressure test that when the qps of a single workflow is 10, it performs well and takes about 1s to run. However, at 20, it took more than 1 minute; when the qps is 30, it takes more than 2 minutes.
I presume you are referring to the `rcc_getDeviceNetworkStatus` task. If it's a SIMPLE task, please check whether you are increasing the worker threads (or instances) as you increase the number of workflows. If it's a system task, consider tuning the values of `conductor.app.systemTaskMaxPollCount` and `conductor.app.systemTaskWorkerThreadCount`.
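As a rough way to judge whether the thread count keeps up with the workflow rate, Little's law gives the average number of tasks in flight as the arrival rate multiplied by the time each task takes. The sketch below is a back-of-the-envelope estimate, not a description of how Conductor sizes its pools. The values for `tasks_per_workflow`, `avg_task_ms` and `available_threads` are hypothetical and would need to be replaced with measured numbers.

```python
import math


def required_threads(workflows_per_sec: int, tasks_per_workflow: int, avg_task_ms: int) -> int:
    """Estimate the concurrent threads needed to keep up (Little's law).

    tasks in flight = task arrival rate (per second) * time per task (seconds)
    """
    tasks_per_sec = workflows_per_sec * tasks_per_workflow
    return math.ceil(tasks_per_sec * avg_task_ms / 1000)


# Hypothetical numbers for illustration only.
tasks_per_workflow = 5
avg_task_ms = 200
available_threads = 10  # e.g. total worker threads across all instances

for qps in (10, 20, 30, 100):
    needed = required_threads(qps, tasks_per_workflow, avg_task_ms)
    status = "ok" if needed <= available_threads else "backlog grows"
    print(f"{qps} workflows/s -> about {needed} threads needed ({status})")
```

Once the required figure exceeds the threads actually available, tasks queue faster than they are processed and the wait keeps growing for as long as the load lasts. That would be consistent with a jump from about a second to minutes, although the real cause still needs to be confirmed from measurements.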