
[🐛 BUG]: Worker downscaling does not work correctly

Smolevich opened this issue on Dec 16, 2024 · 13 comments

No duplicates 🥲.

  • [X] I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

TEMPORAL_ROADRUNNER_MAX_WORKERS_COUNT=20
TEMPORAL_ROADRUNNER_WORKERS_COUNT=5

When starting the RoadRunner server with the configuration specified below, I observed an issue related to worker scaling:

  • Initially, the number of workers in the configuration file was set to 5.
  • The maximum number of workers allowed under load for Temporal Activities is 20.

During a heavy load scenario involving tens of thousands of workflows and hundreds of thousands of activities, the scaling correctly increased the worker count. The total number of Activity workers reached 28.

However, once the load dropped to zero (all activities were fully processed), the number of workers did not scale back down to the original value of 5.

Version (rr --version)

2024.3.0 (build time: 2024-12-05T18:39:32+0000, go1.23.4), OS: linux, arch: amd64

How to reproduce the issue?

version: "3"

rpc:
    listen: tcp://0.0.0.0:6001

server:
    command: "php ../bin/console app:temporal-worker-run"

temporal:
    namespace: ${TEMPORAL_NAMESPACE}
    address: ${TEMPORAL_ADDRESS}
    activities:
        max_jobs: 100
        allocate_timeout: 360s
        command: "php ../bin/console app:temporal-worker-run"
        num_workers: ${TEMPORAL_ROADRUNNER_WORKERS_COUNT}
        destroy_timeout: 1s
        dynamic_allocator:
          max_workers: ${TEMPORAL_ROADRUNNER_MAX_WORKERS_COUNT}
          spawn_rate: 5
          idle_timeout: 10s
    metrics:
        address: ${TEMPORAL_ADDRESS_METRICS}
        prefix: "mars"
        type: "summary"

metrics:
    address: ${RR_ADDRESS_METRICS}

logs:
    mode: production
    level: debug
    output: stderr

otel:
  resource:
    service_name: "roadrunner"
    service_version: "1.0.0"
    service_namespace: "${OTEL_RESOURCE_NAMESPACE}"
    service_instance_id: "${HOSTNAME}"
  exporter: otlp
  endpoint: "${OTEL_EXPORTER_OTLP_ENDPOINT}"
  headers:
    api-key: "${NEW_RELIC_API_KEY}"

Steps to Reproduce

1. Configure RoadRunner with an initial worker count of 5 and a maximum scaling limit of 20.
2. Create a single Workflow with at least one Activity inside it.
3. Submit approximately 10,000 Workflows with the corresponding Activities to Temporal (a load-generation sketch follows this list).
4. Observe that the worker count scales up under the load (e.g., reaching 28 workers).
5. Allow all Activities to be processed so that the load drops completely to zero.
6. Check the worker count after load reduction.
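
For anyone without the PHP application at hand, steps 3–5 can be driven from any Temporal client; below is a minimal sketch using the Temporal Go SDK. The address, namespace, task queue, and workflow type are placeholders and must match whatever the PHP worker actually registers.

// Load-generation sketch for steps 3-5: starts many workflows, then waits
// for all of them to finish so that the load drops to exactly zero.
// Address, namespace, task queue, and workflow type below are placeholders.
package main

import (
    "context"
    "fmt"
    "log"

    "go.temporal.io/sdk/client"
)

func main() {
    c, err := client.Dial(client.Options{
        HostPort:  "127.0.0.1:7233", // TEMPORAL_ADDRESS
        Namespace: "stage-10",       // TEMPORAL_NAMESPACE
    })
    if err != nil {
        log.Fatal(err)
    }
    defer c.Close()

    ctx := context.Background()
    runs := make([]client.WorkflowRun, 0, 10000)

    // Step 3: submit ~10,000 workflows, each running at least one activity.
    for i := 0; i < 10000; i++ {
        run, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
            ID:        fmt.Sprintf("downscale-repro-%d", i),
            TaskQueue: "default",
        }, "SomeWorkflowWithActivity") // hypothetical workflow type name
        if err != nil {
            log.Fatal(err)
        }
        runs = append(runs, run)
    }

    // Steps 4-5: wait until every workflow has completed, i.e. the load is zero.
    for _, run := range runs {
        if err := run.Get(ctx, nil); err != nil {
            log.Printf("workflow %s failed: %v", run.GetID(), err)
        }
    }
    log.Println("all workflows finished; the activity pool is now idle")
}

Once the program exits, waiting longer than the configured idle_timeout (10s) and then checking the pool for step 6 (for example via the informer, e.g. rr workers) is where the worker count stays above 5 instead of returning to it.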

Relevant log output

{"level":"info","ts":1734352476002676546,"logger":"temporal","msg":"Activity complete after timeout.","Namespace":"stage-10","TaskQueue":"default","WorkerID":"default:20f10cbd-3161-4b99-98da-390fc6e94208","WorkflowID":"grouping_auto_pso_new_bi_main_10e6e20e-03fe-45a5-aa40-cdea46e79880_20241216","RunID":"c4af45d9-149a-4028-a2e7-110e9a91e925","ActivityType":"grouping_candidates.isOperBrandPlanLimitExceeded","Attempt":1,"Result":"<nil>","Error":"activity_pool_execute_activity:\n\tstatic_pool_exec:\n\tallocate_dynamically: failed to reset the TTL listener"}
{"level":"info","ts":1734352476002767912,"logger":"temporal","msg":"Task processing failed with client side error","Namespace":"stage-10","TaskQueue":"default","WorkerID":"default:20f10cbd-3161-4b99-98da-390fc6e94208","WorkerType":"ActivityWorker","Error":"context deadline exceeded"}
{"level":"debug","ts":1734352476002978390,"logger":"server","msg":"No free workers, trying to allocate dynamically","idle_timeout":10,"max_workers":20,"spawn_rate":5}
{"level":"debug","ts":1734352476003002985,"logger":"server","msg":"dynamic allocator listener already started, trying to allocate worker immediately with 2s timeout"}
{"level":"debug","ts":1734352476010849713,"logger":"temporal","msg":"workflow task started","time":1}

Smolevich · Dec 16 '24, 16:12