
[Core Feature]: Allow to configure agent status update frequency

Open thomas-maschler opened this issue 1 year ago • 8 comments

I have an asynchronous agent that schedules long-running jobs using a third-party service. That service has request quotas. Right now, agents request a status update every 5s. When running many long-running tasks concurrently, I frequently run into quota limits and my status updates fail.

I would like to configure the status update frequency for my agent in a similar way to how I can configure timeouts. https://docs.flyte.org/en/latest/flyte_agents/developing_agents.html#canary-deployment

thomas-maschler avatar Sep 09 '24 17:09 thomas-maschler

Thank you for opening your first issue here! 🛠

welcome[bot] avatar Sep 09 '24 17:09 welcome[bot]

You should be able to configure it by updating this config

pingsutw avatar Sep 09 '24 18:09 pingsutw

@thomas-maschler, as @pingsutw said, this is already supported in Agents today. Please use this to adjust. @pingsutw, maybe we should make it possible to adjust it from the AgentMetadata service (in the future)?

kumare3 avatar Sep 11 '24 04:09 kumare3

@pingsutw following up on this thread.

I tried updating our config and used webApi.caching.resyncInterval to change the update frequency. That didn't work and my agent still polled every ~10sec.

flyteagent:
  enabled: true
  plugin_config:
    plugins:
      agent-service:
        defaultAgent:
          endpoint: "dns:///flyteagent.flyte-core.svc.cluster.local:8000"
          insecure: true
        agents:
          my-custom-agent:
            endpoint: "dns:///my-custom-agent.my-namespace.svc.cluster.local:8000"
            insecure: true
            timeouts:
              CreateTask: 5s
              GetTask: 30s
              DeleteTask: 30s
            defaultTimeout: 30s
            # Only run the agent `get` method once a minute
            webApi:
              caching:
                resyncInterval: 60s

I also tried using pollInterval, without success:

flyteagent:
  enabled: true
  plugin_config:
    plugins:
      agent-service:
        defaultAgent:
          endpoint: "dns:///flyteagent.flyte-core.svc.cluster.local:8000"
          insecure: true
        agents:
          my-custom-agent:
            endpoint: "dns:///my-custom-agent.my-namespace.svc.cluster.local:8000"
            insecure: true
            timeouts:
              CreateTask: 5s
              GetTask: 30s
              DeleteTask: 30s
            defaultTimeout: 30s
            # Only run the agent `get` method once a minute
            pollInterval: 60s

Can you post an example of how to configure the canary agent correctly so that it polls less frequently?

thomas-maschler avatar Mar 04 '25 17:03 thomas-maschler

I think the problem with both configs is that they are one level too deep. Looking at the docs both should be set at the agent level like this:

flyteagent:
  enabled: true
  plugin_config:
    plugins:
      # -- Agent service configuration for propeller.
      agent-service:
        # -- The default agent service to use for plugin tasks.
        defaultAgent:
          # -- The agent service endpoint propeller should connect to.
          endpoint: "dns:///flyteagent.flyte-core.svc.cluster.local:8000"
          # -- Whether the connection from propeller to the agent service should use TLS.
          insecure: true
        # -- The task types supported by the default agent.
        # Only check for status every 2 minutes
        pollInterval: 120s
        webApi:
          caching:
            resyncInterval: 120s
        agents:

cpaulik avatar Mar 05 '25 08:03 cpaulik

The idea was to throttle only the canary deployments, not the rest.

I talked to someone on the Flyte Slack, and apparently both configs do something else. We would need to modify

            webApi:
              readRateLimiter:
                burst: 0
                qps: 0.01666666666

But floats are not yet supported. However, there is a PR.
https://flyte-org.slack.com/archives/C06SYN9QJ5N/p1741109569518379?thread_ts=1741109569.518379&cid=C06SYN9QJ5N

I am working around this for now by logging the last sync time inside my agent and only executing my code if enough time has passed.
https://hello.planet.com/code/planetary-variables/forests/ascilero/-/merge_requests/6
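The workaround described above (remember the last sync time and skip the real status call until enough time has passed) could look roughly like the sketch below. This is not the code from the linked merge request, and it does not use any flytekit API: `ThrottledStatusCache`, `fetch_status`, and `min_interval_s` are hypothetical names, and the sketch assumes the agent runs as a single long-lived process so that an in-memory dict survives between `get` calls.

```python
import time
from typing import Callable, Dict, Tuple


class ThrottledStatusCache:
    """Return a cached job status until min_interval_s has elapsed.

    Hypothetical helper: fetch_status stands in for the call to the
    quota-limited third-party service.
    """

    def __init__(
        self,
        fetch_status: Callable[[str], str],
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_status = fetch_status
        self._min_interval_s = min_interval_s
        self._clock = clock
        # job_id -> (time of last real sync, status returned by that sync)
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, job_id: str) -> str:
        now = self._clock()
        cached = self._cache.get(job_id)
        if cached is not None and now - cached[0] < self._min_interval_s:
            # Too soon since the last real sync: reuse the cached status.
            return cached[1]
        status = self._fetch_status(job_id)
        self._cache[job_id] = (now, status)
        return status

    def forget(self, job_id: str) -> None:
        """Drop the entry once the job reaches a terminal state."""
        self._cache.pop(job_id, None)
```

With this in place, the agent's `get` method would call `cache.get(job_id)` instead of hitting the third-party service directly. Propeller still polls at its own frequency, but only one call per job per `min_interval_s` counts against the quota.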


thomas-maschler avatar Mar 05 '25 12:03 thomas-maschler

@thomas-maschler I've been doing some work on this recently and have successfully been able to configure agent/connector update frequency using the webApi.caching.resyncInterval value.

My understanding (based on the discussion here and the code comments) is that:

  1. The resyncInterval affects how often to poll an agent's 'get' method. Tasks that are waiting on a status resolution are queued to an internal rate limiting queue at the resyncInterval frequency.
  2. The read/writeRateLimiter settings then affect how fast to process items from this queue. (if I've misunderstood, definitely let me know!)
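
As a rough back-of-the-envelope check of the two-stage model in the list above, the sketch below estimates how many `get` calls per second would reach the agent. It encodes only the understanding stated in this comment, not confirmed Flyte behavior; it ignores the `burst` setting, and the function name and parameters are hypothetical.

```python
from typing import Optional


def estimated_get_calls_per_second(
    num_tasks: int,
    resync_interval_s: float,
    read_qps: Optional[float] = None,
) -> float:
    """Estimate the steady-state rate of agent `get` calls.

    Assumed model: every running task is enqueued once per resync
    interval, and the read rate limiter caps how fast the queue drains.
    """
    if resync_interval_s <= 0:
        raise ValueError("resync_interval_s must be positive")
    enqueue_rate = num_tasks / resync_interval_s
    if read_qps is None:
        return enqueue_rate
    return min(enqueue_rate, read_qps)


# 100 concurrent tasks resynced every 5s vs. every 60s
print(estimated_get_calls_per_second(100, 5.0))
print(estimated_get_calls_per_second(100, 60.0))
# The qps of 0.01666666666 mentioned earlier is one call per minute
print(1 / 60)
```

Under these assumptions, raising resyncInterval lowers the request rate in proportion to the number of running tasks, while a read rate limiter puts a hard ceiling on it regardless of how many tasks are queued.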

Unfortunately, as discussed above, this setting only works at the hierarchy level where it applies to all agents. It would be extremely useful to be able to configure this per task type.

charliemoriarty avatar May 19 '25 07:05 charliemoriarty

#take

popojk avatar May 23 '25 01:05 popojk

Hello 👋, this issue has been inactive for over 90 days. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Aug 22 '25 00:08 github-actions[bot]

Hello 👋, this issue has been inactive for over 90 days and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions[bot] avatar Aug 30 '25 00:08 github-actions[bot]