[Core Feature]: Allow configuring the agent status update frequency
I have an asynchronous agent that schedules long-running jobs using a third-party service. That service has request quotas. Right now, agents request a status update every 5s. When running many long-running tasks concurrently, I frequently run into quota limits and my status updates fail.
I would like to configure the status update frequency for my agent in a similar way to how I can configure timeouts. https://docs.flyte.org/en/latest/flyte_agents/developing_agents.html#canary-deployment
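For illustration, this is roughly what I have in mind, following the per-agent timeouts pattern from the linked docs (the per-agent pollInterval key below is hypothetical; it does not exist at this level today):

flyteagent:
  enabled: true
  plugin_config:
    plugins:
      agent-service:
        agents:
          my-custom-agent:
            endpoint: "dns:///my-custom-agent.my-namespace.svc.cluster.local:8000"
            insecure: true
            timeouts:
              GetTask: 30s
            defaultTimeout: 30s
            # Hypothetical knob: only call this agent's `get` method once a minute
            pollInterval: 60s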
Thank you for opening your first issue here! 🛠
You should be able to configure it by updating this config
@thomas-maschler As @pingsutw said, this is already supported in Agents today. Please use this to adjust. @pingsutw, maybe we should make it possible to adjust it from the AgentMetadata service (in the future)?
@pingsutw following up on this thread.
I tried updating our config and used webApi.caching.resyncInterval to change the update frequency. That didn't work, and my agent still polled every ~10 seconds:
flyteagent:
  enabled: true
  plugin_config:
    plugins:
      agent-service:
        defaultAgent:
          endpoint: "dns:///flyteagent.flyte-core.svc.cluster.local:8000"
          insecure: true
        agents:
          my-custom-agent:
            endpoint: "dns:///my-custom-agent.my-namespace.svc.cluster.local:8000"
            insecure: true
            timeouts:
              CreateTask: 5s
              GetTask: 30s
              DeleteTask: 30s
            defaultTimeout: 30s
            # Only run the agent `get` method once a minute
            webApi:
              caching:
                resyncInterval: 60s
I also tried using pollInterval, without success:
flyteagent:
  enabled: true
  plugin_config:
    plugins:
      agent-service:
        defaultAgent:
          endpoint: "dns:///flyteagent.flyte-core.svc.cluster.local:8000"
          insecure: true
        agents:
          my-custom-agent:
            endpoint: "dns:///my-custom-agent.my-namespace.svc.cluster.local:8000"
            insecure: true
            timeouts:
              CreateTask: 5s
              GetTask: 30s
              DeleteTask: 30s
            defaultTimeout: 30s
            # Only run the agent `get` method once a minute
            pollInterval: 60s
Can you post an example of how to configure the canary agent correctly so that it polls less frequently?
I think the problem with both configs is that they are one level too deep. Looking at the docs, both should be set at the agent level, like this:
flyteagent:
  enabled: true
  plugin_config:
    plugins:
      # -- Agent service configuration for propeller.
      agent-service:
        # -- The default agent service to use for plugin tasks.
        defaultAgent:
          # -- The agent service endpoint propeller should connect to.
          endpoint: "dns:///flyteagent.flyte-core.svc.cluster.local:8000"
          # -- Whether the connection from propeller to the agent service should use TLS.
          insecure: true
        # -- The task types supported by the default agent.
        # Only check for status every 2 minutes
        pollInterval: 120s
        webApi:
          caching:
            resyncInterval: 120s
        agents:
The idea was to only throttle the canary deployments, not the rest. I talked to someone on the Flyte Slack, and apparently both configs do something else. We would need to modify webApi.readRateLimiter, setting burst: 0 and qps: 0.01666666666 (roughly one request per minute). But floats are not yet supported there; however, there is a PR. https://flyte-org.slack.com/archives/C06SYN9QJ5N/p1741109569518379?thread_ts=1741109569.518379&cid=C06SYN9QJ5N

I am working around this for now by logging the last sync time inside my agent and only executing my code if enough time has passed. https://hello.planet.com/code/planetary-variables/forests/ascilero/-/merge_requests/6
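For reference, a sketch of where that rate limiter would sit, assuming the same flyte-core values layout as the configs above (the placement under agent-service.webApi is my assumption, and the fractional qps value only works once the float support from the linked PR lands):

flyteagent:
  enabled: true
  plugin_config:
    plugins:
      agent-service:
        webApi:
          readRateLimiter:
            # Values taken from the Slack suggestion above: ~1 read per minute
            burst: 0
            qps: 0.01666666666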
@thomas-maschler I've been doing some work on this recently and have successfully been able to configure agent/connector update frequency using the webApi.caching.resyncInterval value.
My understanding (based on the discussion here and the code comments) is that:
- The resyncInterval affects how often an agent's 'get' method is polled. Tasks that are waiting on a status resolution are queued to an internal rate-limiting queue at the resyncInterval frequency.
- The read/writeRateLimiter settings then affect how fast items are processed from this queue.

(If I've misunderstood, definitely let me know!)
Unfortunately, as discussed above, this setting only works at the hierarchy level where it applies to all agents. It would be extremely useful to be able to configure this per task type.
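For reference, a minimal sketch of the placement described, assuming the same flyte-core values layout as the configs above (the webApi block sits directly under agent-service, so the interval applies to every agent behind the plugin, not per agent or task type):

flyteagent:
  enabled: true
  plugin_config:
    plugins:
      agent-service:
        defaultAgent:
          endpoint: "dns:///flyteagent.flyte-core.svc.cluster.local:8000"
          insecure: true
        # Applies to all agents served by this plugin
        webApi:
          caching:
            resyncInterval: 60s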
#take
"Hello 👋, this issue has been inactive for over 90 days. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏"
Hello 👋, this issue has been inactive for over 90 days and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏