Watch agent metadata service
Tracking issue
https://github.com/flyteorg/flyte/issues/3936
Why are the changes needed?
Monitor the agent metadata service to sync the metadata (support_task_type, task_type_version, is_sync, etc.) so the Agent doesn't need to be started before Propeller.
What changes were proposed in this pull request?
Add a watcher to the agent plugin to periodically sync the metadata.
- If the agent doesn't implement metadata service, the plugin will use metadata (support task type) in the config map
- if Propeller gets the metadata from the agent, this metadata will override the existing metadata in the config map.
How was this patch tested?
Setup process
agent-service:
defaultAgent:
defaultTimeout: 10s
endpoint: localhost:8005
insecure: true
agents:
myAgent:
endpoint: localhost:8000
insecure: true
Screenshots
The agent is not stated. propeller sends the request to the default agent localhost:8005
Start running agent. propeller sends the request to the default myAgent localhost:8000
Check all the applicable boxes
- [x] I updated the documentation accordingly.
- [x] All new and existing tests passed.
- [x] All commits are signed-off.
Related PRs
NA
Docs link
NA
Codecov Report
Attention: Patch coverage is 47.61905% with 33 lines in your changes missing coverage. Please review.
Project coverage is 60.98%. Comparing base (
38883c7) to head (e91de3b). Report is 137 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #5017 +/- ##
==========================================
- Coverage 60.99% 60.98% -0.02%
==========================================
Files 793 793
Lines 51295 51313 +18
==========================================
+ Hits 31289 31294 +5
- Misses 17125 17134 +9
- Partials 2881 2885 +4
| Flag | Coverage Δ | |
|---|---|---|
| unittests-datacatalog | 69.31% <ø> (ø) |
|
| unittests-flyteadmin | 58.66% <ø> (+0.02%) |
:arrow_up: |
| unittests-flytecopilot | 17.79% <ø> (ø) |
|
| unittests-flytectl | 67.97% <ø> (ø) |
|
| unittests-flyteidl | 79.04% <ø> (ø) |
|
| unittests-flyteplugins | 61.84% <47.61%> (-0.10%) |
:arrow_down: |
| unittests-flytepropeller | 57.32% <ø> (ø) |
|
| unittests-flytestdlib | 65.82% <ø> (+0.02%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Wow, the watcher is amazing, will do more test about it.
cc @honnix , this is the current watcher implementation. To save your time, let me give you a summary of how this watcher works.
- It will not panic even the agent server is not started.
- It will update
AgentRegistryevery 10 seconds, so it is available that you can start the agent server after propeller started - you sill needs to write
supportedTaskTypein your configmap, since propeller will not wait your agent server give you allsupportedTaskType
Does it look well to you?
Do you have any thoughts to enable the propeller to get supportedTaskType after agent servers start?
Thank you for your patience ~ ! : )
@Future-Outlier Thank you for summarizing. This approach looks good to me.
However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?
@Future-Outlier Thank you for summarizing. This approach looks good to me.
However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?
@honnix
Yes, you are correct, sorry for this breaking change.
Do you think it is not ok to have this breaking change?
Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go.
However, the code will be a little bit confusing.
Are you okay with the impact of this breaking change?
@Future-Outlier Thank you for summarizing. This approach looks good to me. However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?
@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to
agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?
@Future-Outlier Thank you for the confirmation. No it is not a problem for us, and I think in general it looks like a good change (being explicit is good). As long as it is noted in the release notes, I think it should be fine.
@Future-Outlier Thank you for summarizing. This approach looks good to me. However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?
@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to
agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?@Future-Outlier Thank you for the confirmation. No it is not a problem for us, and I think in general it looks like a good change (being explicit is good). As long as it is noted in the release notes, I think it should be fine.
@honnix , thank you, we will try to add more information in the release notes. Also, there's a change about readiness probe for the agent pod, which means that in the upcoming release, agent pod needs to have health check servicer. (It can be turned off by the configmap) flyte config map: https://github.com/flyteorg/flyte/pull/4990 flytekit agent server implementation: https://github.com/flyteorg/flytekit/pull/2232
If there's anything we can help, please leave comments, thank you!
@Future-Outlier Thank you for the heads-up. If I understood correctly, this only matters if we deploy the agent service as a pod.
@Future-Outlier Thank you for the heads-up. If I understood correctly, this only matters if we deploy the agent service as a pod.
Yes, you are right!
you sill needs to write supportedTaskType in your configmap, since propeller will not wait your agent server give you all supportedTaskType
I'm going to update this PR, so we're also able to reload support task type at runtime.
Any plan to get this going? Thanks.
Any plan to get this going? Thanks.
Hi, sorry for the late reply, I will discuss this with Kevin and update our discussion this week, thank you for your patience!
Any plan to get this going? Thanks.
Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?
Any plan to get this going? Thanks.
Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?
Hi @Future-Outlier . Thank you for the update. This is not urgent so please take your time.
Any plan to get this going? Thanks.
Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?
Hi @Future-Outlier . Thank you for the update. This is not urgent so please take your time.
Doing it, will make some variables from private to public so that we can dynamically update support task types, thank you for waiting, will ask for your review after finish it.
To give a real world incident: we just had a flytepropeller startup trouble due to one agent service being unavailable. This feature will be highly appreciated.
To give a real world incident: we just had a flytepropeller startup trouble due to one agent service being unavailable. This feature will be highly appreciated.
No problem, fighting on it with Kevin.