flyte icon indicating copy to clipboard operation
flyte copied to clipboard

Watch agent metadata service

Open pingsutw opened this issue 1 year ago • 15 comments

Tracking issue

https://github.com/flyteorg/flyte/issues/3936

Why are the changes needed?

Monitor the agent metadata service to sync the metadata (support_task_type, task_type_version, is_sync, etc.) so the Agent doesn't need to be started before Propeller.

What changes were proposed in this pull request?

Add a watcher to the agent plugin to periodically sync the metadata.

  • If the agent doesn't implement metadata service, the plugin will use metadata (support task type) in the config map
  • if Propeller gets the metadata from the agent, this metadata will override the existing metadata in the config map.

How was this patch tested?

Setup process

  agent-service:
    defaultAgent:
      defaultTimeout: 10s
      endpoint: localhost:8005
      insecure: true
    agents:
      myAgent:
        endpoint: localhost:8000
        insecure: true

Screenshots

The agent is not stated. propeller sends the request to the default agent localhost:8005

image

Start running agent. propeller sends the request to the default myAgent localhost:8000

image

Check all the applicable boxes

  • [x] I updated the documentation accordingly.
  • [x] All new and existing tests passed.
  • [x] All commits are signed-off.

Related PRs

NA

Docs link

NA

pingsutw avatar Mar 06 '24 22:03 pingsutw

Codecov Report

Attention: Patch coverage is 47.61905% with 33 lines in your changes missing coverage. Please review.

Project coverage is 60.98%. Comparing base (38883c7) to head (e91de3b). Report is 137 commits behind head on master.

Files with missing lines Patch % Lines
...yteplugins/go/tasks/plugins/webapi/agent/client.go 50.00% 14 Missing and 3 partials :warning:
...yteplugins/go/tasks/plugins/webapi/agent/plugin.go 44.82% 16 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5017      +/-   ##
==========================================
- Coverage   60.99%   60.98%   -0.02%     
==========================================
  Files         793      793              
  Lines       51295    51313      +18     
==========================================
+ Hits        31289    31294       +5     
- Misses      17125    17134       +9     
- Partials     2881     2885       +4     
Flag Coverage Δ
unittests-datacatalog 69.31% <ø> (ø)
unittests-flyteadmin 58.66% <ø> (+0.02%) :arrow_up:
unittests-flytecopilot 17.79% <ø> (ø)
unittests-flytectl 67.97% <ø> (ø)
unittests-flyteidl 79.04% <ø> (ø)
unittests-flyteplugins 61.84% <47.61%> (-0.10%) :arrow_down:
unittests-flytepropeller 57.32% <ø> (ø)
unittests-flytestdlib 65.82% <ø> (+0.02%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 06 '24 23:03 codecov[bot]

Wow, the watcher is amazing, will do more test about it. image

Future-Outlier avatar Mar 07 '24 04:03 Future-Outlier

cc @honnix , this is the current watcher implementation. To save your time, let me give you a summary of how this watcher works.

  • It will not panic even the agent server is not started.
  • It will update AgentRegistry every 10 seconds, so it is available that you can start the agent server after propeller started
  • you sill needs to write supportedTaskType in your configmap, since propeller will not wait your agent server give you all supportedTaskType

Does it look well to you? Do you have any thoughts to enable the propeller to get supportedTaskType after agent servers start? Thank you for your patience ~ ! : )

Future-Outlier avatar Mar 07 '24 15:03 Future-Outlier

@Future-Outlier Thank you for summarizing. This approach looks good to me.

However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

honnix avatar Mar 08 '24 17:03 honnix

@Future-Outlier Thank you for summarizing. This approach looks good to me.

However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?

Future-Outlier avatar Mar 09 '24 07:03 Future-Outlier

@Future-Outlier Thank you for summarizing. This approach looks good to me. However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?

@Future-Outlier Thank you for the confirmation. No it is not a problem for us, and I think in general it looks like a good change (being explicit is good). As long as it is noted in the release notes, I think it should be fine.

honnix avatar Mar 09 '24 09:03 honnix

@Future-Outlier Thank you for summarizing. This approach looks good to me. However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?

@Future-Outlier Thank you for the confirmation. No it is not a problem for us, and I think in general it looks like a good change (being explicit is good). As long as it is noted in the release notes, I think it should be fine.

@honnix , thank you, we will try to add more information in the release notes. Also, there's a change about readiness probe for the agent pod, which means that in the upcoming release, agent pod needs to have health check servicer. (It can be turned off by the configmap) flyte config map: https://github.com/flyteorg/flyte/pull/4990 flytekit agent server implementation: https://github.com/flyteorg/flytekit/pull/2232

If there's anything we can help, please leave comments, thank you!

Future-Outlier avatar Mar 09 '24 12:03 Future-Outlier

@Future-Outlier Thank you for the heads-up. If I understood correctly, this only matters if we deploy the agent service as a pod.

honnix avatar Mar 11 '24 11:03 honnix

@Future-Outlier Thank you for the heads-up. If I understood correctly, this only matters if we deploy the agent service as a pod.

Yes, you are right!

Future-Outlier avatar Mar 11 '24 12:03 Future-Outlier

you sill needs to write supportedTaskType in your configmap, since propeller will not wait your agent server give you all supportedTaskType

I'm going to update this PR, so we're also able to reload support task type at runtime.

pingsutw avatar Mar 11 '24 17:03 pingsutw

Any plan to get this going? Thanks.

honnix avatar Apr 12 '24 07:04 honnix

Any plan to get this going? Thanks.

Hi, sorry for the late reply, I will discuss this with Kevin and update our discussion this week, thank you for your patience!

Future-Outlier avatar Apr 15 '24 14:04 Future-Outlier

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?

Future-Outlier avatar Apr 23 '24 04:04 Future-Outlier

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?

Hi @Future-Outlier . Thank you for the update. This is not urgent so please take your time.

honnix avatar Apr 23 '24 07:04 honnix

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?

Hi @Future-Outlier . Thank you for the update. This is not urgent so please take your time.

Doing it, will make some variables from private to public so that we can dynamically update support task types, thank you for waiting, will ask for your review after finish it.

Future-Outlier avatar May 19 '24 03:05 Future-Outlier

To give a real world incident: we just had a flytepropeller startup trouble due to one agent service being unavailable. This feature will be highly appreciated.

honnix avatar Jun 07 '24 15:06 honnix

To give a real world incident: we just had a flytepropeller startup trouble due to one agent service being unavailable. This feature will be highly appreciated.

No problem, fighting on it with Kevin.

Future-Outlier avatar Jun 08 '24 00:06 Future-Outlier