flyte icon indicating copy to clipboard operation
flyte copied to clipboard

[WIP] Improve Copilot

Open machichima opened this issue 6 months ago • 4 comments

Tracking issue

Why are the changes needed?

When running ContainerTask, we discovered that the pod stuck on uploader container for a long time after the main task succeed. The uploader should not take so much time. This is because uploader sidecar starts after the task container finishes, so that it cannot detect any active process (the main container already completed), and it waits until timeout (1m40s).

image

What changes were proposed in this pull request?

Convert the uploader container from "main container" to "sidecar container" (which is the init container with restartPolicy set to Always). This has following benefit:

  1. Ensure the uploader container starts before main container
  2. We can trap the SIGTERM signal to know when the main container completed. Therefore, we do not need to check if the main container start
    • Ref: https://kubernetes.io/docs/tutorials/configuration/pod-sidecar-containers/#benefits-of-a-built-in-sidecar-container

In the PR, I made following modification:

  1. Change uploader container to "sidecar contianer"
  2. Add a new watcher: SignalWatcher
  3. Remove the WaitToStart logic for uploader
  4. Use WaitToExit in Signal Watcher to check if the main container completed
  5. Remove kube api watcher, file watcher
  6. Let copilot use the logger level set in config

How was this patch tested?

Labels

Please add one or more of the following labels to categorize your PR:

  • added: For new features.
  • changed: For changes in existing functionality.
  • deprecated: For soon-to-be-removed features.
  • removed: For features being removed.
  • fixed: For any bug fixed.
  • security: In case of vulnerabilities

This is important to improve the readability of release notes.

Setup process

Screenshots

With my modification, it only takes 3 seconds to upload the result:

image

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [ ] All new and existing tests passed.
  • [x] All commits are signed-off.

Related PRs

Docs link

machichima avatar Jun 15 '25 05:06 machichima

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change draft PR review settings here, or contact your Bito workspace admin at [email protected].

flyte-bot avatar Jun 15 '25 05:06 flyte-bot

Codecov Report

Attention: Patch coverage is 35.48387% with 20 lines in your changes missing coverage. Please review.

Project coverage is 58.67%. Comparing base (9c536fb) to head (d483259). Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...lytecopilot/cmd/containerwatcher/signal_watcher.go 0.00% 20 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6501      +/-   ##
==========================================
+ Coverage   58.52%   58.67%   +0.14%     
==========================================
  Files         940      938       -2     
  Lines       71532    71361     -171     
==========================================
+ Hits        41864    41869       +5     
+ Misses      26489    26313     -176     
  Partials     3179     3179              
Flag Coverage Δ
unittests-datacatalog 59.03% <ø> (ø)
unittests-flyteadmin 56.25% <ø> (ø)
unittests-flytecopilot 39.56% <0.00%> (+8.56%) :arrow_up:
unittests-flytectl 64.72% <ø> (ø)
unittests-flyteidl 76.12% <ø> (ø)
unittests-flyteplugins 61.09% <100.00%> (+0.01%) :arrow_up:
unittests-flytepropeller 54.83% <ø> (ø)
unittests-flytestdlib 64.02% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 15 '25 05:06 codecov[bot]

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change draft PR review settings here, or contact your Bito workspace admin at [email protected].

flyte-bot avatar Jun 15 '25 06:06 flyte-bot

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change draft PR review settings here, or contact your Bito workspace admin at [email protected].

flyte-bot avatar Jun 15 '25 09:06 flyte-bot

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change draft PR review settings here, or contact your Bito workspace admin at [email protected].

flyte-bot avatar Jun 19 '25 03:06 flyte-bot

Check the behaviour when main container OOM, the sidecar container will still get SIGTERM and upload output if exists

image

machichima avatar Jun 20 '25 15:06 machichima

Also removed the shared namespace watcher as this watcher does not work in the current approach. It will be terminated by SIGTERM

image

machichima avatar Jun 21 '25 14:06 machichima

This is great will simplify copilot a lot, make it faster and more secure

kumare3 avatar Jun 24 '25 14:06 kumare3

Sorry for the late reply!

I just updated the description to include the test script I used and the new result screenshot

Thanks!

machichima avatar Jun 26 '25 05:06 machichima