flytekit icon indicating copy to clipboard operation
flytekit copied to clipboard

Danielsola/se 187 experiment with auto cache versioning v2

Open dansola opened this issue 1 year ago • 1 comments

Why are the changes needed?

Automatic cache versioning might make UX easier in that cache versions don't need to be bumped manually each time a task is changed. Forgetting to manually change the cache version could result in confusing behavior as Flyte will hit the cache and not execute new code.

This code was thrown together to prototype the idea - its a bit messy and could be optimized a lot. Would like to get thoughts and opinions from the community here. There're a lot of considerations including non-python dependencies but this might be a starting point for scoping out this feature.

What changes were proposed in this pull request?

Added a new argument to the task decorator with is a AutoCacheConfig:

@dataclass
class AutoCacheConfig:
    check_task: bool = False
    check_packages: bool = False
    check_modules: bool = False
    image_spec: Optional[ImageSpec] = None

This let's you determine what is checked when automatically generating a cache version for a task.

check_task -> this just checks the code contents of the task ignoring any comments or formatting changes

check_packages -> this traverses the imports in the task and checks the package version of any externally imported packages

check_modules -> this traverses the imports in the task and checks the contents of any modules that were imported and are defined in the current repo

image_spec -> this checks the tag of an imagespec

How was this patch tested?

Test case has this structure:

.
├── __init__.py
├── example_wf.py
├── something_to_import.py
└── something_to_import_2.py
# example_wf.py

from flytekit import task, workflow
from flytekit.auto_cache.auto_cache import AutoCacheConfig

auto_cache_config = AutoCacheConfig(<some values here>)

@task(auto_cache_config=AutoCacheConfig(check_task=True, check_modules=True, check_packages=True))
def some_task():
    import pandas as pd
    from something_to_import import my_function

    # some random comment
    my_function()
    print(pd.DataFrame())


@workflow
def wf():
    some_task()


if __name__ == "__main__":
    wf()
# something_to_import.py

from something_to_import_2 import VAL, my_other_function


def my_function():
    print("Hello", VAL)
    my_other_function()
# something_to_import_2.py

import numpy as np

VAL = 2

def my_other_function():
    print(np.array([1,2,3]))

Some example of cache version given different conditions:

  1. With auto_cache_config = AutoCacheConfig():

No matter the change to any module, the cache version doesn't change.

  1. With auto_cache_config = AutoCacheConfig(check_task=True):

No matter the change to something_to_import.py or something_to_import_2.py the config version doesn't change. The following changes to the task affect the cache version:

@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function

    # some random comment
    my_function()
    print(pd.DataFrame())

config version = 346f4b191fa0471e397496636b1b5df04cbbb3332eb9847edc9aa5b435883109

@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function

    # some random comment with formatting change
    my_function()

    print(pd.DataFrame())

config version = 346f4b191fa0471e397496636b1b5df04cbbb3332eb9847edc9aa5b435883109 (same as original)

@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function

    # some random comment
    my_function()
    print(pd.DataFrame())
    print("new code")

config version = 5b6d0622517b6d68bd162bc49ac0eeede688579eaa8b1c5e9230e7d06d14ad73 (different than original)

The above behavior will be consistent no matter the other arguments to AutoCacheConfig.

  1. With auto_cache_config = AutoCacheConfig(check_modules=True):

Changes to something_to_import.py and something_to_import_2.py will affect cache version (formatting does have an affect right now).

# something_to_import.py

from something_to_import_2 import VAL, my_other_function


def my_function():
    print("Hello", VAL)
    my_other_function()
# something_to_import_2.py

import numpy as np

VAL = 2

def my_other_function():
    print(np.array([1,2,3]))

config version = 30ad245f8785b25188f303a199d5155c552ab4b08274c8f9dc858dab12f04513

# something_to_import.py

from something_to_import_2 import VAL, my_other_function


def my_function():
    print("New Print Statement", VAL)
    my_other_function()

config version = e07532a55732b01cd71821d6731c5596660d88ce40ecb68505a17c8c5fc30275 (different than original)

# something_to_import_2.py

import numpy as np

VAL = 10000  # new number

def my_other_function():
    print(np.array([1,2,3]))

config version = d6747baae014c66c14d320e8d02d374409668df25da97df79c144f5687a055b6 (different than original)

The above behavior will be consistent no matter the other arguments to AutoCacheConfig.

  1. With auto_cache_config = AutoCacheConfig(check_packages=True):

Changes to imported packages affect cache version:

pip freeze | grep -E 'pandas|numpy'
numpy==1.26.2
pandas==2.1.2

config version = 0fc6300d289f6cfbc66179dc07c48fe5ede106f9e29455a83733892e2b7981cc

pip install pandas==2.2.2
pip freeze | grep -E 'pandas|numpy'
numpy==1.26.2
pandas==2.2.2

config version = e3d453686c0ea4f60a1839a026f4613b5946f8518ec8e5c8ac7c219d727bc369 (different than original)

pip install pandas==2.1.2
pip install numpy==1.26.3
pip freeze | grep -E 'pandas|numpy'
numpy==1.26.3
pandas==2.1.2

config version = 0fc6300d289f6cfbc66179dc07c48fe5ede106f9e29455a83733892e2b7981cc (different than original)

The above behavior will be consistent no matter the other arguments to AutoCacheConfig.

  1. With auto_cache_config = AutoCacheConfig(image_spec=True):
image_spec = ImageSpec(
    name="test-image",
    packages=["pandas", "numpy"],
    registry="ghcr.io/dansola",
)

auto_cache_config = AutoCacheConfig(image_spec=image_spec)

config version = 04f01bb5e35b47eee2a928c125f1ccb8ab65ca17a61859b38b64e5bfde17e823

image_spec = ImageSpec(
    name="test-image",
    packages=["pandas==2.2.2", "numpy"], # pin pandas version
    registry="ghcr.io/dansola",
)

auto_cache_config = AutoCacheConfig(image_spec=image_spec)

config version = 336a800c4d1c16726b6e1de10c28e60264d1c58d587302c3a5f2b918da53f3b0 (different than original)

The above behavior will be consistent no matter the other arguments to AutoCacheConfig.

See the auto_cache_examples dir for the examples used here.

Screenshots

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [ ] All new and existing tests passed.
  • [ ] All commits are signed-off.

Related PRs

Docs link

dansola avatar Sep 12 '24 17:09 dansola

This is pretty cool. There's enough interest in a feature like this that we have at least 3 different PRs tackling this problem at different levels (See the linked PRs in the original issue).

@dansola , do you want to take a stab at writing an RFC before deciding on one approach?

eapolinario avatar Sep 16 '24 18:09 eapolinario