Danielsola/se 187 experiment with auto cache versioning v2
Why are the changes needed?
Automatic cache versioning could simplify the UX: cache versions would no longer need to be bumped manually each time a task changes. Forgetting to bump the cache version manually leads to confusing behavior, because Flyte hits the cache and does not execute the new code.
This code was thrown together to prototype the idea; it's a bit messy and could be optimized a lot. I'd like to get thoughts and opinions from the community here. There are a lot of considerations, including non-Python dependencies, but this might be a starting point for scoping out this feature.
What changes were proposed in this pull request?
Added a new argument to the task decorator, which is an AutoCacheConfig:
@dataclass
class AutoCacheConfig:
    check_task: bool = False
    check_packages: bool = False
    check_modules: bool = False
    image_spec: Optional[ImageSpec] = None
This lets you determine what is checked when automatically generating a cache version for a task.
check_task -> this checks the code contents of the task, ignoring any comments or formatting changes
check_packages -> this traverses the imports in the task and checks the package version of any externally imported packages
check_modules -> this traverses the imports in the task and checks the contents of any imported modules that are defined in the current repo
image_spec -> this checks the tag of an ImageSpec
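As a rough illustration of how these flags could gate what feeds the cache version, here is a minimal sketch. It is not the code in this PR: `SketchConfig`, `combine_version`, and the `parts` mapping are hypothetical names, the image spec is reduced to a plain tag string, and SHA-256 is an assumption (it is merely consistent with the 64-character hex versions shown in the test section).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for AutoCacheConfig; image_spec is reduced to a tag string."""

    check_task: bool = False
    check_packages: bool = False
    check_modules: bool = False
    image_tag: Optional[str] = None


def combine_version(config: SketchConfig, parts: Dict[str, Callable[[], str]]) -> str:
    """Hash only the components whose checks are enabled, in a fixed order."""
    digest = hashlib.sha256()
    for name in ("check_task", "check_packages", "check_modules"):
        if getattr(config, name):
            # Each component is computed lazily, so disabled checks cost nothing.
            digest.update(name.encode())
            digest.update(parts[name]().encode())
    if config.image_tag is not None:
        digest.update(b"image_spec")
        digest.update(config.image_tag.encode())
    return digest.hexdigest()
```

Because disabled components never reach the digest, a change to a module cannot affect the version unless `check_modules` is set, which matches the behavior described in the test section.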
How was this patch tested?
Test case has this structure:
.
├── __init__.py
├── example_wf.py
├── something_to_import.py
└── something_to_import_2.py
# example_wf.py
from flytekit import task, workflow
from flytekit.auto_cache.auto_cache import AutoCacheConfig
auto_cache_config = AutoCacheConfig(<some values here>)
@task(auto_cache_config=AutoCacheConfig(check_task=True, check_modules=True, check_packages=True))
def some_task():
    import pandas as pd
    from something_to_import import my_function
    # some random comment
    my_function()
    print(pd.DataFrame())
@workflow
def wf():
    some_task()
if __name__ == "__main__":
    wf()
# something_to_import.py
from something_to_import_2 import VAL, my_other_function
def my_function():
    print("Hello", VAL)
    my_other_function()
# something_to_import_2.py
import numpy as np
VAL = 2
def my_other_function():
    print(np.array([1,2,3]))
Some examples of the cache version under different conditions:
- With
auto_cache_config = AutoCacheConfig():
No matter the change to any module, the cache version doesn't change.
- With
auto_cache_config = AutoCacheConfig(check_task=True):
No matter the change to something_to_import.py or something_to_import_2.py, the cache version doesn't change. The following examples show which changes to the task do and do not affect the cache version:
@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function
    # some random comment
    my_function()
    print(pd.DataFrame())
config version = 346f4b191fa0471e397496636b1b5df04cbbb3332eb9847edc9aa5b435883109
@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function
    # some random comment with formatting change
    my_function()
    print(pd.DataFrame())
config version = 346f4b191fa0471e397496636b1b5df04cbbb3332eb9847edc9aa5b435883109 (same as original)
@task(auto_cache_config=auto_cache_config)
def some_task():
    import pandas as pd
    from something_to_import import my_function
    # some random comment
    my_function()
    print(pd.DataFrame())
    print("new code")
config version = 5b6d0622517b6d68bd162bc49ac0eeede688579eaa8b1c5e9230e7d06d14ad73 (different than original)
The above behavior will be consistent no matter the other arguments to AutoCacheConfig.
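One way to get this "comments and formatting don't count" behavior is to hash the parsed AST rather than the raw source text. The sketch below is an assumption about how that could work, not the implementation in this PR; `source_fingerprint` is a hypothetical name, and SHA-256 is assumed. For a live function, the source string could come from `inspect.getsource`.

```python
import ast
import hashlib
import textwrap


def source_fingerprint(source: str) -> str:
    """SHA-256 of the AST dump, so comments and whitespace do not matter."""
    # dedent lets this accept the source of a nested or indented function.
    tree = ast.parse(textwrap.dedent(source))
    # ast.dump omits line/column attributes by default, so layout is ignored.
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()


original = '''
def some_task():
    # some random comment
    my_function()
'''
reformatted = '''
def some_task():

    # a different comment
    my_function( )
'''
changed = '''
def some_task():
    my_function()
    print("new code")
'''

same = source_fingerprint(original) == source_fingerprint(reformatted)
different = source_fingerprint(original) != source_fingerprint(changed)
```

One caveat with this approach: docstrings are part of the AST, so editing a docstring would still change the fingerprint unless they are stripped first.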
- With
auto_cache_config = AutoCacheConfig(check_modules=True):
Changes to something_to_import.py and something_to_import_2.py affect the cache version (formatting does have an effect right now).
# something_to_import.py
from something_to_import_2 import VAL, my_other_function
def my_function():
    print("Hello", VAL)
    my_other_function()
# something_to_import_2.py
import numpy as np
VAL = 2
def my_other_function():
    print(np.array([1,2,3]))
config version = 30ad245f8785b25188f303a199d5155c552ab4b08274c8f9dc858dab12f04513
# something_to_import.py
from something_to_import_2 import VAL, my_other_function
def my_function():
    print("New Print Statement", VAL)
    my_other_function()
config version = e07532a55732b01cd71821d6731c5596660d88ce40ecb68505a17c8c5fc30275 (different than original)
# something_to_import_2.py
import numpy as np
VAL = 10000 # new number
def my_other_function():
    print(np.array([1,2,3]))
config version = d6747baae014c66c14d320e8d02d374409668df25da97df79c144f5687a055b6 (different than original)
The above behavior will be consistent no matter the other arguments to AutoCacheConfig.
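The transitive behavior above (a change in something_to_import_2.py is picked up even though the task only imports something_to_import.py) can be sketched as a small import walk. This is an illustration under stated assumptions, not the PR's code: `imported_names` and `local_modules_fingerprint` are hypothetical names, only flat `name.py` files directly under the repo root are resolved (no packages or relative imports), and raw file bytes are hashed, which is why formatting changes would alter the result.

```python
import ast
import hashlib
from pathlib import Path
from typing import List, Set


def imported_names(source: str) -> Set[str]:
    """Top-level module names imported anywhere in the source, including inside functions."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def local_modules_fingerprint(entry: Path, repo_root: Path) -> str:
    """Hash the raw contents of every repo-local module reachable from entry."""
    seen: Set[Path] = set()
    stack: List[Path] = [entry]
    while stack:
        path = stack.pop()
        if path in seen:
            continue  # also guards against circular imports
        seen.add(path)
        for name in imported_names(path.read_text()):
            candidate = repo_root / f"{name}.py"
            if candidate.exists():  # external packages such as numpy are skipped
                stack.append(candidate)
    digest = hashlib.sha256()
    # The entry file itself is excluded: its contents belong to the task check.
    for path in sorted(seen - {entry}):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Calling `local_modules_fingerprint(Path("example_wf.py"), Path("."))` from the test directory would cover both imported modules; hashing an AST dump instead of raw bytes would make this check insensitive to formatting as well.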
- With
auto_cache_config = AutoCacheConfig(check_packages=True):
Changes to imported packages affect cache version:
pip freeze | grep -E 'pandas|numpy'
numpy==1.26.2
pandas==2.1.2
config version = 0fc6300d289f6cfbc66179dc07c48fe5ede106f9e29455a83733892e2b7981cc
pip install pandas==2.2.2
pip freeze | grep -E 'pandas|numpy'
numpy==1.26.2
pandas==2.2.2
config version = e3d453686c0ea4f60a1839a026f4613b5946f8518ec8e5c8ac7c219d727bc369 (different than original)
pip install pandas==2.1.2
pip install numpy==1.26.3
pip freeze | grep -E 'pandas|numpy'
numpy==1.26.3
pandas==2.1.2
config version = 0fc6300d289f6cfbc66179dc07c48fe5ede106f9e29455a83733892e2b7981cc (different than original)
The above behavior will be consistent no matter the other arguments to AutoCacheConfig.
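A package check along these lines could be built on `importlib.metadata` from the standard library. The sketch below is an assumption about the approach rather than the PR's implementation: `installed_version` and `packages_fingerprint` are hypothetical names, SHA-256 is assumed, and the import name is treated as the distribution name, which does not hold for every package (for example `sklearn` vs `scikit-learn`).

```python
import hashlib
from importlib import metadata
from typing import Callable, Iterable


def installed_version(name: str) -> str:
    """Installed version of a distribution, or a marker when it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not-installed"


def packages_fingerprint(
    names: Iterable[str],
    lookup: Callable[[str], str] = installed_version,
) -> str:
    """Hash sorted name==version pairs, so import order does not matter."""
    digest = hashlib.sha256()
    for name in sorted(set(names)):
        digest.update(f"{name}=={lookup(name)}\n".encode())
    return digest.hexdigest()


# Usage with explicit version tables instead of the live environment:
before = packages_fingerprint(["pandas", "numpy"], {"pandas": "2.1.2", "numpy": "1.26.2"}.__getitem__)
after = packages_fingerprint(["numpy", "pandas"], {"pandas": "2.2.2", "numpy": "1.26.2"}.__getitem__)
```

Passing `lookup` explicitly keeps the function testable without installing or downgrading packages; calling `packages_fingerprint(["pandas", "numpy"])` with the default lookup reads the versions from the current environment.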
- With
auto_cache_config = AutoCacheConfig(image_spec=image_spec):
image_spec = ImageSpec(
    name="test-image",
    packages=["pandas", "numpy"],
    registry="ghcr.io/dansola",
)
auto_cache_config = AutoCacheConfig(image_spec=image_spec)
config version = 04f01bb5e35b47eee2a928c125f1ccb8ab65ca17a61859b38b64e5bfde17e823
image_spec = ImageSpec(
    name="test-image",
    packages=["pandas==2.2.2", "numpy"], # pin pandas version
    registry="ghcr.io/dansola",
)
auto_cache_config = AutoCacheConfig(image_spec=image_spec)
config version = 336a800c4d1c16726b6e1de10c28e60264d1c58d587302c3a5f2b918da53f3b0 (different than original)
The above behavior will be consistent no matter the other arguments to AutoCacheConfig.
See the auto_cache_examples dir for the examples used here.
Screenshots
Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [ ] All new and existing tests passed.
- [ ] All commits are signed-off.
Related PRs
Docs link
This is pretty cool. There's enough interest in a feature like this that we have at least 3 different PRs tackling this problem at different levels (see the linked PRs in the original issue).
@dansola, do you want to take a stab at writing an RFC before deciding on one approach?