Implement Cleanup of Temporary Directories for built-in Materializers
Open Source Contributors Welcome!
Please comment below if you would like to work on this issue!
Contact Details [Optional]
What happened?
Some materializers in ZenML create local temporary directories to handle files from the artifact store. These directories are not being cleaned up automatically, leading to potential clutter and storage issues, especially when large or numerous temporary files are involved.
Task Description
Develop a solution within ZenML to ensure that temporary directories created by materializers are cleaned up efficiently after their use. This could involve direct cleanup mechanisms within materializers or a centralized approach, such as a dedicated temporary directory for ZenML step runs, which is automatically cleared post-execution.
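For illustration, a minimal sketch of the centralized option, assuming a hypothetical `StepTempDirectory` wrapper (the class name and its use by the step runner are illustrative, not existing ZenML API):

```python
import tempfile
from pathlib import Path


class StepTempDirectory:
    """Hypothetical step-scoped temporary directory: materializers running
    within a step request paths under it, and it is removed in one place
    when the step finishes."""

    def __enter__(self) -> Path:
        self._tmp = tempfile.TemporaryDirectory(prefix="zenml-step-")
        return Path(self._tmp.name)

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even if the step raised, so nothing is left dangling.
        self._tmp.cleanup()


# How a step runner might use it:
with StepTempDirectory() as tmp_path:
    scratch = tmp_path / "model.bin"
    # ... materializers write and read scratch files under tmp_path ...
# tmp_path and everything beneath it is gone here.
```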
Expected Outcome
- Temporary directories used by materializers should be cleaned up automatically to avoid unnecessary storage use.
- The solution should handle scenarios where temporary files must persist for the lifetime of the Python object that uses them, but are cleared once they are no longer needed (see the sketch after this list).
- This enhancement will improve resource management and maintain cleanliness in the file system.
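For the second outcome, one plausible pattern is to tie the directory's lifetime to the loaded object with a finalizer, so cleanup runs when the object is garbage collected. This is a sketch, not existing ZenML code; `LazilyLoadedArtifact` is a made-up stand-in:

```python
import tempfile
import weakref


class LazilyLoadedArtifact:
    """Stand-in for an object whose backing files must outlive load()."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory(prefix="zenml-artifact-")
        self.path = self._tmp.name
        # Delete the directory when this object is garbage collected,
        # not when load() returns.
        weakref.finalize(self, self._tmp.cleanup)
```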
Steps to Implement
- Analyze the current implementation of materializers and identify how and where temporary directories are being used.
- Design and implement a cleanup mechanism that is triggered once these temporary directories are no longer in use.
- Consider the creation of a dedicated ZenML temporary directory for step runs, which is automatically cleared after a step is completed.
- Test the implementation to ensure that temporary files are appropriately cleared without affecting the functionality of materializers (a sketch of such a check follows this list).
- Update documentation to inform users about this behavior and any potential impact on their existing workflows.
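A sketch of such a check (the `zenml-` prefix and the placeholder round trip are assumptions, not how ZenML actually names its temporary directories):

```python
import glob
import os
import tempfile
from pathlib import Path


def exercise_materializer() -> None:
    """Placeholder for a real save()/load() round trip through a materializer."""
    with tempfile.TemporaryDirectory(prefix="zenml-") as tmp:
        Path(tmp, "scratch.bin").write_bytes(b"data")


def test_no_leaked_temp_dirs() -> None:
    pattern = os.path.join(tempfile.gettempdir(), "zenml-*")
    before = set(glob.glob(pattern))
    exercise_materializer()
    after = set(glob.glob(pattern))
    assert after <= before, f"leaked temp dirs: {after - before}"
```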
Additional Context
Ensuring efficient cleanup of temporary directories is crucial for sustainable resource management and to maintain optimal performance in data pipelines managed by ZenML.
Code of Conduct
- [ ] I agree to follow this project's Code of Conduct
Can you assign this to me, please?
Is there a documented reproduction case for this?
Analyze the current implementation of materializers and identify how and where temporary directories are being used.
I've examined all of the materializers in zenml/src/zenml/materializers, and here's what I've found:
- 100% of their file operations go through the artifact store
- None of the built-in materializers create temporary directories, either through the artifact store or otherwise
Do we have a list of the materializers that are exhibiting the offending behavior? Without knowing how they are creating the temporary directories, designing a solution is quite difficult.
Most of the 'real' materializers that people use are in src/zenml/integrations and then inside the various integrations. I'd check those out to see a variety of other behaviors.
Looks like these are the integrations that create temporary files or directories in their materializers:
zenml/src/zenml/integrations/bentoml
zenml/src/zenml/integrations/huggingface
zenml/src/zenml/integrations/lightgbm
zenml/src/zenml/integrations/llama_index
zenml/src/zenml/integrations/pillow
zenml/src/zenml/integrations/polars
zenml/src/zenml/integrations/pycaret
zenml/src/zenml/integrations/tensorflow
zenml/src/zenml/integrations/whylogs
zenml/src/zenml/integrations/xgboost
I'll take a look
Most of those materializers are already using the tempfile module; they just aren't cleaning up after themselves. (Some do clean up, some are inconsistent within the same module, and others have logic that can leave temporary files dangling depending on the code path or whether exceptions are encountered.) Changing the implementations to use the tempfile classes as context managers is sufficient for the average case here; see the sketch below.
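Roughly, the before/after for a typical save() looks like this (a sketch: `save_model` stands in for whatever to-disk call the integration makes, and the `fileio.copy` usage is approximate):

```python
import os
import tempfile

from zenml.io import fileio  # ZenML's filesystem abstraction


# Before: the temporary directory is created but nothing ever removes it.
def save_leaky(model, artifact_uri: str) -> None:
    tmp_dir = tempfile.mkdtemp()
    local_path = os.path.join(tmp_dir, "model.txt")
    model.save_model(local_path)
    fileio.copy(local_path, os.path.join(artifact_uri, "model.txt"))


# After: the context manager removes the directory on exit,
# even when an exception is raised mid-copy.
def save_clean(model, artifact_uri: str) -> None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "model.txt")
        model.save_model(local_path)
        fileio.copy(local_path, os.path.join(artifact_uri, "model.txt"))
```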
However there are a few corner cases:
- huggingface_dataset_materializer intentionally does not clean up after itself (see #1135), because the integration requires the original path used to materialize the dataset to remain present as the dataset moves through various processing stages. There is likely additional work to be done in this integration to resolve how those references are created/updated/deleted, but I don't know enough about it yet to suggest the right changes.
- the llama_index materializer is 100% commented out. Is this worth fixing?
- both of the lightgbm materializer save() methods use the tempfile context manager but explicitly disable delete-on-exit (sketched after this list). Looking at #569, I'm not clear why it needs to be this way. Shouldn't it be safe to delete the original file as soon as it's been copied into the artifact store?
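The delete-on-exit pattern in question looks roughly like this (a sketch; the actual lightgbm materializer code may differ):

```python
import os
import tempfile

import lightgbm as lgb
from zenml.io import fileio


# Roughly the current shape: delete-on-exit is disabled, so the temporary
# file survives the with-block and nothing ever removes it.
def save_current(booster: lgb.Booster, artifact_uri: str) -> None:
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
        booster.save_model(f.name)
        fileio.copy(f.name, os.path.join(artifact_uri, "model.txt"))
    # f.name still exists on disk here.
```

If delete-on-exit is safe once the copy completes, this collapses into the same context-managed pattern sketched earlier in the thread.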
I have a patch that fixes the common cases; I'll work that into a PR and improve the tests where possible. Can anyone answer for these three special-case integrations?
@akesterson sorry for the delay.
- I think you can leave the HF dataset materializer to one side as a light exception, for the reasons you mention.
- we're handling llama_index on our side, and as you say it's non-functional at the moment anyway
- I also don't see a reason why lightgbm shouldn't allow delete-on-exit in the context manager, and I'm not sure why the keras materializer you mention handles it that way. I'd first try to do it within the context manager and, if for some reason that doesn't work, handle it explicitly. But yeah, I don't see why we can't delete the file after it's been copied.
Thanks!
Sorry for the long delay here. PR is up. There are some comments around the best way to test.
https://github.com/zenml-io/zenml/pull/2560
I discovered another materializer that is a special case. The tensorflow dataset materializer intentionally does not clean up after itself in the load() stage, because the dataset is lazily loaded from the generated temporary directory. A decision needs to be made as to whether these files are actually temporary, or whether the artifact store needs to support this pattern differently.
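For context, the shape of the problem (a sketch; the copy step is elided and the exact tf.data call in the materializer may differ):

```python
import tempfile

import tensorflow as tf


def load(artifact_uri: str) -> tf.data.Dataset:
    tmp_dir = tempfile.mkdtemp(prefix="zenml-tf-")
    # ... copy the serialized dataset from artifact_uri into tmp_dir ...
    dataset = tf.data.Dataset.load(tmp_dir)  # no data is read yet
    # Deleting tmp_dir now would break later iteration: elements are
    # streamed from the files on first use, not at load() time.
    return dataset
```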