Implement Cleanup of Temporary Directories for built-in Materializers
Open Source Contributors Welcome!
Please comment below if you would like to work on this issue!
Contact Details [Optional]
What happened?
Some materializers in ZenML create local temporary directories to handle files from the artifact store. These directories are not being cleaned up automatically, leading to potential clutter and storage issues, especially when large or numerous temporary files are involved.
Task Description
Develop a solution within ZenML to ensure that temporary directories created by materializers are cleaned up efficiently after their use. This could involve direct cleanup mechanisms within materializers or a centralized approach, such as a dedicated temporary directory for ZenML step runs, which is automatically cleared post-execution.
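For illustration, a minimal sketch of the centralized option, assuming a hypothetical `StepTempDirectory` wrapper (the class name and its use by the step runner are illustrative, not existing ZenML API):

```python
import tempfile
from pathlib import Path


class StepTempDirectory:
    """Hypothetical step-scoped temporary directory: materializers running
    within a step request paths under it, and it is removed in one place
    when the step finishes."""

    def __enter__(self) -> Path:
        self._tmp = tempfile.TemporaryDirectory(prefix="zenml-step-")
        return Path(self._tmp.name)

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even if the step raised, so nothing is left dangling.
        self._tmp.cleanup()


# How a step runner might use it:
with StepTempDirectory() as tmp_path:
    scratch = tmp_path / "model.bin"
    # ... materializers write and read scratch files under tmp_path ...
# tmp_path and everything beneath it is gone here.
```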
Expected Outcome
- Temporary directories used by materializers should be cleaned up automatically to avoid unnecessary storage use.
- The solution should handle scenarios where temporary files must persist for the lifetime of the Python object that uses them, but are cleared once they are no longer needed (see the sketch after this list).
- This enhancement will improve resource management and maintain cleanliness in the file system.
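For the second outcome, one plausible pattern is to tie the directory's lifetime to the loaded object with a finalizer, so cleanup runs when the object is garbage collected. This is a sketch, not existing ZenML code; `LazilyLoadedArtifact` is a made-up stand-in:

```python
import tempfile
import weakref


class LazilyLoadedArtifact:
    """Stand-in for an object whose backing files must outlive load()."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory(prefix="zenml-artifact-")
        self.path = self._tmp.name
        # Delete the directory when this object is garbage collected,
        # not when load() returns.
        weakref.finalize(self, self._tmp.cleanup)
```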
Steps to Implement
- Analyze the current implementation of materializers and identify how and where temporary directories are being used.
- Design and implement a cleanup mechanism that is triggered once these temporary directories are no longer in use.
- Consider the creation of a dedicated ZenML temporary directory for step runs, which is automatically cleared after a step is completed.
- Test the implementation to ensure that temporary files are appropriately cleared without affecting the functionality of materializers (a sketch of such a check follows this list).
- Update documentation to inform users about this behavior and any potential impact on their existing workflows.
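A sketch of such a check (the `zenml-` prefix and the placeholder round trip are assumptions, not how ZenML actually names its temporary directories):

```python
import glob
import os
import tempfile
from pathlib import Path


def exercise_materializer() -> None:
    """Placeholder for a real save()/load() round trip through a materializer."""
    with tempfile.TemporaryDirectory(prefix="zenml-") as tmp:
        Path(tmp, "scratch.bin").write_bytes(b"data")


def test_no_leaked_temp_dirs() -> None:
    pattern = os.path.join(tempfile.gettempdir(), "zenml-*")
    before = set(glob.glob(pattern))
    exercise_materializer()
    after = set(glob.glob(pattern))
    assert after <= before, f"leaked temp dirs: {after - before}"
```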
Additional Context
Ensuring efficient cleanup of temporary directories is crucial for sustainable resource management and to maintain optimal performance in data pipelines managed by ZenML.
Code of Conduct
- [ ] I agree to follow this project's Code of Conduct
Can you assign this to me, please?
Is there a documented reproduction case for this?
Analyze the current implementation of materializers and identify how and where temporary directories are being used.
I've examined all of the materializers in zenml/src/zenml/materializers, and here's what I've found:
- 100% of their file operations go through the artifact store
- None of the built-in materializers create temporary directories, either through the artifact store or otherwise
Do we have a list of the materializers that are exhibiting the offending behavior? Without knowing how they are creating the temporary directories, designing a solution is quite difficult.
Most of the 'real' materializers that people use are in src/zenml/integrations and then inside the various integrations. I'd check those out to see a variety of other behaviors.
Looks like these are the integrations that create temporary files or directories in their materializers:
zenml/src/zenml/integrations/bentoml
zenml/src/zenml/integrations/huggingface
zenml/src/zenml/integrations/lightgbm
zenml/src/zenml/integrations/llama_index
zenml/src/zenml/integrations/pillow
zenml/src/zenml/integrations/polars
zenml/src/zenml/integrations/pycaret
zenml/src/zenml/integrations/tensorflow
zenml/src/zenml/integrations/whylogs
zenml/src/zenml/integrations/xgboost
I'll take a look
Most of those materializers are already using the tempfile module; they just aren't cleaning up after themselves. (Some do clean up, some are inconsistent within the same module, and others have logic that can leave temporary files dangling depending on the code path or whether exceptions are encountered.) Changing the implementations to use the tempfile classes as context managers is sufficient for the average case here; see the sketch below.
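Roughly, the before/after for a typical save() looks like this (a sketch: `save_model` stands in for whatever to-disk call the integration makes, and the `fileio.copy` usage is approximate):

```python
import os
import tempfile

from zenml.io import fileio  # ZenML's filesystem abstraction


# Before: the temporary directory is created but nothing ever removes it.
def save_leaky(model, artifact_uri: str) -> None:
    tmp_dir = tempfile.mkdtemp()
    local_path = os.path.join(tmp_dir, "model.txt")
    model.save_model(local_path)
    fileio.copy(local_path, os.path.join(artifact_uri, "model.txt"))


# After: the context manager removes the directory on exit,
# even when an exception is raised mid-copy.
def save_clean(model, artifact_uri: str) -> None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "model.txt")
        model.save_model(local_path)
        fileio.copy(local_path, os.path.join(artifact_uri, "model.txt"))
```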
However there are a few corner cases:
- huggingface_dataset_materializer intentionally does not clean up after itself (see #1135), because the integration requires the original path used to materialize the dataset to remain present as the dataset moves through various processing stages. There is likely additional work to be done in this integration to resolve how those references are created/updated/deleted, but I don't know enough about it yet to suggest the right changes.
- the llama_index materializer is 100% commented out. Is this worth fixing?
- both of the lightgbm materializer save() methods use the tempfile context manager but explicitly disable delete-on-exit (sketched after this list). Looking at #569, I'm not clear why it needs to be this way. Shouldn't it be safe to delete the original file as soon as it's been copied into the artifact store?
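The delete-on-exit pattern in question looks roughly like this (a sketch; the actual lightgbm materializer code may differ):

```python
import os
import tempfile

import lightgbm as lgb
from zenml.io import fileio


# Roughly the current shape: delete-on-exit is disabled, so the temporary
# file survives the with-block and nothing ever removes it.
def save_current(booster: lgb.Booster, artifact_uri: str) -> None:
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
        booster.save_model(f.name)
        fileio.copy(f.name, os.path.join(artifact_uri, "model.txt"))
    # f.name still exists on disk here.
```

If delete-on-exit is safe once the copy completes, this collapses into the same context-managed pattern sketched earlier in the thread.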
I have a patch that fixes the common cases; I'll work that into a PR and improve the tests where possible. Can anyone answer for these three special-case integrations?
@akesterson sorry for the delay.
- I think you can leave the HF dataset materializer to one side as a light exception, for the reasons you mention.
- we're handling llama_index on our side, and as you say it's non-functional at the moment anyway
- I also don't see a reason why lightgbm shouldn't allow delete-on-exit in the context manager, and I'm not sure why the keras materializer you mention handles it that way. I'd first try to do it within the context manager and, if for some reason that doesn't work, handle it explicitly. But yeah, I don't see why we can't delete the file after it's been copied.
Thanks!
Sorry for the long delay here. PR is up. There are some comments around the best way to test.
https://github.com/zenml-io/zenml/pull/2560
I discovered another materializer that is a special case. The tensorflow dataset materializer intentionally does not clean up after itself in the load() stage, because the dataset is lazily loaded from the generated temporary directory. A decision needs to be made as to whether these files are actually temporary, or whether the artifact store needs to support this pattern differently.
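For context, the shape of the problem (a sketch; the copy step is elided and the exact tf.data call in the materializer may differ):

```python
import tempfile

import tensorflow as tf


def load(artifact_uri: str) -> tf.data.Dataset:
    tmp_dir = tempfile.mkdtemp(prefix="zenml-tf-")
    # ... copy the serialized dataset from artifact_uri into tmp_dir ...
    dataset = tf.data.Dataset.load(tmp_dir)  # no data is read yet
    # Deleting tmp_dir now would break later iteration: elements are
    # streamed from the files on first use, not at load() time.
    return dataset
```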