Implement Recursive Built-In Container Materializer
Describe changes
I revamped materializers.built_in_materializer.py to add support for bytes, set, and non-JSON-serializable dict, list, and tuple objects.
The main implication is that you can now use arbitrary dicts/lists/sets/tuples of materializable data types in your steps without having to write a custom materializer for them.
E.g., the following data types can now be handled automatically:
- Dict[str, np.ndarray],
- Set[pd.DataFrame],
- Dict[str, List[torch.nn.Module]], if the PyTorch integration is installed,
- Dict[str, List[MyCustomClass]], if a materializer for MyCustomClass was defined,
- List[Union[Dict[str, Union[Tuple[int, float, str, bool], Dict[str, List[np.ndarray]], bytes]], Set[Union[float, int]]]],
- ...
Implementation Details
The original BuiltInMaterializer was split into three separate classes:
- BuiltInMaterializer now only handles bool, float, int, str; otherwise its behavior is unchanged (materialization via serialization to a JSON file).
- The bytes type is now handled by a separate BytesMaterializer since bytes is not JSON-serializable (this was broken before; it seems no one ever tried to use bytes artifacts in ZenML).
- list, dict, tuple and (new) set are now handled by a separate BuiltInContainerMaterializer.
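To illustrate why bytes needs its own materializer, here is a minimal standalone sketch. It does not use the ZenML API; the helper names (`is_json_serializable`, `save_bytes`, `load_bytes`, `roundtrip_bytes`) and the `data.bin` file name are assumptions made for illustration only.

```python
import json
import os
import tempfile


def is_json_serializable(obj: object) -> bool:
    """Return True if the standard json module can encode obj."""
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False


def save_bytes(data: bytes, path: str) -> None:
    """Write raw bytes to a file in binary mode (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(data)


def load_bytes(path: str) -> bytes:
    """Read raw bytes back from a file in binary mode (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read()


def roundtrip_bytes(data: bytes) -> bytes:
    """Save and reload data through a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        save_bytes(data, path)
        return load_bytes(path)
```

The JSON path rejects bytes with a TypeError, whereas a plain binary write and read preserves the value exactly.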
BuiltInContainerMaterializer works like this:
- If the given container (dict/list/set/tuple) is JSON-serializable, write it to JSON (as before).
- Otherwise, recursively materialize all elements in the container into a subdirectory by finding the corresponding materializer from the default_materializer_registry at runtime.
- Tuples and sets are cast to list before materialization (and back to the original type after loading).
- Non-serializable dicts are materialized as a list of lists [keys, values] (and reconstructed as dict after loading).
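The steps above can be sketched roughly as follows. This is a simplified standalone illustration, not the code from this PR: the type dispatch in `save_element`/`load_element` stands in for the lookup in default_materializer_registry, and the function names, the `data.json`/`metadata.json`/`data.bin` file names, and the numbered subdirectory layout are all assumptions.

```python
import json
import os
import tempfile
from typing import Any

DATA_FILE = "data.json"      # assumed name for JSON payloads
META_FILE = "metadata.json"  # assumed name for per-element type names
CONTAINERS = {"dict": dict, "list": list, "set": set, "tuple": tuple}


def _is_json_serializable(obj: Any) -> bool:
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False


def save_element(obj: Any, path: str) -> None:
    """Stand-in for the registry lookup: dispatch on the element type."""
    os.makedirs(path, exist_ok=True)
    if isinstance(obj, bytes):
        with open(os.path.join(path, "data.bin"), "wb") as f:
            f.write(obj)
    elif isinstance(obj, (dict, list, set, tuple)):
        save_container(obj, path)
    else:  # bool, float, int, str
        with open(os.path.join(path, DATA_FILE), "w") as f:
            json.dump(obj, f)


def load_element(type_name: str, path: str) -> Any:
    if type_name == "bytes":
        with open(os.path.join(path, "data.bin"), "rb") as f:
            return f.read()
    if type_name in CONTAINERS:
        return load_container(CONTAINERS[type_name], path)
    with open(os.path.join(path, DATA_FILE)) as f:
        return json.load(f)


def save_container(container: Any, path: str) -> None:
    os.makedirs(path, exist_ok=True)
    # Step 1: JSON-serializable containers go straight to a JSON file.
    if _is_json_serializable(container):
        with open(os.path.join(path, DATA_FILE), "w") as f:
            json.dump(container, f)
        return
    # Dicts become [keys, values]; tuples and sets are cast to list.
    if isinstance(container, dict):
        elements = [list(container.keys()), list(container.values())]
    else:
        elements = list(container)
    # Step 2: materialize each element into its own numbered subdirectory.
    type_names = []
    for i, element in enumerate(elements):
        save_element(element, os.path.join(path, str(i)))
        type_names.append(type(element).__name__)
    with open(os.path.join(path, META_FILE), "w") as f:
        json.dump(type_names, f)


def load_container(data_type: type, path: str) -> Any:
    data_path = os.path.join(path, DATA_FILE)
    if os.path.exists(data_path):
        with open(data_path) as f:
            return data_type(json.load(f))  # cast back, e.g. list -> tuple
    with open(os.path.join(path, META_FILE)) as f:
        type_names = json.load(f)
    elements = [
        load_element(name, os.path.join(path, str(i)))
        for i, name in enumerate(type_names)
    ]
    if data_type is dict:
        keys, values = elements
        return dict(zip(keys, values))
    return data_type(elements)


def roundtrip(obj: Any) -> Any:
    """Save obj to a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        save_container(obj, tmp)
        return load_container(type(obj), tmp)
```

For example, `roundtrip({"a": b"raw", "b": {1, 2}})` takes the recursive path because neither bytes nor set is JSON-serializable. Note that in this sketch the JSON shortcut is lossy for nested tuples and non-string dict keys, since JSON turns them into lists and strings respectively.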
I also added 9 unit tests covering the materialization of all supported built-in data types.
Potential Issues
- Materializers are strongly linked to artifacts. Thus, when creating a materializer for an element at runtime, we need to create a mock artifact in order to initialize a corresponding materializer. This might have unintended side effects since it goes deep into TFX-land. Maybe @htahir1 or @bcdurak might know whether this is problematic or not?
- When loading elements at runtime, the data type is found by iterating through all artifact types registered in the default_materializer_registry and checking whether any of them has a string representation similar to that of the element's type before materialization. This means loading will fail if any element had a type that is not explicitly registered in the registry and was instead materialized by a materializer linked to a superclass. However, this should very rarely happen in practice and is IMO a design flaw of the default_materializer_registry, since implicitly using the materializer of a superclass is already quite unpredictable in itself.
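The second issue can be made concrete with a small standalone sketch. It models the registry as a plain dict and is not the ZenML implementation: `REGISTRY`, `find_materializer_for`, `find_type_by_repr`, the `Base`/`Child` classes, the superclass fallback via the MRO, and the use of exact string equality for the comparison are all assumptions for illustration.

```python
from typing import Any, Dict, Optional


class Base:
    """Hypothetical type with an explicitly registered materializer."""


class Child(Base):
    """Hypothetical subclass that is NOT registered itself."""


# Toy registry mapping types to materializer names (assumed structure).
REGISTRY: Dict[type, str] = {
    int: "BuiltInMaterializer",
    bytes: "BytesMaterializer",
    Base: "BaseMaterializer",
}


def find_materializer_for(obj: Any) -> Optional[str]:
    """Saving: fall back to a superclass materializer by walking the MRO."""
    for cls in type(obj).__mro__:
        if cls in REGISTRY:
            return REGISTRY[cls]
    return None


def find_type_by_repr(type_repr: str) -> Optional[type]:
    """Loading: recover a type by comparing stored string representations."""
    for registered_type in REGISTRY:
        if str(registered_type) == type_repr:
            return registered_type
    return None
```

Under these assumptions, a `Child` instance can be saved through the `Base` materializer, but `find_type_by_repr(str(Child))` returns None because only `Base` appears in the registry, which is the failure mode described above.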
Pre-requisites
Please ensure you have done the following:
- [x] I have read the CONTRIBUTING.md document.
- [ ] If my change requires a change to docs, I have updated the documentation accordingly.
- [ ] If I have added an integration, I have updated the integrations table and the corresponding website section.
- [x] I have added tests to cover my changes.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Other (add details above)