
Implement Recursive Built-In Container Materializer


Describe changes

I revamped materializers/built_in_materializer.py to add support for bytes, set, and non-JSON-serializable dict, list, and tuple objects.

The main implication of this is that you can now use arbitrary dicts/lists/sets/tuples of materializable data types in your steps without having to write a custom materializer for them.

E.g., the following data types can now be handled automatically:

  • Dict[str, np.ndarray],
  • Set[pd.DataFrame],
  • Dict[str, List[torch.nn.Module]], if the PyTorch integration is installed,
  • Dict[str, List[MyCustomClass]], if a materializer for MyCustomClass was defined,
  • List[Union[Dict[str, Union[Tuple[int, float, str, bool], Dict[str, List[np.ndarray]], bytes]], Set[Union[float, int]]]],
  • ...

Implementation Details

The original BuiltInMaterializer was split into three separate classes:

  • BuiltInMaterializer now only handles bool, float, int, and str; otherwise its behavior is unchanged (materialization via serialization to a JSON file),
  • the bytes type is now handled by a separate BytesMaterializer, since bytes is not JSON-serializable (this was broken before; it seems no one ever tried to use bytes artifacts in ZenML),
  • list, dict, tuple, and (new) set are now handled by a separate BuiltInContainerMaterializer.
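The split can be pictured as a simple type-to-materializer dispatch. This is a schematic sketch only: the class names match the PR, but the dispatch mechanics are simplified and are not ZenML's actual registry implementation.

```python
# Schematic sketch of the materializer split described above.
# The class names match the PR; the lookup itself is a simplified
# stand-in for ZenML's materializer registry.

DISPATCH = {
    bool: "BuiltInMaterializer",
    float: "BuiltInMaterializer",
    int: "BuiltInMaterializer",
    str: "BuiltInMaterializer",
    bytes: "BytesMaterializer",
    dict: "BuiltInContainerMaterializer",
    list: "BuiltInContainerMaterializer",
    set: "BuiltInContainerMaterializer",
    tuple: "BuiltInContainerMaterializer",
}


def materializer_for(obj):
    """Return the (hypothetical) materializer name for a built-in value."""
    return DISPATCH[type(obj)]
```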

BuiltInContainerMaterializer works like this:

  • If the given container (dict/list/set/tuple) is JSON-serializable, write it to a JSON file (as before).
  • Otherwise, recursively materialize all elements in the container into a subdirectory by finding the corresponding materializer from the default_materializer_registry at runtime.
  • Tuples and sets are cast to list before materialization (and back to the original type after loading).
  • Non-serializable dicts are materialized as a list of lists [keys, values] (and reconstructed as dict after loading).
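The save/load strategy above can be sketched in plain Python. This is an illustrative stand-in, not the actual implementation: a plain dict replaces the artifact store, and the per-element dispatch through default_materializer_registry is simplified to direct recursion.

```python
import json

# Illustrative sketch of the container materialization strategy described
# above. A dict stands in for the artifact store; real ZenML writes files
# and dispatches each element to its registered materializer.

def _is_json_serializable(obj):
    """Return True if obj can be dumped to JSON as-is."""
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False

def save(obj, path, store):
    """Recursively 'materialize' a built-in container into the store."""
    if isinstance(obj, (tuple, set)):
        # Tuples and sets are cast to list; the original type is remembered.
        store[path + "/type"] = type(obj).__name__
        obj = list(obj)
    elif isinstance(obj, dict) and not _is_json_serializable(obj):
        # Non-serializable dicts become a list of lists [keys, values].
        store[path + "/type"] = "dict"
        obj = [list(obj.keys()), list(obj.values())]
    if _is_json_serializable(obj):
        store[path + "/data.json"] = json.dumps(obj)
    else:
        # Otherwise, materialize every element into its own subdirectory.
        store[path + "/len"] = len(obj)
        for i, element in enumerate(obj):
            save(element, f"{path}/{i}", store)

def load(path, store):
    """Reverse of save: rebuild the container, restoring original types."""
    if path + "/data.json" in store:
        obj = json.loads(store[path + "/data.json"])
    else:
        obj = [load(f"{path}/{i}", store) for i in range(store[path + "/len"])]
    kind = store.get(path + "/type")
    if kind == "tuple":
        return tuple(obj)
    if kind == "set":
        return set(obj)
    if kind == "dict":
        return dict(zip(obj[0], obj[1]))
    return obj
```

For example, a dict like {"a": (1, 2), "b": {3, 4}} is not JSON-serializable (sets are not), so it round-trips through the recursive path, while [1, "x"] takes the JSON fast path.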

I also added 9 unit tests covering the materialization of all supported built-in data types.

Potential Issues

  • Materializers are strongly linked to artifacts. Thus, when creating a materializer for an element at runtime, we need to create a mock artifact in order to initialize the corresponding materializer. This might have unintended side effects, since it goes deep into TFX-land. Maybe @htahir1 or @bcdurak know whether this is problematic?
  • When loading elements at runtime, the data type is found by iterating through all artifact types registered in the default_materializer_registry and checking whether any of them has the same string representation as the element's type had before materialization. This means loading will fail if an element's type was not explicitly registered in the registry and was instead materialized by a materializer linked to a superclass. However, this should very rarely happen in practice, and it is IMO a design flaw of the default_materializer_registry, since implicitly using the materializer of a superclass is already quite unpredictable in itself.
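The lookup pitfall can be demonstrated with a minimal sketch. The registry contents and function name here are illustrative, not ZenML's actual ones; the point is only the string-comparison failure mode for unregistered subclasses.

```python
# Sketch of the type lookup described above: on load, the element's stored
# type string is matched against the types in a (here: illustrative) registry.

REGISTRY = (bool, bytes, dict, float, int, list, set, str, tuple)


def find_registered_type(type_str):
    """Find a registered type whose string representation matches."""
    for registered in REGISTRY:
        if str(registered) == type_str:
            return registered
    raise KeyError(f"No registered type matches {type_str!r}")


class MyList(list):
    """A subclass that a registry might materialize via list's materializer,
    but whose own type string is not registered, so loading it would fail."""
```

Here find_registered_type(str(int)) succeeds, while find_registered_type(str(MyList)) raises KeyError, mirroring the failure mode described above.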

Pre-requisites

Please ensure you have done the following:

  • [x] I have read the CONTRIBUTING.md document.
  • [ ] If my change requires a change to docs, I have updated the documentation accordingly.
  • [ ] If I have added an integration, I have updated the integrations table and the corresponding website section.
  • [x] I have added tests to cover my changes.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Other (add details above)

fa9r avatar Jul 29 '22 10:07 fa9r