mleap
Is it possible to share data between multiple transformer instances?
Hello mleap experts,
I have built a custom transformer that maps a key to a vector using a Map. The map is not small (~100K), and the custom transformer is used multiple times in the same mleap pipeline. Each instance is serialized separately, which duplicates the underlying Map. Is it possible for multiple transformer instances to share the same underlying data, so that only one copy is stored in the bundle file and only one copy is held in memory?
Definitely you can have the transformer instances share the map state. E.g., store the map as part of a companion object.
Storing the map only once in the bundle file is trickier. I think it can be done but would be kind of ugly. Maybe something like adding and storing a "shouldWriteMap" parameter on the transformer, which you set to true on exactly one instance of the transformer in your pipeline. It might be easier to just store the map multiple times within the bundle.
Another option you could consider is to make your transformer multiple input/output, so that you only need to use the transformer once in your pipeline.
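To illustrate the multiple input/output option, here is a minimal sketch in plain Scala. MultiColumnLookup and its fields are hypothetical names, and the sketch does not use MLeap's actual transformer API; it only shows one table serving several columns, so the table exists once per pipeline.

```scala
// Hypothetical names throughout; this is not MLeap's Transformer API.
final case class MultiColumnLookup(
    table: Map[String, Vector[Double]], // the single key -> vector table
    default: Vector[Double],            // returned for keys missing from the table
    columns: Seq[(String, String)]      // (input column, output column) pairs
) {
  // Look up every configured input column of a row against the same table.
  def transform(row: Map[String, String]): Map[String, Vector[Double]] =
    columns.map { case (in, out) =>
      out -> row.get(in).flatMap(table.get).getOrElse(default)
    }.toMap
}

object MultiColumnLookupExample {
  val lookup = MultiColumnLookup(
    table = Map("a" -> Vector(1.0, 0.0), "b" -> Vector(0.0, 1.0)),
    default = Vector(0.0, 0.0),
    columns = Seq("user" -> "user_vec", "item" -> "item_vec")
  )

  // "a" is in the table; "zzz" is not, so it falls back to the default.
  val result = lookup.transform(Map("user" -> "a", "item" -> "zzz"))
}
```

Because there is a single instance, a single serialized copy of the table would be enough, at the cost of changing how existing pipelines wire the transformer in.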
Thanks @jsleight for the quick response!
The map is not small compared with the overall bundle file; most of the bundle's size comes from the duplicated maps.
Regarding "store the map as part of a companion object": this is a good idea that I can use to reduce the memory footprint. I could keep another map in the companion object that holds the different embeddings, exactly one copy of each, and when a duplicated instance is loaded, just point it to the existing entry in that object.
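A rough sketch of that companion-object registry in plain Scala follows. EmbeddingLookup, embeddingId, and the load parameter are hypothetical names chosen for illustration, not part of MLeap; the point is only that instances with the same id end up referencing the same in-memory table.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical transformer model; only the sharing pattern matters here.
final class EmbeddingLookup private (
    val embeddingId: String,
    val table: Map[String, Vector[Double]]
) {
  def apply(key: String): Option[Vector[Double]] = table.get(key)
}

object EmbeddingLookup {
  // One entry per distinct embedding, shared by every instance in the JVM.
  private val registry = TrieMap.empty[String, Map[String, Vector[Double]]]

  // `load` is by-name, so it is skipped when the id is already registered.
  // Under concurrent first access it may run more than once, but only one
  // result is kept in the registry.
  def apply(embeddingId: String)(load: => Map[String, Vector[Double]]): EmbeddingLookup =
    new EmbeddingLookup(embeddingId, registry.getOrElseUpdate(embeddingId, load))
}

object EmbeddingLookupExample {
  var loads = 0
  def loadTable(): Map[String, Vector[Double]] = {
    loads += 1 // count how often the table is actually materialized
    Map("k" -> Vector(0.5, 0.5))
  }

  val first  = EmbeddingLookup("user-embedding")(loadTable())
  val second = EmbeddingLookup("user-embedding")(loadTable())
}
```

One trade-off to keep in mind: entries in such a registry live as long as the companion object does, so tables are not released when a pipeline is unloaded unless some eviction is added.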
Adding a shouldWriteMap parameter would make the transformer a little harder for the ML team to use, since they would need to make sure the flag is set on exactly one instance. I am wondering whether it is possible to store the common data in the root and have multiple instances point to it, though this seems to break the mleap serialization design philosophy?
Updating the transformer to be multiple input/output is also a solution, but I would prefer to see whether I can change the serialization/deserialization internally, since there is already a lot of client code using the custom transformer.
I looked through the mleap code. It seems I can customize the serialization of a single transformer with store(), but I cannot customize the overall node serialization to add data-sharing logic.
Yeah to my knowledge mleap APIs don't really have a good mechanism for storing global state in the bundle. Though I wouldn't be opposed to adding such capabilities if you want to submit a PR.
Perhaps by adding new APIs for writeGlobal and readGlobal (or something like that) which the Ops can use. Probably we would need to rely on transformers providing unique keys in the global bundle namespace, but I think that should be acceptable.
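To make the idea concrete, here is a very rough sketch of what such an API could look like, written in plain Scala. Everything in it is hypothetical: MLeap has no GlobalStore, writeGlobal, or readGlobal today, and the in-memory store only stands in for a shared area at the bundle root.

```scala
import scala.collection.mutable

// Hypothetical API: none of these names exist in MLeap.
trait GlobalStore {
  // Write `bytes` under `key` unless the key is already present.
  def writeGlobal(key: String, bytes: => Array[Byte]): Unit
  def readGlobal(key: String): Option[Array[Byte]]
}

// In-memory stand-in for a shared area at the bundle root.
final class InMemoryGlobalStore extends GlobalStore {
  private val entries = mutable.Map.empty[String, Array[Byte]]
  var writes = 0 // number of payloads actually stored

  def writeGlobal(key: String, bytes: => Array[Byte]): Unit =
    if (!entries.contains(key)) {
      entries(key) = bytes
      writes += 1
    }

  def readGlobal(key: String): Option[Array[Byte]] = entries.get(key)
}

// What each node would keep in its own section: only the key.
final case class NodeModel(globalKey: String)

object GlobalStoreExample {
  val store   = new InMemoryGlobalStore
  val payload = "k1=0.1,0.2;k2=0.3,0.4".getBytes("UTF-8")

  // Three instances of the same transformer serialize themselves;
  // only the first call stores the payload.
  val nodes = Seq.fill(3) {
    store.writeGlobal("embeddings/user", payload)
    NodeModel("embeddings/user")
  }

  // On load, every node resolves its key against the shared store.
  val restored = nodes.map(n => store.readGlobal(n.globalKey))
}
```

With this shape, transformers would not need a per-instance flag; the unique key decides whether a payload has already been written, which is why key collisions between unrelated transformers would have to be avoided.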