prefect
Allow results to be cached separately from their metadata
First check
- [X] I added a descriptive title to this issue.
- [X] I used the GitHub search to find a similar request and didn't find it.
- [X] I searched the Prefect documentation for this feature.
Prefect Version
2.x
Describe the current behavior
Recent versions made it possible to control the caching behavior. However, it seems one cannot fully control how the cache is written. When a task is to be cached, we can give it a WritableFileSystem object to control where the results are written. On that object, the method:
async def write_path(self, path: str, content: bytes) -> str:
is then used by Prefect to write the results. The content is the serialized data from PersistedResultBlob, which contains the serializer information, the data, and the Prefect version. All of this is condensed into one blob, which is then written as a single file, for instance:
{"serializer": {"type": "pickle", "picklelib": "cloudpickle", "picklelib_version": "2.2.0"}, "data": "..."}
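To make the limitation concrete, here is a minimal stand-in sketch, not Prefect's actual implementation. `InMemoryFileSystem` and `make_blob` are hypothetical names that only mimic the `write_path` signature and the blob layout shown above, and the assumption that `data` holds a base64-encoded pickle is mine:

```python
import asyncio
import base64
import json
import pickle


class InMemoryFileSystem:
    """Hypothetical stand-in that mimics the write_path signature above."""

    def __init__(self) -> None:
        self.files = {}

    async def write_path(self, path: str, content: bytes) -> str:
        # The storage layer only sees opaque bytes; it cannot tell
        # which part is metadata and which part is the result.
        self.files[path] = content
        return path


def make_blob(result: object) -> bytes:
    # Assumed layout, modelled on the example blob above: serializer
    # metadata and the encoded payload are fused into one JSON document.
    payload = base64.b64encode(pickle.dumps(result)).decode("ascii")
    blob = {"serializer": {"type": "pickle"}, "data": payload}
    return json.dumps(blob).encode("utf-8")


fs = InMemoryFileSystem()
asyncio.run(fs.write_path("result-1", make_blob({"answer": 42})))

# Recovering the metadata means parsing the whole blob again.
stored = json.loads(fs.files["result-1"])
print(sorted(stored))
```

A writer that wants the metadata in a separate place has to reverse-engineer this blob inside `write_path`, which is the coupling this request is trying to avoid.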
Describe the proposed behavior
Within the system described above, it is very hard to create a custom storage method that stores the metadata separately from the content. I propose that the WritableFileSystem method:
async def write_path(self, path: str, content: bytes) -> str:
is provided with slightly more detail about what it is storing. Perhaps something like this:
async def write_path(self, path: str, content: PersistedResultBlob) -> str:
That is, it gets the persisted result blob itself and handles the serialization on its own. Or something similar; please choose a design that suits your library better.
In essence, I think the actual file writers (here, the WritableFileSystem) need to know a bit more about the results to be able to separate the metadata and the content.
Example Use
The new behavior would allow more control over the cache system. This could be useful for:
- storing the results in a SQL database, where the metadata is in a different column
- storing the actual file content in a presentable file and the metadata in a JSON file next to it
- storing the results on a file system and the metadata in a database
- ...
For example, suppose my pipeline generates images. I want to both cache these and be able to view them. In the current system I would have to write them twice, once using the caching mechanism, once using a custom writer. Using my proposal it would become possible to serialize the images as a PNG and the metadata as a JSON next to it. My custom "WritableFileSystem" would connect the two files.
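As a rough sketch of what the proposal could enable (not an existing Prefect API): `ResultBlob` is a hypothetical stand-in for PersistedResultBlob with the metadata and payload kept apart, and `SidecarFileSystem` is a hypothetical writer that stores them as two files. The file I/O is blocking, which is fine for a sketch but not for a real async block.

```python
import asyncio
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ResultBlob:
    """Hypothetical structured blob: metadata and payload kept apart."""

    metadata: dict
    payload: bytes


class SidecarFileSystem:
    """Hypothetical writer that stores payload and metadata as two files."""

    def __init__(self, basepath: Path) -> None:
        self.basepath = basepath

    async def write_path(self, path: str, content: ResultBlob) -> str:
        target = self.basepath / path
        target.parent.mkdir(parents=True, exist_ok=True)
        # The payload goes out untouched, so an image stays a viewable image.
        target.write_bytes(content.payload)
        # The metadata lands in a JSON file next to it.
        sidecar = target.with_name(target.name + ".meta.json")
        sidecar.write_text(json.dumps(content.metadata))
        return str(target)


basepath = Path(tempfile.mkdtemp())
blob = ResultBlob(
    metadata={"serializer": {"type": "png"}},
    payload=b"\x89PNG\r\n\x1a\n",  # PNG signature only, as placeholder bytes
)
written = asyncio.run(
    SidecarFileSystem(basepath).write_path("plots/figure.png", blob)
)
```

With this shape, `plots/figure.png` can be opened directly while `plots/figure.png.meta.json` carries what the cache needs to read it back.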
Additional context
Thank you for making this library. It is becoming more useful by the day.
This is a continuation of my previous ticket / comment https://github.com/PrefectHQ/prefect/issues/6397#issuecomment-1217607113 . I am hoping the Prefect library can become flexible enough to allow for varying use cases.
Any news on this?
@robbert-harms the file system objects must remain naive to the data they are receiving. Their focus is reading and writing bytes, nothing more. We intend to address this use-case with the ability to set a custom result class for your flow/task. This would give you full control over the data it writes and reads and it could certainly write to multiple locations.
Hi @madkinsz, thank you for your swift reply. I agree, that is a cleaner solution. That is basically how I handle it now in my private library. Thank you for your hard work.
You're welcome! Sorry for the brief response, lots of triage to do this morning :)
This will probably not come until a bit later, we've got a lot of work planned through the end of the year.
Hi @robbert-harms, I am also working with image data and am starting to look into a temporary workaround until this is fully supported by Prefect. Could I have a look at your current workaround? Or maybe you could give me a few pointers.
Hi Tibuch,
My solution is to "double wrap" everything. Every pipeline function is wrapped with a Prefect Task and a Memoize function. When the data is computed, it is first wrapped in a Result object from my personal library and this in turn is wrapped by Prefect. This is not ideal as everything needs double handling.
A slightly improved solution would be to subclass both the Prefect Task and the Prefect Results and have the tasks return custom result objects. This would be the "inheritance" solution, but I figured the Prefect team could do this more cleanly.
I would wait for the Prefect team to allow specifying a custom result class.
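For readers wondering what the "double wrap" looks like, here is a rough sketch of the inner layer only. `Result` and `memoize` are hypothetical stand-ins for the private library mentioned above, not its actual code, and the outer Prefect task wrapper is left out so the example runs without Prefect installed:

```python
import functools


class Result:
    """Hypothetical stand-in for a result object from a private library."""

    def __init__(self, value, metadata):
        self.value = value
        self.metadata = metadata


def memoize(func):
    """Inner wrapper: caches Result objects keyed by positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            # Wrap the computed value together with its own metadata.
            cache[args] = Result(func(*args), {"function": func.__name__})
        return cache[args]

    return wrapper


calls = []


@memoize
def render(size):
    calls.append(size)
    return "x" * size


# In the setup described above, `render` would additionally be wrapped
# in a Prefect task, which then persists the Result a second time.
first = render(3)
second = render(3)
```

The second call returns the cached `Result` without recomputing, which is where the double handling comes from once Prefect adds its own persistence on top.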
Thanks @robbert-harms for the pointers.
I have now developed something that hooks into the serializer and takes care of my custom results: https://github.com/fmi-faim/custom-prefect-result
But I would like to see a native Prefect solution :innocent:
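One possible shape of such a serializer hook, as a sketch only. This is not the code from the linked repository; `FileReferenceSerializer` is a hypothetical class, and it assumes the serializer interface boils down to a `dumps`/`loads` pair over bytes:

```python
import json
import tempfile
from pathlib import Path


class FileReferenceSerializer:
    """Hypothetical serializer exposing a dumps/loads pair."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self._counter = 0

    def dumps(self, obj: bytes) -> bytes:
        # Write the payload itself to a standalone, viewable file ...
        self._counter += 1
        target = self.directory / f"result-{self._counter}.png"
        target.write_bytes(obj)
        # ... and hand back only a small JSON reference to be cached.
        return json.dumps({"path": str(target)}).encode("utf-8")

    def loads(self, blob: bytes) -> bytes:
        # Follow the reference back to the file on disk.
        reference = json.loads(blob.decode("utf-8"))
        return Path(reference["path"]).read_bytes()


serializer = FileReferenceSerializer(Path(tempfile.mkdtemp()))
image_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a full image
reference = serializer.dumps(image_bytes)
restored = serializer.loads(reference)
```

The storage block stays naive about what it writes, since it only ever sees the small reference, while the image itself lives as an ordinary file.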
I was wondering if the ideas mentioned in https://github.com/PrefectHQ/prefect/issues/7257#issuecomment-1330833333 are still planned or actively being worked on - specifically, allowing a custom result class for your flow/task? If not, is this something you would accept from a community contribution? I assume that more discussion would be required, but I just wanted to see if there was any interest.