Teach serializer functions to write to persistent storage
The serializer functions under tiled/structures/*.py create an in-memory buffer and return (or, after https://github.com/bluesky/tiled/issues/70, maybe yield) bytes. For async transcoding, they need to learn how to (optionally) write to persistent storage instead. Here is one way we could do this:
def serialize_csv(df, metadata):
...
becomes
def serialize_csv(df, metadata, file=None):
if file is None:
# current behavior, returns bytes for an immediate HTTP response
else:
# writes bytes into file, potentially in chunks
# so that the entire dataset is never in RAM at once
file.write(...)
As a performance optimization, the file should also be allowed to be a filepath because some writers (like h5py) can go faster if they can pass a filepath down to the C layer rather than working with a buffer in the Python layer. I have a gut feeling we should also accept writeable buffers for file but I am having trouble specifically articulating why.
However, #70 complicates this because the serializers are allowed to sometimes be generators. Let's wait on #70 and revisit.
To finish putting my thoughts on paper here:
Making these serializer functions play two different roles (return/yield bytes vs write bytes into file) is not my favorite thing. An alternative is to grow two registries: one for in-memory serializers and one for persistent serializers. At first, I don't love that either, for its extra conceptual complexity and the likelihood of code duplication between the two types of serializer. But, as I sit with it, I think it may be the lesser evil. Better to have two separate functions with well-defined input and return values than to have one functions with behavior that is hard to immediately follow or reason about.
Elaborating on the above:
def serialize_csv_sync(df, metadata):
...
yield chunk_of_bytes
async def serialize_csv_async(df, metadata, file):
...
file.write(...)
# return None
serialization_registry.register("array", "text/csv", serialize_csv_sync, serialize_csv_async)
The tiled server is implementing a feature that each serialize method will need to be able to handle in order for Tiled to work. Making file an optional parameter in this case feel awkward. Completely separate serialize methods (serailize_csv_memory and serialize_csv_file) would be clearer to those implementing these methods but require more code, since in a great many cases these methods would do a lot of the same thing.
On the whole I think I prefer separate methods.
I like the register method in https://github.com/bluesky/tiled/issues/128#issuecomment-961398197... It really tells the implementer that they have to implement both to be compliant.
Another point we should consider in this API design:
- What about formats like Zarr and TileDB that span multiple files? This is another reason, along with performance optimization for
h5py, thatfileshould maybe befilepath. - What about blob storage someday?