tiled icon indicating copy to clipboard operation
tiled copied to clipboard

Teach serializer functions to write to persistent storage

Open danielballan opened this issue 4 years ago • 4 comments

The serializer functions under tiled/structures/*.py create an in-memory buffer and return (or, after https://github.com/bluesky/tiled/issues/70, maybe yield) bytes. For async transcoding, they need to learn how to (optionally) write to persistent storage instead. Here is one way we could do this:

def serialize_csv(df, metadata):
    ...

becomes

def serialize_csv(df, metadata, file=None):
    if file is None:
        # current behavior, returns bytes for an immediate HTTP response
    else:
        # writes bytes into file, potentially in chunks
        # so that the entire dataset is never in RAM at once
        file.write(...)

As a performance optimization, the file should also be allowed to be a filepath because some writers (like h5py) can go faster if they can pass a filepath down to the C layer rather than working with a buffer in the Python layer. I have a gut feeling we should also accept writeable buffers for file but I am having trouble specifically articulating why.

However, #70 complicates this because the serializers are allowed to sometimes be generators. Let's wait on #70 and revisit.

danielballan avatar Nov 04 '21 11:11 danielballan

To finish putting my thoughts on paper here:

Making these serializer functions play two different roles (return/yield bytes vs write bytes into file) is not my favorite thing. An alternative is to grow two registries: one for in-memory serializers and one for persistent serializers. At first, I don't love that either, for its extra conceptual complexity and the likelihood of code duplication between the two types of serializer. But, as I sit with it, I think it may be the lesser evil. Better to have two separate functions with well-defined input and return values than to have one functions with behavior that is hard to immediately follow or reason about.

danielballan avatar Nov 04 '21 12:11 danielballan

Elaborating on the above:

def serialize_csv_sync(df, metadata):
   ...
   yield chunk_of_bytes

async def serialize_csv_async(df, metadata, file):
    ...
    file.write(...)
    # return None

serialization_registry.register("array", "text/csv", serialize_csv_sync, serialize_csv_async)

danielballan avatar Nov 04 '21 20:11 danielballan

The tiled server is implementing a feature that each serialize method will need to be able to handle in order for Tiled to work. Making file an optional parameter in this case feel awkward. Completely separate serialize methods (serailize_csv_memory and serialize_csv_file) would be clearer to those implementing these methods but require more code, since in a great many cases these methods would do a lot of the same thing.

On the whole I think I prefer separate methods.

I like the register method in https://github.com/bluesky/tiled/issues/128#issuecomment-961398197... It really tells the implementer that they have to implement both to be compliant.

dylanmcreynolds avatar Nov 04 '21 20:11 dylanmcreynolds

Another point we should consider in this API design:

  • What about formats like Zarr and TileDB that span multiple files? This is another reason, along with performance optimization for h5py, that file should maybe be filepath.
  • What about blob storage someday?

danielballan avatar Nov 04 '21 20:11 danielballan