
Provide support for streaming data to external files

Open drdavella opened this issue 6 years ago • 7 comments

As mentioned in #638, the documentation suggests that it should be possible to stream data to external files (see here):

An ASDF writer may stream a table to disk, when the size of the table is not known at the outset. Using exploded form simplifies this, since a standalone file containing a single table can be iteratively appended to without worrying about any blocks that may follow it.

However, this is not possible with the current API. There shouldn't be any changes necessary in the standard itself or in the file format, but we ought to provide an API for this. Also, from a reading perspective, this shouldn't be any different from the existing case of "exploded" arrays.

drdavella avatar Jan 16 '19 14:01 drdavella

A potentially important use case for this would be the ability to stream multiple arrays at once. The standard allows only a single streamed array in the main file, but with external files there's no reason we shouldn't be able to stream multiple arrays, potentially even in parallel.

drdavella avatar Jan 16 '19 14:01 drdavella

I have been trying to work my way around this, but with no success so far. My use case: I have a large gzipped CSV that I would like to split into multiple leaves of an ASDF tree. My idea was to use a single stream, write it to the file, then open a new Stream, and so on:

import smart_open
import asdf
import numpy as np

with smart_open.open('output/test_5.asdf', 'wb') as fout:
    asdf_file = asdf.AsdfFile()

    idx = 0  # must be initialized before first use (the original raised a NameError here)
    asdf_file.tree[idx] = asdf.Stream([7], dtype=np.float64)
    asdf_file.write_to(fout)
    while idx <= 5:
        data = np.random.random((7, 1000))
        fout.write(data.tobytes())  # append the raw bytes after the header

        idx += 1
        # add the next leaf and serialize the tree again
        asdf_file.tree[idx] = asdf.Stream([7], dtype=np.float64)
        asdf_file.write_to(fout)
        print(idx)

But it seems to keep all the data while overwriting the tree key each time: only the last key survives, and all the written rows end up in that single streamed array.

with asdf.open('output/test_5.asdf') as af:
    print(af.tree)
    print(af.tree[6].shape)

which returns

{'asdf_library': {'author': 'Space Telescope Science Institute', 'homepage': 'http://github.com/spacetelescope/asdf', 'name': 'asdf', 'version': '2.3.3'}, 'history': {'extensions': [<asdf.tags.core.ExtensionMetadata object at 0x7f94b757c9b0>]}, 6: <array (unloaded) shape: ['*', 7] dtype: float64>}

(6093, 7)

themmes avatar Aug 16 '19 16:08 themmes

Btw, I am using smart_open because the goal is to be using this with data from AWS S3

themmes avatar Aug 16 '19 16:08 themmes

I'll see if I can find time early next week to take a look at this. Thanks.

perrygreenfield avatar Aug 16 '19 18:08 perrygreenfield

@perrygreenfield Thanks for your fast reply, any form of streamed writing into splits would be of great help!

themmes avatar Aug 16 '19 18:08 themmes

I think I am missing some background on this issue. If you could clarify what you are trying to do in a little more detail, it would help me look into this. When you say you want to save parts of the csv file into different ASDF leaves, do you want to do this simultaneously (in parallel) or sequentially? If in parallel, do you know the size of each leaf before you start? It would be good to understand the constraints (e.g., the csv file is too large to fit uncompressed in memory) that require the multiple streams.

perrygreenfield avatar Aug 16 '19 19:08 perrygreenfield

  • The csv is gzipped and larger than memory (uncompressed it definitely is)
  • Because it is gzipped, I cannot write in parallel but must write sequentially

What I do not really understand is this: when opening an ASDF file, all leaves are lazily loaded, which invites larger-than-memory files. Yet beyond the single stream there seems to be no way of writing a larger-than-memory file, which puts the promise of the tree out of reach for many use cases.

themmes avatar Aug 16 '19 21:08 themmes