
Customize paths of external blocks

Open rossant opened this issue 8 years ago • 11 comments

Is it possible to customize the filenames and subdirectories where the exploded block files are saved?

rossant avatar Jul 22 '15 16:07 rossant

Not presently with the explode command, but there's nothing about the file format that would prevent it.

Can you describe in more detail what you'd like to do?

mdboom avatar Jul 22 '15 16:07 mdboom

I wonder whether, for the sake of consistency/sanity, the ASDF standard shouldn't specify a default naming scheme for the files produced by "exploding" a file, while giving libraries the option to use a different scheme (left up to the implementation) if requested by the user.

embray avatar Jul 22 '15 18:07 embray

> while giving libraries the option to use a different scheme (left up to the implementation) if requested by the user.

Do you intend to do that in pyasdf?

rossant avatar Jul 22 '15 18:07 rossant

Would the specification of a destination pattern be enough? For example:

some_directory/{source}_{block_no}.asdf

where {source} is replaced with the original root filename, and {block_no} is replaced with the block number?

By this convention, the current behavior would be defined as {source}{block_no}.asdf.
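
To make the proposal concrete, here is a minimal sketch of how such a pattern could be expanded. It is not part of pyasdf: the function name `block_path` and its arguments are hypothetical, and it assumes the only placeholders are `{source}` and `{block_no}` as described above.

```python
from pathlib import Path


def block_path(pattern: str, source: str, block_no: int) -> Path:
    """Expand a destination pattern for one exploded block.

    Hypothetical helper: `pattern` may contain the two placeholders
    proposed in this thread, {source} and {block_no}.
    """
    # Use the root filename without directory or extension,
    # e.g. "data/obs.asdf" -> "obs".
    root = Path(source).stem
    return Path(pattern.format(source=root, block_no=block_no))


# A user-specified pattern that includes a directory.
custom = [
    block_path("some_directory/{source}_{block_no}.asdf", "data/obs.asdf", i)
    for i in range(3)
]

# The pattern that would describe the current behavior.
default = block_path("{source}{block_no}.asdf", "data/obs.asdf", 0)
```

Because the sketch relies on Python's `str.format`, a pattern such as `{source}_{block_no:04d}.asdf` would also zero-pad the block number without any extra code.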

mdboom avatar Jul 22 '15 18:07 mdboom

That's sort of what I was thinking too. If just a directory destination is given it could use the default pattern. But allowing a user-specified pattern (including the directory) would work too.

embray avatar Jul 22 '15 18:07 embray

Actually, in our case it would be more complicated, since we'd want to use a subdirectory structure based on the hierarchy in the tree.

rossant avatar Jul 22 '15 18:07 rossant

Can you describe your use case in more detail? I think that may break down if data in a block is shared between multiple arrays in the tree.
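
The concern about shared blocks can be illustrated with a small sketch. It assumes a simplified tree made of nested dicts in which an array leaf is a dict with an integer `source` key naming its block; the function names are hypothetical and this is not how pyasdf represents arrays internally.

```python
from collections import defaultdict


def block_users(tree, prefix=()):
    """Yield (path, block_no) for every array leaf in a nested dict.

    Assumption: an array leaf is a dict with an integer "source" key.
    """
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict) and isinstance(value.get("source"), int):
            yield "/".join(path), value["source"]
        elif isinstance(value, dict):
            # Not a leaf: descend into the sub-tree.
            yield from block_users(value, path)


def shared_blocks(tree):
    """Return {block_no: [paths]} for blocks used by more than one array."""
    users = defaultdict(list)
    for path, block_no in block_users(tree):
        users[block_no].append(path)
    return {block: paths for block, paths in users.items() if len(paths) > 1}


tree = {
    "spikes": {"spike_times": {"source": 0}, "clusters": {"source": 1}},
    "raw": {"view_of_spike_times": {"source": 0}},
}
conflicts = shared_blocks(tree)
```

In this example block 0 is referenced from both `spikes/spike_times` and `raw/view_of_spike_times`, so a scheme that derives one file path per block from the tree hierarchy would have to pick one of the two locations or duplicate the data.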

mdboom avatar Jul 22 '15 18:07 mdboom

I think writing out individual child-objects in a hierarchical data structure is a different use case than what exploded form is for.

embray avatar Jul 22 '15 19:07 embray

To make a FITS analogy, exploded form is (somewhat) like writing the FITS header and the binary data to separate files. Whereas I think what @rossant is asking is more akin to writing each HDU to a separate file (albeit with a directory structure representing hierarchy that doesn't exist in FITS, but may in ASDF). That may be a little too application specific, but sounds worth talking about.

embray avatar Jul 22 '15 19:07 embray

Long story short, we're looking for a format for neurophysiology data that enables easy discovery of key data arrays. For a given dataset, we have a hierarchy of data arrays, but only 1 or 2 are used by 95% of our users. Having explicit names for the files would let a typical user find these important arrays easily.

Here's an example. You're a typical user, you have a dataset, and you don't know anything about the format. You see a subdirectory named spike_times containing a binary array and a metadata JSON file with the array's information (dtype, shape, etc.). Then you should be able to open that array with no difficulty in any programming language (typically MATLAB, which is still one of the dominant languages in the community...)
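
As a rough sketch of the layout described above (a directory holding a flat binary array plus a JSON metadata file), the following uses only the Python standard library. The file names `meta.json` and `data.bin`, the metadata keys, and the dtype table are assumptions made for illustration, not an existing format.

```python
import json
import struct
import tempfile
from pathlib import Path

# Assumed mapping from a dtype name in the metadata to a struct format code.
STRUCT_CODES = {"float64": "d", "float32": "f", "int32": "i", "int64": "q"}


def write_array(directory: Path, values, dtype: str = "float64") -> None:
    """Write a 1-D array as a raw little-endian file plus JSON metadata."""
    directory.mkdir(parents=True, exist_ok=True)
    meta = {"dtype": dtype, "shape": [len(values)], "byteorder": "little"}
    (directory / "meta.json").write_text(json.dumps(meta))
    fmt = f"<{len(values)}{STRUCT_CODES[dtype]}"
    (directory / "data.bin").write_bytes(struct.pack(fmt, *values))


def read_array(directory: Path):
    """Read the array back as a flat list, guided only by the metadata."""
    meta = json.loads((directory / "meta.json").read_text())
    count = 1
    for n in meta["shape"]:
        count *= n
    order = "<" if meta["byteorder"] == "little" else ">"
    fmt = f"{order}{count}{STRUCT_CODES[meta['dtype']]}"
    return list(struct.unpack(fmt, (directory / "data.bin").read_bytes()))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "spike_times"
    write_array(target, [0.5, 1.25, 3.0])
    restored = read_array(target)
```

The reader needs nothing beyond a JSON parser and the ability to read raw bytes, which is the property being asked for here.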

So far we've been using HDF5, but we're having way too many problems. Accessibility is bad; you need an HDF5 library in order to see what's in a file, whereas a text metadata file can be viewed by anyone, and a flat binary file can be opened easily in any language.

We were about to create our own custom format, but then we discovered ASDF, which is pretty close to what we need. The two main differences are the directory structure and YAML, which seems basically unsupported in MATLAB.

rossant avatar Jul 22 '15 21:07 rossant

I had a quick look around and found at least a couple of YAML interfaces for MATLAB that use LibYAML wrapped in a MEX binary. But I'm guessing your point is that MATLAB has JSON support out of the box (I don't know)?

That said, I think with a YAML interface that a rudimentary ASDF reader in MATLAB could be achieved pretty easily. We also have plans for a C implementation of ASDF on the horizon, which could be added to MATLAB via the same approach.

Getting back to your specific use case though, it does make a lot of sense. However, even in the "exploded" form the individual binary blocks have a block header (about 40 bytes, I think), so your user would still have to know at least enough to skip past that header to reach the array.
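
A minimal sketch of skipping that header, using only the Python standard library. The field layout below (a 4-byte magic token, a 2-byte big-endian `header_size`, then flags, compression, three sizes and a checksum) follows my reading of the block header in the ASDF standard and should be checked against the version of the standard in use. The function name `locate_block_data` is hypothetical, and the sketch reads the header size from the file instead of hard-coding it.

```python
import struct

BLOCK_MAGIC = b"\xd3BLK"  # block magic token


def locate_block_data(raw: bytes):
    """Return (data_offset, data_size, compression) for the first block.

    Simplification: the block is found by searching for the magic token,
    which assumes everything before it is plain text.
    """
    start = raw.find(BLOCK_MAGIC)
    if start < 0:
        raise ValueError("no block magic token found")
    # A 2-byte big-endian header_size follows the 4-byte magic token.
    (header_size,) = struct.unpack_from(">H", raw, start + 4)
    # flags, compression, allocated_size, used_size, data_size, checksum
    _flags, compression, _allocated, _used, data_size, _checksum = (
        struct.unpack_from(">I4sQQQ16s", raw, start + 6)
    )
    # The data begins after the magic, header_size field and header body.
    data_offset = start + 6 + header_size
    return data_offset, data_size, compression


# Build a synthetic single-block file to exercise the function.
payload = struct.pack("<3d", 1.0, 2.0, 3.0)
header = struct.pack(
    ">I4sQQQ16s", 0, b"\0\0\0\0",
    len(payload), len(payload), len(payload), b"\0" * 16,
)
raw = b"#ASDF 1.0.0\n" + BLOCK_MAGIC + struct.pack(">H", len(header)) + header + payload
offset, size, compression = locate_block_data(raw)
values = struct.unpack("<3d", raw[offset:offset + size])
```

This only covers uncompressed blocks, which here means an all-zero `compression` field; a compressed block would need to be decompressed before the bytes can be interpreted as an array.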

The "exploded form" was not really meant for this case. I think (and @mdboom can expand) it is more of a performance trick. For example, if an application has to stream some data to the end of a table that's embedded in an ASDF file, it can first "explode" the file so that the binary block containing the table is in a file by itself and can be streamed to directly without having to shift around the rest of the file. Once the writing is done, the full file can be reassembled. There is also a kind of "streaming" block for this use case, which carries with it the restriction that no other blocks can follow it in the file.

That said, there might be a case for including simple instructions somewhere for manually reading the array description in an ASDF header and translating that into reading the array from the binary block. What do you think? It would be great to get the neuroscience community using ASDF; we have them to thank for matplotlib too, by way of John Hunter :)

embray avatar Jul 23 '15 14:07 embray