A schema for collections?

Open bmcfee opened this issue 9 years ago • 6 comments

Going back to this comment, we punted on the idea of managing extrinsic data (e.g., file paths) explicitly from within a JAMS object. Now that the dust has settled a bit on the JAMS schema, I'm wondering if we can come up with a better solution than sandboxing this stuff.

I bring this up because maintaining links between audio content and annotations is still kind of a pain, and I'd prefer to not solve it over and over again.

How do people feel about introducing an interface/schema for managing collections of JAMS files? At the most basic level, this would provide a simple index of audio content, JAMS content, and collection-level information. (It might also be useful to index which annotation namespaces are present in each JAMS file.) This kind of thing can spiral out of control easily, so if we do it, we should keep it tightly scoped.

bmcfee, Jun 09 '15 14:06

How's about a FileManager object that inherits from a dict or list, depending on whether or not key or integer-based indexing makes sense (I typically use, and prefer, key-based indexing so you're robust to shuffling / partitioning), and contains a FileCollection, consisting of fields which point to any number of file paths.

As an added bonus, we / the user could register different load / open methods for file types to get transparent (lazy) loading, e.g. "npz" -> np.load, "jams" -> jams.load, etc. For example...

fmgr = FileManager()
fmgr['my_song'] = FileCollection(
    audio='/path/to/my/song.wav', 
    annotation='/a/different/file.jams',
    features='/data/features/my_song.npz')

# Assuming 'npz' -> np.load by default
data = fmgr['my_song'].features.load()
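
A minimal sketch of how such a FileManager could work, under stated assumptions: FileRef, register, bind, and load_json are hypothetical names that are not part of JAMS, and a stdlib JSON loader stands in for np.load or jams.load so the example runs without extra dependencies.

```python
import json
import os


class FileRef:
    """A file path whose contents are only read when load() is called."""

    def __init__(self, path):
        self.path = path
        self.loaders = {}  # replaced by the manager's registry on bind()

    def load(self, **kwargs):
        # Pick the loader from the file extension, e.g. "song.json" -> "json".
        ext = os.path.splitext(self.path)[1].lstrip(".").lower()
        if ext not in self.loaders:
            raise ValueError(f"no loader registered for '.{ext}' files")
        return self.loaders[ext](self.path, **kwargs)


class FileCollection:
    """Named fields, each pointing at one file path."""

    def __init__(self, **paths):
        self.fields = {name: FileRef(path) for name, path in paths.items()}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        fields = self.__dict__.get("fields", {})
        if name in fields:
            return fields[name]
        raise AttributeError(name)

    def bind(self, loaders):
        for ref in self.fields.values():
            ref.loaders = loaders


class FileManager(dict):
    """Key-indexed collections sharing one extension -> loader registry."""

    def __init__(self):
        super().__init__()
        self.loaders = {}

    def register(self, extension, loader):
        self.loaders[extension.lower().lstrip(".")] = loader

    def __setitem__(self, key, collection):
        collection.bind(self.loaders)
        super().__setitem__(key, collection)


def load_json(path, **kwargs):
    # Stand-in loader; np.load or jams.load would be registered the same way.
    with open(path) as fh:
        return json.load(fh, **kwargs)


fmgr = FileManager()
fmgr.register("json", load_json)
fmgr["my_song"] = FileCollection(
    audio="/path/to/my/song.wav",
    metadata="/path/to/my/song.json")
```

Nothing is read from disk until load() is called on a field, and because every collection shares the manager's registry, a loader registered after a collection was added is still picked up.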

Additionally, if everything inherits from JObject, then this database-style object can be saved / loaded just as easily.

Thoughts?

ejhumphrey, Jul 14 '15 11:07

> How's about a FileManager object that inherits from a dict or list, depending on whether or not key or integer-based indexing makes sense (I typically use, and prefer, key-based indexing so you're robust to shuffling / partitioning)

I'd argue that int-based indexing never makes sense, unless the int is actually treated as a key (e.g., in GTZAN).

It may also be worth looking at something like ASDF for inspiration, since they have many of the same problems we do.

> As an added bonus, we / the user could register different load / open methods with filetypes for transparent (lazy) loading, e.g. "npz" -> np.load, "jams" -> jams.load, etc.

I like this idea, but transparent loading seems a little tricky to get exactly right. Ideally, I'd want to be able to clobber load arguments (such as audio sampling rate). This could be supported pretty easily by setting defaults on kwargs, but the resulting API may be kind of a mess.
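
One way the "defaults on kwargs" idea could look, as a rough sketch only: with_defaults is a hypothetical helper, and fake_audio_load is a stand-in for a real audio loader whose sr and mono parameters are assumed for illustration.

```python
import functools


def with_defaults(loader, **defaults):
    """Wrap a loader so registered defaults apply unless the caller overrides them."""

    @functools.wraps(loader)
    def wrapped(path, **overrides):
        # Later keys win, so call-time arguments clobber the defaults.
        kwargs = {**defaults, **overrides}
        return loader(path, **kwargs)

    return wrapped


def fake_audio_load(path, sr=22050, mono=True):
    # Stand-in for a real audio loader; it only echoes its arguments.
    return {"path": path, "sr": sr, "mono": mono}


# Register-time default: this collection wants 44.1 kHz audio.
load_audio = with_defaults(fake_audio_load, sr=44100)

default_call = load_audio("/path/to/my/song.wav")
override_call = load_audio("/path/to/my/song.wav", sr=8000)
```

The messy part is still visible here: every loader has its own keyword names, so the caller needs to know which loader sits behind a field before overriding anything.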

Maybe we should ponder on that a bit.

bmcfee, Jul 14 '15 13:07

Circling back on this after a bit of pondering.

 fmgr = FileManager()
 fmgr['my_song'] = FileCollection(
     audio='/path/to/my/song.wav', 
     annotation='/a/different/file.jams',
     features='/data/features/my_song.npz')

This looks exactly like a dataframe to me.

 # Assuming 'npz' -> np.load by default
 data = fmgr['my_song'].features.load()

How about something a little less objecty? I like your idea of having a dispatch object that can map a key (e.g., features) to a loader function (np.load). Why does that need to be attached to the object? We could just as easily construct the dispatcher as an object, and feed it a data frame where keys correspond to samples, and each column is a field that can be loaded via dispatch.

This way, we don't have to worry about schematizing the whole thing, and it becomes much easier to import data sets on the fly. (We can also tag along non-loadable fields at the same level, such as an artist id for split filtering.)
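
A sketch of the dispatcher idea, with assumptions stated up front: Dispatcher and read_text are hypothetical names, and a plain dict of dicts stands in for the data frame so the example needs no third-party packages. With pandas, the lookup would be `df.loc[key, column]` instead of `table[key][column]`.

```python
class Dispatcher:
    """Maps column names to loader functions (hypothetical helper)."""

    def __init__(self, **loaders):
        self.loaders = loaders

    def load(self, table, key, column, **kwargs):
        value = table[key][column]
        loader = self.loaders.get(column)
        if loader is None:
            # Non-loadable field (e.g. an artist id): return it untouched.
            return value
        return loader(value, **kwargs)


def read_text(path, encoding="utf-8"):
    # Stand-in loader; jams.load or np.load would be passed in the same way.
    with open(path, encoding=encoding) as fh:
        return fh.read()


# Rows are samples, columns are fields; the table itself carries no behavior.
index = {
    "my_song": {
        "audio": "/path/to/my/song.wav",
        "annotation": "/a/different/file.jams",
        "artist_id": "artist_0042",
    },
}

dispatch = Dispatcher(annotation=read_text)

# Plain columns pass straight through, so they can drive split filtering.
artist = dispatch.load(index, "my_song", "artist_id")
```

Since the table is ordinary tabular data, importing a data set only requires building rows; the loaders are supplied separately.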

bmcfee, Sep 14 '15 15:09

Punting this to #98

bmcfee, Feb 01 '16 16:02

Having thought on this for years at this point, I think the reasonable course of action here is as follows:

  1. Implement the unified schema refactor proposed in #178
  2. Expose the schema over the web with proper versioning and references.
  3. Any collection-level schema can be built using references to (2). Objects can be sharded and linked by uuids at the collection-level, but the objects themselves do not need to contain identifiers. This way, the schema can be backward-compatible.
  4. Provide a standard implementation / example schema for managing jams collections in mongodb (the famed jamongo) using the above.
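
To make step 3 concrete, here is an illustrative sketch, not a proposed final schema. The URL, the field names (entries, uuid, audio, jams), and make_entry are all hypothetical; the point is only that the collection schema references the published JAMS schema by $ref and that identifiers live on the collection entry, not inside the JAMS object.

```python
import json
import uuid

# Hypothetical location of a versioned, web-hosted JAMS schema (step 2).
JAMS_SCHEMA_URL = "https://example.org/schema/jams/1.0.0/jams.json"

# Collection-level schema: each entry wraps one JAMS object by reference.
COLLECTION_SCHEMA = {
    "type": "object",
    "properties": {
        "entries": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "uuid": {"type": "string"},
                    "audio": {"type": "string"},
                    "jams": {"$ref": JAMS_SCHEMA_URL},
                },
                "required": ["uuid", "jams"],
            },
        }
    },
    "required": ["entries"],
}


def make_entry(jam, audio=None):
    """Attach a collection-level uuid without modifying the JAMS object."""
    entry = {"uuid": str(uuid.uuid4()), "jams": jam}
    if audio is not None:
        entry["audio"] = audio
    return entry


jam = {"annotations": [], "file_metadata": {"duration": 30.0}, "sandbox": {}}
collection = {"entries": [make_entry(jam, audio="/path/to/my/song.wav")]}
serialized = json.dumps(collection)
```

Because the JAMS object is stored unchanged, files written before the collection schema existed remain valid against the JAMS schema.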

bmcfee, May 31 '18 15:05

> Provide a standard implementation / example schema for managing jams collections in mongodb (the famed jamongo) using the above.

Of course, it couldn't be that simple. MongoDB does not support $ref in JSON Schema (?!).
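
One possible workaround, which is not something settled in this thread, is to inline every $ref before handing the schema to the database. The sketch below assumes the referenced schemas have already been fetched into a dict keyed by reference string; inline_refs and store are hypothetical names.

```python
import copy


def inline_refs(node, store, _seen=()):
    """Return a copy of `node` with every {"$ref": ...} replaced by its target.

    `store` maps reference strings to already-fetched schema dicts.
    """
    if isinstance(node, dict):
        if "$ref" in node:
            ref = node["$ref"]
            if ref in _seen:
                # Recursive schemas cannot be flattened this way.
                raise ValueError(f"cyclic $ref: {ref}")
            target = copy.deepcopy(store[ref])
            return inline_refs(target, store, _seen + (ref,))
        return {key: inline_refs(value, store, _seen) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_refs(value, store, _seen) for value in node]
    return node


store = {"jams.json": {"type": "object", "required": ["annotations"]}}
schema = {"properties": {"jams": {"$ref": "jams.json"}}}
flat = inline_refs(schema, store)
```

The flattened schema loses the link back to the versioned source, so the version would need to be recorded separately, and recursive schemas cannot be inlined at all.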

bmcfee, Jun 05 '18 13:06