DataSets.jl
Data Layers
Data layers allow data of different formats to be mapped into a program through a decoder and presented with a uniform API such that the main program logic can avoid dealing with data format decoding. Instead, the data format can be defined in the Data.toml.
A challenge here is dealing with world age issues which come up from dynamically require-ing Julia packages. For now, we include a bit of judicious Base.invokelatest to make things "just work" in the REPL, but also warn the user that they should add a top-level import.
With this patch and the Data.toml from the tests, we can open several tabular data formats, without the user needing to know much about the data storage.
Here's an example of loading data in .tsv, gzip-compressed .csv, and .arrow formats (any of which could then be converted to a DataFrame thanks to the Tables.jl interface):
julia> @! open(dataset("table_tsv"))
┌ Warning: The package CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b] is required to load your dataset. DataSets will import this module for you, but this may not always work as
│ expected.
│
│ To silence this message, add import CSV at the top of your code somewhere.
└ @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
2-element CSV.File{false}:
CSV.Row: (Name = "Aaron", Age = 23)
CSV.Row: (Name = "Harry", Age = 42)
julia> @! open(dataset("table_gzip"))
┌ Warning: The package CodecZlib [944b1d66-785c-5afd-91f1-9de20f533193] is required to load your dataset. DataSets will import this module for you, but this may not always
│ work as expected.
│
│ To silence this message, add import CodecZlib at the top of your code somewhere.
└ @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
2-element CSV.File{false}:
CSV.Row: (Name = "Aaron", Age = 23)
CSV.Row: (Name = "Harry", Age = 42)
julia> @! open(dataset("table_arrow"))
┌ Warning: The package Arrow [69666777-d1a9-59fb-9406-91d4454c9d45] is required to load your dataset. DataSets will import this module for you, but this may not always work
│ as expected.
│
│ To silence this message, add import Arrow at the top of your code somewhere.
└ @ DataSets /home/chris/.julia/dev/DataSets/src/layers.jl:32
Arrow.Table: (Name = ["Aaron", "Harry"], Age = [23, 42])
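As mentioned above, any of these tables can be consumed through the Tables.jl interface. Here's a rough sketch of materialising one of them as a DataFrame (assuming the test Data.toml is active and DataFrames is installed; the @! form is the same one used in the REPL session above):

using DataSets, DataFrames
using ResourceContexts   # provides the @! macro
import CSV               # top-level import silences the dynamic-require warning above

table = @! open(dataset("table_tsv"))   # CSV.File, decoded via the csv layer
df = DataFrame(table)                   # Tables.jl handles the conversion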
Excerpt from Data.toml, showing the configuration required for the system to understand these various formats:
[[datasets]]
description="Simple TSV example"
name="table_tsv"
uuid="efde65c3-a898-4ba9-97c1-45dba64b8d46"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.tsv"

    [[datasets.layers]]
    type = "csv"

    [datasets.layers.parameters]
    delim="\t"

[[datasets]]
description="Gzipped CSV example"
name="table_gzip"
uuid="2d126588-5f76-4e53-8245-87dc91625bf4"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.csv.gz"

    [[datasets.layers]]
    type = "gzip"

    [[datasets.layers]]
    type = "csv"

[[datasets]]
description="Arrow example"
name="table_arrow"
uuid="e964d100-fef2-45c4-85de-9d8e142f4084"

    [datasets.storage]
    driver="FileSystem"
    type="Blob"
    path="@__DIR__/data/people.arrow"

    [[datasets.layers]]
    type = "arrow"
Beyond tabular data, here are some further examples of data which comes encoded in many forms, but which we'd like to handle through the same data loader API:
Byte streams:
- raw
- gzip
- xz
- zstd
- ...
Images:
- png
- jpeg
- tiff
- ...
Data trees:
- directories
- zip
- hdf5
- ...
I thought we'd discussed not using @! and making the context explicit instead.
Yes, but then we decided to use finalizers instead, where possible, and not expose the context to users at all. That's what was implemented in #12 for Blob and BlobTree (which needed to become mutable as a result).
You'll note that #12 contains no mention of ResourceContexts.jl in the documentation update.
Also, the above is purely optional use of @!; explicit context passing is fine too:
ctx = ResourceContext()
data = open(ctx, dataset("table_tsv"))
Of course, the issue with the finalizer approach is that it doesn't work with some third-party types such as CSV.File, which are immutable and can't have finalizers attached. Ideas?
Return a mutable wrapper object, perhaps? Either that, or if the object is immutable, throw an error and require the caller to use the explicit context form (or the @! shorthand).
Thanks, I think these are the options. I've been mulling it over but haven't come up with anything else yet.
With wrappers, there seem to be two alternatives:
- Return something very generic like Ref{T}.
  - Pro: Works for all types.
  - Con: Doesn't have a useful API; it must be unwrapped to do anything. Quite clumsy, and not similar to the API for types which happen to be mutable and don't need wrapping.
  - Con: After unwrapping, users will want to drop the wrapper, in which case their resources will be closed. (See the sketch after this list.)
- Return a wrapper with the right API, for example a hypothetical WrappedTable for tabular data.
  - Pro: User friendly.
  - Con: Lots of wrappers to implement; doesn't easily scale to many disparate packages.
  - Con: The correct API for wrappers may be unclear. In the extreme, it's just an exact duplicate of the wrapped object.
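For concreteness, here's a very rough sketch of what the generic wrapper alternative could look like (purely illustrative, with made-up names; none of this exists in the PR):

# Hypothetical: box an immutable result (e.g. CSV.File) in a mutable wrapper so a
# finalizer can release the underlying resources once the wrapper is collected.
mutable struct ClosableWrapper{T}
    object::T
    ctx      # whatever owns the open files/streams backing `object`
end

function close_on_gc(object, ctx)
    w = ClosableWrapper(object, ctx)
    # Assumes the context supports close(); the real resource API might differ.
    finalizer(w -> close(w.ctx), w)
    return w
end

# The downside noted in the list: users must reach for wrapper.object to use the
# data, and once they drop the wrapper the resources behind it get closed.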
Altogether, wrappers don't seem very appealing. I'm inclined to just error and direct the user to the explicit context-based API for the generic code path.
As a hybrid, we could implement a few wrappers for APIs which are relatively well defined and commonly used, e.g. tables.
Honestly it seems most appealing to me to just always require the context object. Once people learn to do this it will always work.
hmm. so I could have a file that is encrypted + compressed, and layers would allow the program to peel this back to handle that on the fly? what other types of preprocessing could be layers? user-defined layers?
so I could have a file that is encrypted + compressed, and layers would allow the program to peel this back to handle that on the fly?
Yes, this should be possible. I think the interesting/tricky thing here is having a way to provide parameters to layers. In particular, how would we inject the decryption keys in a secure way? I suppose these are logically a property of the DataSet, but you also don't want to leave keys lying around in memory.
what other types of preprocessing could be layers?
Anything that represents a linear pipeline of decoding stages could be represented. (Conversely, more general DAGs cannot be represented as cleanly; the whole DAG would have to be represented as a single non-composable layer.)
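Conceptually the decoding is just a left-to-right fold over the layer list, roughly like the sketch below (a simplification for illustration, not the actual code in layers.jl):

# Each layer's open() consumes the previous stage's output, e.g.
# Blob -> gzip layer -> decompressed stream -> csv layer -> CSV.File.
function open_layers(blob, layers)
    obj = blob
    for layer in layers
        obj = open(layer, obj)
    end
    return obj
end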
user-defined layers?
Yes, in this PR the user should be able to define their own layer by calling DataSets.register_layer! in their third-party module (probably as part of the module's __init__ function) and defining a method with the signature open(layer::DataLayer{:users_custom_tag}, blob::Blob).
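A minimal sketch of what that could look like for a made-up base64 layer (the exact arguments to register_layer! aren't spelled out here, so the registration call below is a guess):

module Base64Layer

using DataSets, Base64

function __init__()
    # Hypothetical registration associating the "base64" layer tag with this package.
    DataSets.register_layer!(:base64)
end

# Decode the blob's base64 text into raw bytes for the next layer to consume.
function Base.open(layer::DataSets.DataLayer{:base64}, blob::DataSets.Blob)
    return base64decode(open(String, blob))
end

end # module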
I'll go ahead and close this PR, since I don't think we'll merge it. But the branch and discussion will stay around for future reference.