The road to DataSets 1.0

Open c42f opened this issue 3 years ago • 1 comment

Here's a rough list of items I'm considering on the path to a DataSets-1.0 release. Several of these can and should be done prior to version 1.0 in case the APIs need to be adjusted a bit before the 1.0 release.

[ ] Streamline access for small datasets by providing a "high level" API for working with a fully in-memory representation of the data, which doesn't require managing separate resources. ("Separate resources" means things like an on-disk cache of the data or incremental async download/upload.) Perhaps we can use the verbs load() / save() for this; thinking of DataSets.jl as a new FileIO.jl, I think this would make sense. (Actually, this isn't breaking, so it doesn't need to wait for 1.0.)
[ ] Somehow allow load() and save() to return some "default type the user cares about" for convenience. For example, returning a DataFrame for a tabular dataset. This will require addressing the problems of dynamically loading Julia modules that were partially faced in #17.
[ ] Consider the fate of dataset() and open(): currently the open(dataset(...)) idiom is an awkward double step and leads to some ambiguities. Perhaps we could repurpose dataset(name) to mean what open(dataset(name)) currently does?
[ ] Perhaps unexport DataSet? Users should rarely need to use this directly.
[ ] Storage API: finalize how we're going to deal with "resources" which back a lazily downloaded dataset (cache management, etc.). We could adopt the approach from ResourceContexts.jl, for example using ctx = ResourceContext(); x = dataset(ctx, "name"); ...; close(ctx). Or the approach from ContextManagers.jl, in the style ctx = dataset("name"); x = value(ctx); close(ctx). (Both of these have macros for syntactic shortcuts.)
- #27
- #38
[ ] Improve and formalize the BlobTree API
- #41
- API improvements from #38
- #42
[ ] Figure out how we can integrate with FilePathsBase and whether there's a type which can implement the AbstractPath interface well enough to allow things like CSV.read(x) to work for some x. Perhaps we need a DataSpecification type for the URI-like concept currently called "dataspec" in the codebase? We could have CSV.read(data"foo?version=2#a/b")?
[ ] Consider deprecating and removing the "data entry point" macros @datarun and @datafunc. I feel introducing these was premature and the semantics are probably not quite right. We can bring something similar back in the future if it seems like a good idea.
[ ] Fix some issues with Data.toml
- [ ] Consider representing the [datasets] section as a dictionary mapping names to configs, not as an array of entries with name properties. This is safe because TOML syntax does allow arbitrary strings as section names. (Note that either representation is valid when a given DataSet is specifically tied to a project.)
- [ ] Move data storage driver type outside of the storage section?
- [x] Fix up the mess with @__DIR__ templating somehow (fixed in #46)
[ ] Dataset resolution
- [ ] Rename DataSets.PROJECT to DataSets.PROJECTS if this is always a StackedDataProject.
- [ ] Consider whether we really want a data stack vs how "data authorities" could perhaps work (i.e., the authority section in the URI; e.g., juliahub.com)
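
To make the Data.toml item above concrete, here is a minimal sketch comparing the two representations of the datasets section, using Julia's TOML standard library. The dataset name, the field values and the helper `datasets_by_name` are made up for illustration; none of this is code from DataSets.jl.

```julia
using TOML

# Array-of-tables form: each entry carries its own `name` property.
# (Field values here are illustrative only.)
array_form = TOML.parse("""
[[datasets]]
name = "a_text_file"
description = "A text file"

    [datasets.storage]
    driver = "FileSystem"
    type = "Blob"
    path = "data/file.txt"
""")

# Proposed dictionary form: the name becomes the table key.
dict_form = TOML.parse("""
[datasets.a_text_file]
description = "A text file"

    [datasets.a_text_file.storage]
    driver = "FileSystem"
    type = "Blob"
    path = "data/file.txt"
""")

# Hypothetical helper: convert the array form into the dictionary form.
function datasets_by_name(config::AbstractDict)
    by_name = Dict{String,Any}()
    for entry in config["datasets"]
        name = entry["name"]
        haskey(by_name, name) && error("duplicate dataset name: $name")
        by_name[name] = Dict{String,Any}(k => v for (k, v) in entry if k != "name")
    end
    return by_name
end

# Quoted keys let a table name be an arbitrary string.
quoted = TOML.parse("""
[datasets."name with spaces"]
description = "Quoted keys allow arbitrary names"
""")
```

One design consequence of the dictionary form is that duplicate names become a TOML-level error rather than something the loader has to check, as `datasets_by_name` does here.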
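
For the DataSpecification idea above, a rough sketch of how a string like "foo?version=2#a/b" might be split into its parts. The `DataSpecification` struct, `parse_dataspec` and the `data"..."` string macro are all hypothetical names invented for this example; the actual "dataspec" handling in the codebase may differ, and the sketch ignores percent-encoding.

```julia
# Hypothetical type for the URI-like "dataspec"; not part of DataSets.jl.
struct DataSpecification
    name::String                      # dataset name, e.g. "foo"
    query::Dict{String,String}        # query parameters, e.g. "version" => "2"
    fragment::Union{Nothing,String}   # path within the dataset, e.g. "a/b"
end

function parse_dataspec(spec::AbstractString)
    # A name, then an optional ?query, then an optional #fragment.
    m = match(r"^([^?#]*)(?:\?([^#]*))?(?:#(.*))?$", spec)
    m === nothing && throw(ArgumentError("invalid dataspec: $spec"))

    query = Dict{String,String}()
    if m[2] !== nothing
        for pair in split(m[2], '&'; keepempty=false)
            # A key without '=' is stored with an empty value.
            key, value = occursin('=', pair) ? split(pair, '='; limit=2) : (pair, "")
            query[String(key)] = String(value)
        end
    end

    fragment = m[3] === nothing ? nothing : String(m[3])
    return DataSpecification(String(m[1]), query, fragment)
end

# Hypothetical string macro so that data"..." builds a DataSpecification.
macro data_str(s)
    :(parse_dataspec($s))
end

spec = data"foo?version=2#a/b"
```

Making something like `CSV.read(spec)` work would additionally need the type to implement enough of the AbstractPath interface, which this sketch does not attempt.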

May 06 '22 06:05 c42f

For dynamically loading Julia modules, in MLDatasets.jl we now use the @lazy import macro from LazyModules and our own @require import macro (similar to @lazy but requiring the user to add the import to their code).
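
As a rough illustration of the general idea behind lazy loading (this is not how LazyModules or MLDatasets.jl implement it), a module can be loaded on first use and called through `Base.invokelatest` to avoid world-age errors. The helper `lazy_call` is a made-up name, and the sketch uses the Dates standard library as a stand-in for a heavier dependency such as DataFrames. Note that `Base.require` is an internal function.

```julia
# Load `pkg` on first use and call the function named `fname` from it.
# `lazy_call` is a hypothetical helper written for this example.
function lazy_call(pkg::Symbol, fname::Symbol, args...; kwargs...)
    # Loads the package if needed; returns the already-loaded module otherwise.
    mod = Base.require(Main, pkg)
    # The module may have been loaded after this function started running,
    # so look up and call the function in the latest world age.
    f = Base.invokelatest(getfield, mod, fname)
    return Base.invokelatest(f, args...; kwargs...)
end

# Dates is not loaded until this call runs.
d = lazy_call(:Dates, :Date, 2022, 5, 6)
```

The trade-off is that every call goes through dynamic dispatch, which is usually acceptable for a one-off load() but not for hot loops.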

May 21 '22 06:05 CarloLucibello
