DataSets.jl
The road to DataSets 1.0
Here's a rough list of items I'm considering on the path to a DataSets-1.0 release. Several of these can and should be done prior to version 1.0 in case the APIs need to be adjusted a bit before the 1.0 release.
- [ ] Streamline access for small datasets by providing a "high level" API for use when working with a fully in-memory representation of the data which doesn't require the management of separate resources. ("Separate resources" would be things like managing an on-disk cache of the data, incremental async download/upload; that kind of thing.) Perhaps we can use the verbs `load()`/`save()` for this; thinking of DataSets.jl as a new FileIO.jl, I think this would make sense. (Actually, this isn't breaking, so it doesn't need to wait for 1.0.)
- [ ] Somehow allow `load()` and `save()` to return some "default type the user cares about" for convenience. For example, returning a `DataFrame` for a tabular dataset. This will require addressing the problems of dynamically loading Julia modules that were partially faced in #17
- [ ] Consider the fate of `dataset()` and `open()`: currently the `open(dataset(...))` idiom is a bit of an awkward double step and leads to some ambiguities. Perhaps we could repurpose `dataset(name)` to mean what `open(dataset(name))` currently does?
- [ ] Perhaps unexport `DataSet`? Users should rarely need to use this directly.
- [ ] Storage API; finalize how we're going to deal with "resources" which back a lazily downloaded dataset: cache management, etc. We could adopt the approach from ResourceContexts.jl, for example using `ctx = ResourceContext(); x = dataset(ctx, "name"); ...; close(ctx)`. Or from ContextManagers.jl in the style `ctx = dataset("name"); x = value(ctx); close(ctx)`. (Both of these have macros for syntactic shortcuts.)
  - #27
  - #38
- [ ] Improve and formalize the `BlobTree` API
  - #41
  - API improvements from #38
  - #42
- [ ] Figure out how we can integrate with `FilePathsBase` and whether there's a type which can implement the `AbstractPath` interface well enough to allow things like `CSV.read(x)` to work for some `x`. Perhaps we need a `DataSpecification` type for the URI-like concept currently called "dataspec" in the codebase? We could have `CSV.read(data"foo?version=2#a/b")`?
- [ ] Consider deprecating and removing the "data entry point" stuff `@datarun` and `@datafunc`. I feel introducing these was premature and the semantics are probably not quite right. We can bring something similar back in the future if it seems like a good idea.
- [ ] Fix some issues with Data.toml
  - [ ] Consider representing `[datasets]` section as a dictionary mapping names to configs, not as an array with `name` properties. This is safe because `TOML` syntax does allow arbitrary strings as section names. (Note that either representation is valid when a given `DataSet` is specifically tied to a project.)
  - [ ] Move data storage driver type outside of the storage section?
  - [x] Fix up the mess with `@__DIR__` templating somehow (fixed in #46)
- [ ] Dataset resolution
  - [ ] Rename `DataSets.PROJECT` to `DataSets.PROJECTS` if this is always a `StackedDataProject`.
  - [ ] Consider whether we really want a data stack vs how "data authorities" could perhaps work (i.e., the authority section in the URI; e.g., juliahub.com)
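
To make the Data.toml item about `[datasets]` more concrete, here is a rough sketch comparing the two layouts using only Julia's `TOML` standard library. The dataset name, description and path are made-up illustrative values, the config is trimmed to a few keys, and the dictionary-keyed layout is only my reading of the proposal, not a format that DataSets.jl is known to accept.

```julia
using TOML

# Array layout: each entry of `[[datasets]]` carries its own `name` key.
# (Illustrative values only; a real Data.toml entry has more fields.)
array_style = """
[[datasets]]
name = "a text file"
description = "A text file"

    [datasets.storage]
    driver = "FileSystem"
    type = "Blob"
    path = "@__DIR__/data/file.txt"
"""

# Hypothetical dictionary layout: the dataset name becomes the table key.
# Quoting the key lets TOML accept names with spaces or other characters.
dict_style = """
[datasets."a text file"]
description = "A text file"

    [datasets."a text file".storage]
    driver = "FileSystem"
    type = "Blob"
    path = "@__DIR__/data/file.txt"
"""

by_array = TOML.parse(array_style)["datasets"]  # a vector of tables
by_name  = TOML.parse(dict_style)["datasets"]   # a table keyed by name

# Array layout: finding a dataset by name needs a scan over the entries.
idx = findfirst(d -> d["name"] == "a text file", by_array)
from_array = by_array[idx]

# Dictionary layout: the name is the key, so lookup is direct and
# duplicate names cannot be expressed in the file.
from_dict = by_name["a text file"]

# Apart from where the name lives, both layouts carry the same config.
println(from_array["storage"] == from_dict["storage"])
```

One consequence worth noting is that the dictionary layout loses the file order of the datasets, since TOML tables are unordered, whereas the array layout preserves it.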
For dynamically loading Julia modules, in MLDatasets.jl we now use the `@lazy import` macro from LazyModules and our own `@require import` macro (similar to `@lazy`, but requiring the user to add the import to their own code).