DataSets.jl
Creating, deleting and updating datasets
We need some programmatic way to create datasets, to update their metadata and to delete them. Currently people need to manage this manually by writing TOML, but clearly this isn't great.
API musings
One possibility is to overload the dataset() function itself with the ability to create a dataset. For example adding a create=true flag:
dataset("SomeData", create=true, tags=[...], description="some desc", other_args...)
dataset(project, "SomeData", create=true, tags=[...], description="some desc", other_args...)
Another idea would be to pass a verb along as a positional argument, such as
dataset("SomeData", :create; description="some desc", other_args...)
dataset("SomeData", :delete)
dataset("SomeData", :update, description="new desc")
With :read being the default verb. This allows us to reuse the exported dataset() function for all dataset-related CRUD operations.
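To make the verb idea concrete, here is a minimal sketch on a toy in-memory registry. `toy_dataset` and `REGISTRY` are hypothetical names made up for illustration; this is not how `dataset()` is implemented in DataSets.jl, only one way the positional-verb dispatch could look.

```julia
# Hypothetical in-memory registry standing in for a data project.
const REGISTRY = Dict{String,Dict{String,Any}}()

# Toy stand-in for dataset(); the verb defaults to :read.
function toy_dataset(name::AbstractString, verb::Symbol=:read; metadata...)
    if verb === :read
        return REGISTRY[name]    # throws KeyError if the dataset is absent
    elseif verb === :create
        haskey(REGISTRY, name) && error("dataset $name already exists")
        return REGISTRY[name] = Dict{String,Any}(String(k) => v for (k, v) in metadata)
    elseif verb === :update
        return merge!(REGISTRY[name], Dict{String,Any}(String(k) => v for (k, v) in metadata))
    elseif verb === :delete
        return pop!(REGISTRY, name)
    else
        throw(ArgumentError("unknown verb $(repr(verb))"))
    end
end

toy_dataset("SomeData", :create; description="some desc")
toy_dataset("SomeData", :update; description="new desc")
toy_dataset("SomeData")          # plain read, verb omitted
toy_dataset("SomeData", :delete)
```

Note that every verb ends up sharing one keyword-argument namespace and one return type story, which is part of why this feels awkward.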
But let's be honest, this is a little weird other than being economical with exported names. Perhaps I've been doing too much REST recently :-) Probably a better alternative would be to just have a function per operation:
DataSets.create("SomeData"; description="some desc", other_args...)
DataSets.delete("SomeData")
DataSets.update("SomeData", description="new desc")
update() is a bit of an odd one out of these operations — what if you wanted to delete some metadata? I guess we could pass something like description=nothing for deleting metadata items.
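A rough sketch of the `description=nothing` convention, assuming metadata is held in a plain dictionary. `apply_update!` is a hypothetical helper, not an existing DataSets.jl function.

```julia
# Hypothetical helper: apply keyword changes to a metadata dictionary.
# A value of `nothing` removes the key; any other value sets it.
function apply_update!(metadata::AbstractDict{String}; changes...)
    for (key, value) in changes
        if value === nothing
            delete!(metadata, String(key))   # no-op if the key is absent
        else
            metadata[String(key)] = value
        end
    end
    return metadata
end

meta = Dict{String,Any}("description" => "some desc", "tags" => ["a", "b"])
apply_update!(meta; description="new desc", tags=nothing)
# meta now holds only the updated "description" entry
```

The cost of this convention is that `nothing` can no longer be stored as a metadata value. Since TOML has no null type, that is probably acceptable.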
Which data project?
When creating a dataset it needs to be created within "some" data project. Presumably this would be the topmost project in the data project stack, or within a provided project if the project is supplied as the first argument.
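In code, the rule could be two methods: one taking an explicit project, and one falling back to the top of the stack. All names here (`ToyProject`, `PROJECT_STACK`, `toy_create`) are hypothetical stand-ins and not the DataSets.jl types.

```julia
# Hypothetical stand-in for a data project.
struct ToyProject
    name::String
    datasets::Dict{String,Dict{String,Any}}
end
ToyProject(name) = ToyProject(name, Dict{String,Dict{String,Any}}())

# Search stack, with the topmost (highest-precedence) project first.
const PROJECT_STACK = ToyProject[]

# Explicit project supplied as the first argument.
function toy_create(project::ToyProject, name::AbstractString; metadata...)
    haskey(project.datasets, name) && error("dataset $name already exists in $(project.name)")
    project.datasets[name] = Dict{String,Any}(String(k) => v for (k, v) in metadata)
    return project
end

# No project given: create within the topmost project of the stack.
function toy_create(name::AbstractString; metadata...)
    isempty(PROJECT_STACK) && error("no data project on the stack")
    return toy_create(first(PROJECT_STACK), name; metadata...)
end
```

One open point this leaves out: the topmost project may be read-only or empty by design (like the active project), in which case the fallback would need to skip it or raise an error.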
Data ownership
Creation — and especially deletion — brings up an additional problem: How do we distinguish between data which is "owned" by a data project (so that the data itself should be deleted when the dataset is removed from the project), vs data which is merely linked to?
For existing data referenced on the filesystem this is particularly relevant. We don't want DataSets.delete() to delete somebody's existing data which they're referring to. But neither do we want DataSets.delete() to leave unwanted data lying around.
I think we should have an extra metadata key to distinguish between data which is managed-vs-linked-to by DataSets. Perhaps under the keys linked, or managed or some such. (Should this go within the storage section or not?)
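As a sketch of how such a key might drive deletion, assuming a boolean `managed` flag (the key name and the `ToyEntry` / `toy_delete!` names are hypothetical):

```julia
# Hypothetical dataset record; `managed` is the proposed ownership flag.
struct ToyEntry
    path::String
    managed::Bool    # true: the project owns the data; false: merely linked
end

# Remove `name` from the registry; delete the data itself only when managed.
function toy_delete!(registry::AbstractDict{String,ToyEntry}, name::AbstractString)
    entry = pop!(registry, name)
    if entry.managed
        rm(entry.path; recursive=true, force=true)
    end
    return entry
end
```

With this shape, removing a linked dataset only forgets the reference, while removing a managed one also cleans up the storage.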
This is something I would love to have! Manually writing and updating TOML feels hackish and unreproducible at the moment. The create, delete and update syntax seems the best to me - I'd rather these operations be explicit.
Love the questions being asked here, but I would add another, related to this point:
> When creating a dataset it needs to be created within "some" data project. Presumably this would be the topmost project in the data project stack, or within a provided project if the project is supplied as the first argument.

Should data projects be made more transparent as well?
While I know the functions DataSets.ActiveDataProject & DataSets.DataProject are provided I honestly did not think about the concept of a Data Project when first using this package. Maybe something in the Data REPL to show the active project (My Data Project) data> would make this more obvious. Maybe we also provide a Data REPL command to list the ones DataSets.jl knows are available.
| command | alias | description |
|---|---|---|
| projects | proj | list all available data projects |
| project $name | proj $name | switch to $name data project |
> Maybe something in the Data REPL to show the active project
We have this — I guess it's just badly named:
data> stack list
DataSets.StackedDataProject:
DataSets.ActiveDataProject:
(empty)
DataSets.TomlFileDataProject [/home/chris/.julia/datasets/Data.toml]:
📁 SomeDir => 302a6dd6-d9e1-4487-8919-c520f08165be
📄 SomeFile => 97633d9c-afa8-4437-abd9-320cb4fdb270
📁 TrueFX => aa21c966-563e-42fb-ac3d-edaa3bdf3652
📁 imagenet => e73ae172-eeb0-4417-b3e1-007d42918752
Alternatively, we could make data> ls just show the full stack in this format by default? (The downside there is that duplicate names can occur with the topmost data project taking precedence. Which is why I used the current format for ls where deduplication has already happened.)
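For reference, the deduplication rule amounts to something like the following sketch, where each project is modelled as a plain name-to-UUID dictionary. `visible_datasets` is a hypothetical name and the UUID strings are placeholders.

```julia
# Each project maps dataset name => UUID string.
# The stack lists the topmost (highest-precedence) project first.
function visible_datasets(stack::AbstractVector{<:AbstractDict{String,String}})
    visible = Dict{String,String}()
    for project in stack
        for (name, uuid) in project
            # get! only inserts when the name is absent, so earlier projects win.
            get!(visible, name, uuid)
        end
    end
    return visible
end

top = Dict("SomeFile" => "uuid-top")
bottom = Dict("SomeFile" => "uuid-bottom", "TrueFX" => "uuid-fx")
visible_datasets([top, bottom])   # "SomeFile" resolves to the entry from `top`
```

Showing the full stack instead would display both `SomeFile` entries, which is the duplication concern mentioned above.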
Current data REPL docs do mention this, and the stack command is findable via tab completion. But clearly it should be more discoverable, somehow.
data> ?
DataSets Data REPL
====================
Press > to enter the data repl. Press TAB to complete commands.
Command          Alias   Action
–––––––––––––––– ––––––– –––––––––––––––––––––––––––––––––––––––––––––––––––
help             ?       Show this message
list             ls      List all datasets by name
show $name               Preview the content of dataset $name
stack            st      Manipulate the global data search stack
stack list       st ls   List all projects in the global data search stack
stack push $path st push Add data project $path to front of the search stack
stack pop        st pop  Remove data project from front of the search stack
This works perfectly. My tired eyes / brain just looked right over it. Thanks for clarifying!