DataSets.jl Creating, deleting and updating datasets

We need some programmatic way to create datasets, to update their metadata and to delete them. Currently people need to manage this manually by writing TOML, which clearly isn't great.

API musings

One possibility is to overload the dataset() function itself with the ability to create a dataset. For example adding a create=true flag:

dataset("SomeData", create=true, tags=[...], description="some desc", other_args...)
dataset(project, "SomeData", create=true, tags=[...], description="some desc", other_args...)

Another idea would be to pass a verb along as a positional argument, such as

dataset("SomeData", :create; description="some desc", other_args...)
dataset("SomeData", :delete)
dataset("SomeData", :update, description="new desc")

With :read being the default verb. This allows us to reuse the exported dataset() function for all dataset-related CRUD operations.
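A rough sketch of how the verb idea could be dispatched, purely for illustration. Everything here is hypothetical: `ToyProject` is a plain dictionary standing in for a data project, and `toy_dataset` is a made-up name, not the real `dataset()` function.

```julia
# Hypothetical stand-in for a data project: dataset name => metadata.
const ToyProject = Dict{String,Dict{String,Any}}

# Positional verb with :read as the default. Val(verb) lifts the Symbol
# into the type domain so that each verb gets its own method.
toy_dataset(proj::ToyProject, name::AbstractString, verb::Symbol=:read; kws...) =
    toy_dataset(Val(verb), proj, name; kws...)

toy_dataset(::Val{:read}, proj, name) = proj[name]
toy_dataset(::Val{:delete}, proj, name) = (delete!(proj, name); nothing)

function toy_dataset(::Val{:create}, proj, name; kws...)
    haskey(proj, name) && error("dataset \"$name\" already exists")
    proj[name] = Dict{String,Any}(String(k) => v for (k, v) in kws)
end

function toy_dataset(::Val{:update}, proj, name; kws...)
    meta = proj[name]
    for (k, v) in kws
        meta[String(k)] = v
    end
    return meta
end

proj = ToyProject()
toy_dataset(proj, "SomeData", :create; description="some desc")
toy_dataset(proj, "SomeData", :update; description="new desc")
toy_dataset(proj, "SomeData")            # same as passing :read
toy_dataset(proj, "SomeData", :delete)
```

In this sketch an unknown verb fails with a `MethodError`, and keywords passed to a verb that takes none (such as `:delete`) are rejected as well.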

But let's be honest, this is a little weird other than being economical with exported names. Perhaps I've been doing too much REST recently :-) Probably a better alternative would be to just have a function per operation:

DataSets.create("SomeData"; description="some desc", other_args...)
DataSets.delete("SomeData")
DataSets.update("SomeData", description="new desc")

update() is a bit of an odd one out of these operations — what if you wanted to delete some metadata? I guess we could pass something like description=nothing for deleting metadata items.
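A minimal sketch of the `description=nothing` convention, assuming metadata is held in a plain dictionary. `update_metadata!` is a hypothetical helper, not part of DataSets.jl.

```julia
# Hypothetical helper acting on a plain metadata dictionary, not on a real
# DataSets.jl object. A keyword set to `nothing` removes that key; any
# other value inserts or overwrites it.
function update_metadata!(meta::AbstractDict{String}; kws...)
    for (key, value) in kws
        k = String(key)
        if value === nothing
            delete!(meta, k)   # no-op when the key is already absent
        else
            meta[k] = value
        end
    end
    return meta
end

meta = Dict{String,Any}("description" => "some desc", "tags" => ["a", "b"])
update_metadata!(meta; description=nothing, license="MIT")
```

The cost of this convention is that `nothing` can no longer be stored as a metadata value. Since TOML has no null value, that is probably not a loss for metadata that ends up in a TOML file.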

Which data project?

When creating a dataset it needs to be created within "some" data project. Presumably this would be the topmost project in the data project stack, or within a provided project if the project is supplied as the first argument.
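The resolution rule above could be sketched as a pair of methods. All names here (`ToyDataProject`, `TOY_STACK`, `toy_create`) are hypothetical stand-ins rather than DataSets.jl types or functions, and the stack is assumed to be a vector with the topmost project first.

```julia
# Hypothetical stand-ins; these are not the DataSets.jl types.
struct ToyDataProject
    name::String
    datasets::Dict{String,Dict{String,Any}}
end
ToyDataProject(name) = ToyDataProject(name, Dict{String,Dict{String,Any}}())

# The search stack, with the topmost (highest precedence) project first.
const TOY_STACK = ToyDataProject[]

# Explicit project as the first argument.
function toy_create(project::ToyDataProject, name::AbstractString; metadata...)
    haskey(project.datasets, name) &&
        error("\"$name\" already exists in project $(project.name)")
    project.datasets[name] = Dict{String,Any}(String(k) => v for (k, v) in metadata)
    return project
end

# No project given: fall back to the top of the stack.
function toy_create(name::AbstractString; metadata...)
    isempty(TOY_STACK) && error("no data project on the stack")
    return toy_create(first(TOY_STACK), name; metadata...)
end

pushfirst!(TOY_STACK, ToyDataProject("user"))
pushfirst!(TOY_STACK, ToyDataProject("scratch"))   # now topmost
toy_create("SomeData"; description="some desc")    # lands in "scratch"
toy_create(TOY_STACK[2], "Other")                  # lands in "user"
```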

Data ownership

Creation — and especially deletion — brings up an additional problem: How do we distinguish between data which is "owned" by a data project (so that the data itself should be deleted when the dataset is removed from the project), vs data which is merely linked to?

For existing data referenced on the filesystem this is particularly relevant. We don't want DataSets to delete somebody's existing data which they're referring to. But neither do we want DataSets.delete() to leave unwanted data lying around.

I think we should have an extra metadata key to distinguish between data which is managed-vs-linked-to by DataSets. Perhaps under the keys linked, or managed or some such. (Should this go within the storage section or not?)
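To make the managed-vs-linked distinction concrete, here is a sketch under stated assumptions: the key is called `managed`, it sits at the top level of the metadata next to a `path` key, and the project is a plain dictionary. None of this is the DataSets.jl layout, and `toy_delete!` is a hypothetical name.

```julia
# Hypothetical sketch: `managed` is an assumed metadata key, and the project
# is a plain Dict of dataset name => metadata rather than a DataSets.jl type.
function toy_delete!(project::AbstractDict, name::AbstractString)
    meta = pop!(project, name)            # always drop the project entry
    path = get(meta, "path", nothing)
    # Only data the project owns is removed from disk. A missing key is
    # treated as linked, which is the safe default.
    if get(meta, "managed", false) === true && path !== nothing
        rm(path; force=true, recursive=true)
    end
    return meta
end

mktempdir() do dir
    owned  = joinpath(dir, "owned.csv");  write(owned, "a,b\n")
    linked = joinpath(dir, "linked.csv"); write(linked, "a,b\n")
    project = Dict(
        "Owned"  => Dict{String,Any}("path" => owned,  "managed" => true),
        "Linked" => Dict{String,Any}("path" => linked, "managed" => false),
    )
    toy_delete!(project, "Owned")    # entry and file both removed
    toy_delete!(project, "Linked")   # entry removed, file left in place
end
```

Defaulting to "linked" when the key is absent means existing Data.toml files, which have no such key, would never have their data deleted by accident.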

Nov 16 '21 08:11 c42f

This is something I would love to have! Manually writing and updating TOML feels hackish and unreproducible at the moment. The create, delete and update syntax seems the best to me; I'd rather these operations be explicit.

Jan 13 '22 22:01 tclements

Love the questions being asked here, but I would add another, related to this passage:

When creating a dataset it needs to be created within "some" data project. Presumably this would be the topmost project in the data project stack, or within a provided project if the project is supplied as the first argument.

Should data projects be made more transparent as well?

While I know DataSets.ActiveDataProject and DataSets.DataProject are provided, I honestly did not think about the concept of a data project when first using this package. Maybe something in the Data REPL to show the active project, such as a (My Data Project) data> prompt, would make this more obvious. Maybe we could also provide a Data REPL command to list the projects DataSets.jl knows are available.

command	alias	description
`projects`	`proj`	list all available data projects
`project $name`	`proj $name`	switch to $name data project

Mar 31 '22 01:03 jvaverka

Maybe something in the Data REPL to show the active project

We have this — I guess it's just badly named:

data> stack list
DataSets.StackedDataProject:
  DataSets.ActiveDataProject:
    (empty)
  DataSets.TomlFileDataProject [/home/chris/.julia/datasets/Data.toml]:
    📁 SomeDir    => 302a6dd6-d9e1-4487-8919-c520f08165be
    📄 SomeFile   => 97633d9c-afa8-4437-abd9-320cb4fdb270
    📁 TrueFX     => aa21c966-563e-42fb-ac3d-edaa3bdf3652
    📁 imagenet   => e73ae172-eeb0-4417-b3e1-007d42918752

Alternatively, we could make data> ls just show the full stack in this format by default? (The downside there is that duplicate names can occur, with the topmost data project taking precedence. That is why I used the current format for ls, where deduplication has already happened.)

Current data REPL docs do mention this, and the stack command is findable via tab completion. But clearly it should be more discoverable, somehow.

data> ?
  DataSets Data REPL
  ====================

  Press > to enter the data repl. Press TAB to complete commands.

  Command          Alias   Action                                             
  –––––––––––––––– ––––––– –––––––––––––––––––––––––––––––––––––––––––––––––––
  help             ?       Show this message                                  
  list             ls      List all datasets by name                          
  show $name               Preview the content of dataset $name               
  stack            st      Manipulate the global data search stack            
  stack list       st ls   List all projects in the global data search stack  
  stack push $path st push Add data project $path to front of the search stack
  stack pop        st pop  Remove data project from front of the search stack

Mar 31 '22 03:03 c42f

This works perfectly. My tired eyes / brain just looked right over it. Thanks for clarifying!

Mar 31 '22 04:03 jvaverka
