Convenience for a script processing datasets into a new dataset
Hi there,
Inspired by the existing functionality of "generating a new dataset from a given pipeline" described here: https://tecosaur.github.io/DataToolkit.jl/main/tutorial/#Cleaning-the-data , and also by the DrWatson.produce_or_load functionality, I have a proposal for something that is sort of a merge of the two.
In my work I produce a "derived" dataset, similar to what is done in the DataToolkit.jl tutorial. However, I am searching for a minimally invasive way to transform the following script into something DataToolkit.jl-compatible. Let's say I have this:
using PkgA, PkgB, ...
X = load(dataset1_path)
Y = load(dataset2_path)
...
W = produce_new_dataset_from_others(X, Y, ...)
save(datasetW_path, W)
How do I leverage DataToolkit.jl so that the dataset W is re-created on demand only when any of the input datasets is modified, given this script? Let's assume that I have already transformed X, Y, ... into DataToolkit.jl data entries, since it is clear from the docs how to do that.
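For concreteness, the input side of the script is straightforward to translate (the dataset names below are just placeholders); what I'm really after is a home for W, sketched roughly as:
using DataToolkit
using PkgA, PkgB

X = d"dataset1"   # already registered as DataToolkit data entries
Y = d"dataset2"

W = produce_new_dataset_from_others(X, Y)
# ideally W would itself become a data entry that is only recomputed when
# X or Y change, rather than being written out manually with save(datasetW_path, W)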
It's not exactly what we talked about, but this seems like a good place to note that the API in v0.10 is beginning to make programmatic DataSet creation feasible without being an ugly mess.
Sample
const REGISTRY_URL = "https://pkg.julialang.org"

const regdata = loadcollection!(joinpath(@__DIR__, "RegistryData.toml"))
const pkgsources = dataset(regdata, "PackageSources") |> read

const pkgdata = DataCollection("PackageData", plugins = ["defaults", "store"])

for (name, uuid, url, hash) in pkgsources
    pkgfiles = create!(pkgdata, DataSet, name,
        "description" => "The source files of the package $name.")
    storage!(pkgfiles, :web, "url" => "$REGISTRY_URL/package/$uuid/$hash")
    loader!(pkgfiles, :chain, "loaders" => ["gzip", "tar"])
    pkgstrs = create!(pkgdata, DataSet, name * " strings",
        "description" => "The strings extracted from the source files of the package $name.")
    storage!(pkgstrs, :passthrough,
        "source" => string(Identifier(pkgfiles)),
        "type" => Dict{String, IO})
    loader!(pkgstrs, :julia,
        "input" => Dict{String, IO},
        "path" => "Data.d/extract_string.jl",
        "type" => Vector{String})
end

write(joinpath(@__DIR__, "PackageData.toml"), pkgdata)
Thanks! Perhaps you could attach a text description of what the script does?
Sure! That takes a list of pkgsources (gzipped Julia package source tarballs) and, for each one, generates a DataSet for the untarred content and another DataSet for all the strings in that package. For example:
(RegistryData) data> stack list
# Name Datasets Writable Plugins
─────────────────────────────────────────────────────────────────────
1 RegistryData 2 yes cache, defaults, memorise, store
2 PackageData 19849 yes defaults, store
julia> d"DrWatson strings"
1049-element Vector{String}:
"<NAME-PLACEHOLDER>"
"dummy_src_file.jl"
"\nCurrently active project is: \$" ⋯ 192 bytes ⋯ "ening your own Pull Requests!\n"
"double"
"a=0.1535_b=5_mode=double"
"n_a=0.153_b=5_mode=double"
"n"
⋮
"."
""
"jld2"
"tmp"
"_research"
"\n tmpsave(dicts::Vector{Dict" ⋯ 635 bytes ⋯ " to wsave (e.g. compression).\n"
Sample of the generated Data TOML
[[DrWatson]]
uuid = "c36fd30f-9fa2-469d-8eb2-3a5f86ad49a6"
description = "The source files of the package DrWatson."
[[DrWatson.storage]]
driver = "web"
url = "https://pkg.julialang.org/package/634d3b9d-ee7a-5ddf-bec9-22491ea816e1/32704fb48e1ecd3739d5018df35282237b823f0a"
[[DrWatson.loader]]
driver = "chain"
loaders = ["gzip", "tar"]
[["DrWatson strings"]]
uuid = "9d4121b5-e042-40bd-839e-631dfb4f7a31"
description = "The strings extracted from the source files of the package DrWatson."
[["DrWatson strings".storage]]
driver = "passthrough"
source = "PackageData:DrWatson"
type = "Dict{String,IO}"
[["DrWatson strings".loader]]
driver = "julia"
input = "Dict{String,IO}"
path = "Data.d/extract_string.jl"
type = "Array{String,1}"
I think a layer of convenience on top of this that gets us closer to produce_or_load might be something like:
@jldataset "Name" function(a = d"input1", b = d"input2")::Int
    a * b # (pure) code that produces the result
end
Allowing d"Name" to be used in subsequent code.
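To be clear, I'm picturing this as thin sugar over the programmatic API above. Very roughly, and with the parameters below being guesses rather than a worked-out design (assuming create!/loader! are reachable via using DataToolkit), the expansion might look something like:
using DataToolkit

collection = DataCollection("Products")   # stand-in for whatever collection is active
nameds = create!(collection, DataSet, "Name",
    "description" => "Dataset produced from input1 and input2.")
loader!(nameds, :julia,
    "path" => "Data.d/name.jl",   # the macro would write the function body out here
    "type" => Int)                # taken from the ::Int annotation
# plus some record that d"input1" and d"input2" are inputs, so that a change to
# either invalidates "Name" -- figuring out that wiring is the interesting part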
Copying over comments from a different issue that largely duplicate this (recorded here for completeness).
Reading through the documentation I wonder if there might be some things that could improve the ergonomics of processing data (I'm afraid I haven't had much chance to use DTK in anything meaningful, so these may be non-issues). I'm very happy to be told no if you feel like I'm asking for things that are outside of the intended scope, or were previously considered and discounted.
I see this package as being very useful in providing a full data-to-output pipeline; in particular, I'm interested in the ability of DTK to propagate 'invalidations' and recompute dependencies upon data changes. I haven't had a chance to dive into the code, but I'm guessing that dependencies are also recomputed if any of the underlying functions change, given the loader is saved as a string (and presumably hashed).
However, from what I can tell, the DTK and Data.toml approach seems to rely on a very interactive style of development, starting the creation of a new object with make new_object in the DTK REPL. For short modifications this isn't an issue, but for more involved processing steps it can become rather cumbersome (even with everything wrapped in a package that can be @required).
Would it be possible to implement a macro that produces the same output, but instead allows the user to specify the processing steps within a file (or series of files), much like the {targets} R package? I was thinking it would be great if a user could write a script like below:
# processing-example.jl
using InternalPackage: InternalPackage

iris_clean = @make(InternalPackage.cleaning_step(iris))
iris_processed = @make(InternalPackage.processing_step(iris_clean))

where @make transforms to the commands the REPL mode executes (repl_make()?), along with the @require calls?
In an ideal world, it would also be great to have a command to check for outdated objects, though maybe that already exists?
Thanks again!