
Convenience for a script processing datasets into new dataset

Open Datseris opened this issue 1 year ago • 4 comments

Hi there,

Inspired by the existing functionality of "generating a new dataset from a given pipeline" described here: https://tecosaur.github.io/DataToolkit.jl/main/tutorial/#Cleaning-the-data, and also by the DrWatson.produce_or_load functionality, I have a proposal for something that is a sort of merger of the two.

In my work I produce a "derived" dataset, similar to what is done in the DataToolkit.jl tutorial. However, I am searching for a minimally invasive way to transform the following script into something DataToolkit.jl-compatible. Let's say I have this:

using PkgA, PkgB, ...

X = load(dataset1_path)
Y = load(dataset2_path)
...

W = produce_new_dataset_from_others(X, Y, ...)

save(datasetW_path, W)

How do I leverage DataToolkit.jl so that the dataset W is re-created on demand only when any of the input datasets is modified, given this script? Let's assume that I have already transformed X, Y, ... into DataToolkit.jl data entries, as it is clear from the docs how to do this.
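For reference, the generated Data TOML later in this thread suggests one declarative shape such a derived entry could take: a passthrough storage pointing at an input dataset, plus a julia loader running the producing script. A rough hypothetical sketch (the dataset name, script path, and single-input restriction are all illustrative, not a tested configuration):

```toml
# Hypothetical sketch of a derived entry for W, mirroring the
# passthrough + julia loader pattern shown later in this thread.
# Dataset names and the script path are illustrative only.
[[W]]
description = "Derived dataset produced from dataset1."

    [[W.storage]]
    driver = "passthrough"
    source = "dataset1"

    [[W.loader]]
    driver = "julia"
    path = "Data.d/produce_W.jl"
```

How the store plugin would decide that the input changed (and hence that W is stale) is exactly the open question of this issue.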

Datseris avatar Jun 09 '24 10:06 Datseris

It's not exactly what we talked about, but this seems like a good place to note that the API in v0.10 is beginning to make programmatic DataSet creation feasible without being an ugly mess.

Sample

using DataToolkit

const REGISTRY_URL = "https://pkg.julialang.org"

const regdata = loadcollection!(joinpath(@__DIR__, "RegistryData.toml"))
const pkgsources = dataset(regdata, "PackageSources") |> read

const pkgdata = DataCollection("PackageData", plugins = ["defaults", "store"])

for (name, uuid, url, hash) in pkgsources
    pkgfiles = create!(pkgdata, DataSet, name, "description" => "The source files of the package $name.")
    storage!(pkgfiles, :web, "url" => "$REGISTRY_URL/package/$uuid/$hash")
    loader!(pkgfiles, :chain, "loaders" => ["gzip", "tar"])
    pkgstrs = create!(pkgdata, DataSet, name * " strings",
                      "description" => "The strings extracted from the source files of the package $name.")
    storage!(pkgstrs, :passthrough,
             "source" => string(Identifier(pkgfiles)),
             "type" => Dict{String, IO})
    loader!(pkgstrs, :julia,
            "input" => Dict{String, IO},
            "path" => "Data.d/extract_string.jl",
            "type" => Vector{String})
end

write(joinpath(@__DIR__, "PackageData.toml"), pkgdata)

tecosaur avatar Sep 25 '24 19:09 tecosaur

Thanks! Perhaps you can attach a text description of what the script does?

Datseris avatar Sep 26 '24 09:09 Datseris

Sure! It takes a list of pkgsources (gzipped Julia package source tarballs) and, for each one, generates a DataSet for the untarred content and another DataSet for all the strings in that package. For example:

(RegistryData) data> stack list 
 #  Name          Datasets  Writable  Plugins                         
 ─────────────────────────────────────────────────────────────────────
 1  RegistryData  2         yes       cache, defaults, memorise, store
 2  PackageData   19849     yes       defaults, store                 

julia> d"DrWatson strings"
1049-element Vector{String}:
 "<NAME-PLACEHOLDER>"
 "dummy_src_file.jl"
 "\nCurrently active project is: \$" ⋯ 192 bytes ⋯ "ening your own Pull Requests!\n"
 "double"
 "a=0.1535_b=5_mode=double"
 "n_a=0.153_b=5_mode=double"
 "n"
 ⋮
 "."
 ""
 "jld2"
 "tmp"
 "_research"
 "\n    tmpsave(dicts::Vector{Dict" ⋯ 635 bytes ⋯ " to wsave (e.g. compression).\n"

Sample of the generated Data TOML:
[[DrWatson]]
uuid = "c36fd30f-9fa2-469d-8eb2-3a5f86ad49a6"
description = "The source files of the package DrWatson."

    [[DrWatson.storage]]
    driver = "web"
    url = "https://pkg.julialang.org/package/634d3b9d-ee7a-5ddf-bec9-22491ea816e1/32704fb48e1ecd3739d5018df35282237b823f0a"

    [[DrWatson.loader]]
    driver = "chain"
    loaders = ["gzip", "tar"]

[["DrWatson strings"]]
uuid = "9d4121b5-e042-40bd-839e-631dfb4f7a31"
description = "The strings extracted from the source files of the package DrWatson."

    [["DrWatson strings".storage]]
    driver = "passthrough"
    source = "PackageData:DrWatson"
    type = "Dict{String,IO}"

    [["DrWatson strings".loader]]
    driver = "julia"
    input = "Dict{String,IO}"
    path = "Data.d/extract_string.jl"
    type = "Array{String,1}"

I think a layer of convenience on top of this that gets us closer to produce_or_load might be something like:

@jldataset "Name" function(a = d"input1", b = d"input2")::Int
   a * b # (pure) code that produces the result
end

Allowing d"Name" to be used in subsequent code.
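The produce_or_load-style behaviour this hints at can be sketched in plain Julia. To be clear, this is a hypothetical illustration, not DataToolkit API: the function name and the global cache are made up. The idea is that the producing function reruns only when the hash of its inputs changes:

```julia
# Hypothetical sketch (not DataToolkit API): cache a named result keyed
# by the hash of its inputs, so the producing function only reruns when
# an input value changes -- similar in spirit to DrWatson.produce_or_load.
const _RESULT_CACHE = Dict{Tuple{String,UInt},Any}()

function produce_if_changed(f, name::AbstractString, inputs...)
    key = (String(name), hash(inputs))
    get!(_RESULT_CACHE, key) do
        f(inputs...)  # only evaluated when (name, input-hash) is unseen
    end
end
```

A real implementation would also persist the cache across sessions and hash the producing code itself, but the staleness check would look much like this.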

tecosaur avatar Sep 26 '24 18:09 tecosaur

Copying over comments from a different issue that largely duplicate this one (documented here for completeness).

Reading through the documentation I wonder if there might be some things that could improve the ergonomics of processing data (I'm afraid I haven't had much chance to use DTK in anything meaningful, so these may be non-issues). I'm very happy to be told no if you feel like I'm asking for things that are outside of the intended scope, or were previously considered and discounted.

I see this package as being very useful in providing a full data-to-output pipeline; in particular, I'm interested in the ability of DTK to propagate 'invalidations' and recompute dependencies upon data changes. I haven't had a chance to dive into the code, but I'm guessing that dependencies are also recomputed if any of the underlying functions change, given the loader is saved as a string (and presumably hashed).

However, from what I can tell, the DTK and Data.toml approach seems to rely on a very interactive style of development, starting the creation of a new object with make new_object in the DTK REPL. For short modifications this isn't an issue, but for more involved processing steps it may become rather cumbersome (even when wrapping everything in a package that can be @required).

Would it be possible to implement a macro that produces the same output, but instead allows the user to specify the processing steps within a file (or series of files), much like the {targets} R package? It would be great if a user could write a script like the one below:

# processing-example.jl
using InternalPackage: InternalPackage

iris_clean = @make(InternalPackage.cleaning_step(iris))
iris_processed = @make(InternalPackage.processing_step(iris_clean))

where @make expands to the commands the REPL mode executes (repl_make()?), along with the necessary @require calls.
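One possible shape for such a macro can be sketched in a few lines. This is purely speculative, not DTK API: the @make macro, the _TARGETS registry, and the invalidation rule are all made up. It caches each target by the hash of its source expression, so editing the code invalidates the target, matching the "loader saved as a string and hashed" intuition above:

```julia
# Hypothetical sketch, not DataToolkit API: an @make that caches each
# target keyed by the hash of its source expression, so changing the
# code invalidates the cached value. A real implementation would also
# hash the input data, and persist this registry between sessions.
const _TARGETS = Dict{UInt,Any}()

macro make(expr)
    key = hash(string(expr))  # computed once, at macro expansion
    quote
        get!(_TARGETS, $key) do
            $(esc(expr))  # evaluated only when this expression is new
        end
    end
end
```

With this, a line such as iris_clean = @make InternalPackage.cleaning_step(iris) would rerun the cleaning step only when the expression itself differs from what was last recorded.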

In an ideal world, it would be great to be also able to have a command to check for outdated objects, though maybe that already exists?

Thanks again!

arnold-c avatar May 20 '25 18:05 arnold-c