DataDeps.jl
DataDeps.jl copied to clipboard
Garbage collection functionality
It'd be really cool if DataDeps had a gc
function which could identify old datadeps that aren't referenced/needed anymore and suggest deleting them. I'm not quite sure how it work, but it wouldn't need to be perfect. It could just make a best guess, ask the user, and they can always redownload things they actually need.
Yes, I have some thoughts on this:
Some options:
- just use Artifacts.jl instead of DataDeps.jl.
- We workout how to hook into what Artifacts use for this.
- Do something like generate Artifacts on the fly from DataDep after they are downloaded.
Do you mean Artifacts.jl in Pkg.jl? FWIW, I do think the flexibility (e.g., fetch_method
, post_fetch_method
) and macro are probably worth keeping in our use case.
Yeah, is Artfiacts part of Pkg right now? I thought it was a seperate stdlib.
I do think the flexibility (e.g., fetch_method, post_fetch_method) and macro are probably worth keeping in our use case.
Agreed. But option 2 or 3 might let us still do that.
I think maybe the dynamic artifact creation in OhMyArtifacts.jl might work to achieve option 3. but I haven't looked closely. (cc @chengchingwen)
I think it's doable but would take some effort to integrate the two. There are some problems and behavior differences need to be resolved. Things like:
- What do we want to bind the life-time of a DataDep to? OhMyArtifacts.jl (also Scratch.jl) bind to the Project(.toml) who call it, so if we want to build on top of that, we would need to figure out who defined that DataDep. Otherwise all the DataDep will live until DataDeps.jl been removed.
- It doesn't seem to be possible to migrate the old downloaded DataDep, should we just delete them all?
- DataDep name need to be unique, but entry name in Artifacts.jl can bind to different artifact.
- Currently each entry in OhMyArtifacts.jl is a single file (instead of a directory), I would need to come up with a better design for holding the directory structure.
OTOH using Scratch.jl might probably be the most effortless and behavior consistent approach (though 1. 2. remains), we can just store each DataDep in a scratchspace folder and bind the lifetime to the caller project with get_scratch!(DataDeps, datadep_name, THAT_PACKAGE)
and somehow allow DataDeps.register
or DataDep constructor to pass in the caller package. Then when that package get GCed, the scratchspace will also be removed.
@oxinabox I just tag a new release for OhMyArtifacts.jl. It should be able to handle directory now. I can try to make a PR to switch the backend to it, do you have a pointer on what files I would need to modified?
-
We can work this out by using the
__location__
from where the datadep macro is called I think then searching for a Project.toml. Though really i guess we want the location whereregister
was called. For hacks we could use thestacktrace()
and go find it? -
No leave them alone. But we could put something in the release notes
-
It is a flaw in DataDeps that they need to be globally unique, since it means clashes are possible. Unlike with the smarter content hashing that Artifacts use. One thing we do need to preserve is the ability to access datadeps via a global name (though that is allowed to error when there is a clash I guess), since i am sure some people install a package and then access the datadeps it registers by name.
Maybe scratch spaces, as you say are the better way to do this.