DataDeps.jl icon indicating copy to clipboard operation
DataDeps.jl copied to clipboard

Garbage collection functionality

Open rofinn opened this issue 3 years ago • 7 comments

It'd be really cool if DataDeps had a gc function which could identify old datadeps that aren't referenced/needed anymore and suggest deleting them. I'm not quite sure how it work, but it wouldn't need to be perfect. It could just make a best guess, ask the user, and they can always redownload things they actually need.

rofinn avatar Oct 28 '21 19:10 rofinn

Yes, I have some thoughts on this:

Some options:

  1. just use Artifacts.jl instead of DataDeps.jl.
  2. We workout how to hook into what Artifacts use for this.
  3. Do something like generate Artifacts on the fly from DataDep after they are downloaded.

oxinabox avatar Oct 29 '21 09:10 oxinabox

Do you mean Artifacts.jl in Pkg.jl? FWIW, I do think the flexibility (e.g., fetch_method, post_fetch_method) and macro are probably worth keeping in our use case.

rofinn avatar Oct 29 '21 13:10 rofinn

Yeah, is Artfiacts part of Pkg right now? I thought it was a seperate stdlib.

I do think the flexibility (e.g., fetch_method, post_fetch_method) and macro are probably worth keeping in our use case.

Agreed. But option 2 or 3 might let us still do that.

oxinabox avatar Oct 29 '21 14:10 oxinabox

I think maybe the dynamic artifact creation in OhMyArtifacts.jl might work to achieve option 3. but I haven't looked closely. (cc @chengchingwen)

oxinabox avatar Nov 16 '21 12:11 oxinabox

I think it's doable but would take some effort to integrate the two. There are some problems and behavior differences need to be resolved. Things like:

  1. What do we want to bind the life-time of a DataDep to? OhMyArtifacts.jl (also Scratch.jl) bind to the Project(.toml) who call it, so if we want to build on top of that, we would need to figure out who defined that DataDep. Otherwise all the DataDep will live until DataDeps.jl been removed.
  2. It doesn't seem to be possible to migrate the old downloaded DataDep, should we just delete them all?
  3. DataDep name need to be unique, but entry name in Artifacts.jl can bind to different artifact.
  4. Currently each entry in OhMyArtifacts.jl is a single file (instead of a directory), I would need to come up with a better design for holding the directory structure.

OTOH using Scratch.jl might probably be the most effortless and behavior consistent approach (though 1. 2. remains), we can just store each DataDep in a scratchspace folder and bind the lifetime to the caller project with get_scratch!(DataDeps, datadep_name, THAT_PACKAGE) and somehow allow DataDeps.register or DataDep constructor to pass in the caller package. Then when that package get GCed, the scratchspace will also be removed.

chengchingwen avatar Nov 16 '21 15:11 chengchingwen

@oxinabox I just tag a new release for OhMyArtifacts.jl. It should be able to handle directory now. I can try to make a PR to switch the backend to it, do you have a pointer on what files I would need to modified?

chengchingwen avatar Jun 19 '22 15:06 chengchingwen

  1. We can work this out by using the __location__ from where the datadep macro is called I think then searching for a Project.toml. Though really i guess we want the location where register was called. For hacks we could use the stacktrace() and go find it?

  2. No leave them alone. But we could put something in the release notes

  3. It is a flaw in DataDeps that they need to be globally unique, since it means clashes are possible. Unlike with the smarter content hashing that Artifacts use. One thing we do need to preserve is the ability to access datadeps via a global name (though that is allowed to error when there is a clash I guess), since i am sure some people install a package and then access the datadeps it registers by name.

Maybe scratch spaces, as you say are the better way to do this.

oxinabox avatar Sep 01 '22 18:09 oxinabox