
Caching for local packages

Open tzakharko opened this issue 2 years ago • 13 comments

I noticed local packages are currently not cached. If this feature is considered to be of interest, I can try to implement caching. From a cursory glance I suppose this involves manipulation of satisfy_remote_local() and/or installedok_remote_local()? Any pointers on where to start?

tzakharko avatar Dec 01 '21 09:12 tzakharko

The main issue for local packages is that it is harder to decide if the cache is valid or not.

gaborcsardi avatar Dec 01 '21 09:12 gaborcsardi

What about something like this: one silently builds the local package, takes a checksum, and checks it against the cache. If building the package fails, we throw an error.

Or, a pragmatic approach: simply checksum the contents of the local directory. This might result in many false positives, but it's probably good enough for the use case.
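To make the pragmatic approach concrete, here is a minimal sketch of hashing a local source tree; local_tree_hash() is a hypothetical helper name, not pkgdepends API:

local_tree_hash <- function(path) {
  # Hash every file in the tree, keyed by its relative path
  rel <- sort(list.files(path, recursive = TRUE, all.files = TRUE))
  md5 <- tools::md5sum(file.path(path, rel))
  # Combine the per-file hashes into a single digest for the cache key
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeLines(paste(rel, md5, sep = "  "), tmp)
  unname(tools::md5sum(tmp))
}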

tzakharko avatar Dec 01 '21 10:12 tzakharko

What about something like this: one silently builds the local package, takes a checksum, and checks it against the cache. If building the package fails, we throw an error.

If you already build the (binary?) package, then you might as well install it, that is fast.

Or, a pragmatic approach: simply checksum the contents of the local directory. This might result in many false positives, but it's probably good enough for the use case.

Yeah, something like that would work. What is a false positive here? A cache miss that should not happen? Or a cache hit that should not happen?

gaborcsardi avatar Dec 01 '21 10:12 gaborcsardi

The tricky stuff is that if you build-time depend on some package, and that has changed, that should ideally trigger a re-install. Similarly, if the system dependencies have changed, that should also trigger a re-install. It does not seem possible to solve this 100%, and it is hard to give a solution that won't break some people's workflows.

gaborcsardi avatar Dec 01 '21 10:12 gaborcsardi

The tricky stuff is that if you build-time depend on some package, and that has changed, that should ideally trigger a re-install. Similarly, if the system dependencies have changed, that should also trigger a re-install. It does not seem possible to solve this 100%, and it is hard to give a solution that won't break some people's workflows.

Yeah, I see what you mean. I think the problem is that there are really two use cases — local package installation for development purposes, where the current behavior of always reinstalling makes sense (probably how this is used in most cases), and just installing a package that has for some reason not been published yet.

My use case that motivated this inquiry is that I have this very big data export/validation/transformation pipeline. To prevent its dependencies from interfering with the usual R system, it installs packages in a private library — there is basically a pkg::pkg_install(deps) story going on at every pipeline invocation, which works great. But the internal packages (which are provided as source trees) get reinstalled every single time, which is not a big deal honestly, but it does bug me a little bit :)
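For context, the setup looks roughly like the sketch below, assuming pak::pkg_install() (pak wraps pkgdepends); the library path and package names are made up:

# A private library plus a pkg_install() call on every pipeline run;
# illustrative paths and names only
private_lib <- "/opt/pipeline/r-library"
dir.create(private_lib, recursive = TRUE, showWarnings = FALSE)
pak::pkg_install(
  c("dplyr", "local::./internal/exportpkg"),  # CRAN dep + local source tree
  lib = private_lib
)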

tzakharko avatar Dec 01 '21 10:12 tzakharko

In theory we could also try to use the mtime of the files/directories, but that is unreliable on some systems, so better to stay away from it, I think.

I think hashing the files in the local dir could work reasonably well. There is already a ?nocache option to opt out of the cache, so we would observe that and skip the hashing + caching altogether. We could also have a ?cache option to force the cache for a local package; maybe that's useful, but I am not convinced.
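For reference, these parameters go on the package reference itself; the path below is illustrative:

# `?nocache` opts a local ref out of the cache; `?reinstall` forces
# a rebuild + reinstall even when the cache would hit
pkgdepends::new_pkg_installation_proposal("local::./mypkg?nocache")
pkgdepends::new_pkg_installation_proposal("local::./mypkg?reinstall")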

gaborcsardi avatar Dec 01 '21 10:12 gaborcsardi

Relying on an extra option would introduce additional complexity and arguably make the user experience worse, so I am not really a fan. Maybe one could have two different types of local packages — a local one that always builds and another one that is cached, but I think this quickly becomes a bit ridiculous. I would be happy with simply supporting git remotes (#14), as it would solve all my woes and is probably a more useful approach anyway.

tzakharko avatar Dec 01 '21 10:12 tzakharko

@gaborcsardi If you don't have any other comments, I would close this in favor of #14.

tzakharko avatar Dec 01 '21 10:12 tzakharko

Oh, I do want to have caching support for local packages, so let's keep this open. :) I think more people would use this than git remotes, actually. But I understand if you want to focus on #14 instead of this.

gaborcsardi avatar Dec 01 '21 10:12 gaborcsardi

Btw. this is a good summary of the issues with mtimes: https://apenwarr.ca/log/20181113

gaborcsardi avatar Dec 01 '21 10:12 gaborcsardi

I think the ultimate question is what caching should accomplish — avoiding doing work (e.g. building the package) or avoiding installing the package if it has not changed? If it is the first, then yes, it is going to be more tricky (I don't know whether pkgdepends already looks at build-time dependencies, for example, or whether that is something that would need to be added). If you want to be paranoid about these things, rebuilding the package (but staying silent unless it actually changed) would work, but it will take extra time and work.

My approach would be to checksum the directory contents (I wouldn't bother with mtimes TBH, too much hassle); that's performant enough on modern machines and with usual package sizes. One would need to take care of build-time dependencies, though; see the sketch below. As to system configuration changes... for that one might need a "force rebuild" option of some sort.
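A rough sketch of folding the declared build-time dependencies into the cache key, reusing the hypothetical local_tree_hash() from above; note this only sees versions of packages declared in DESCRIPTION, nothing else:

cache_key <- function(pkg_path) {
  # Declared dependencies from DESCRIPTION, via the desc package
  deps <- desc::desc(file = pkg_path)$get_deps()
  deps <- setdiff(
    deps$package[deps$type %in% c("Depends", "Imports", "LinkingTo")],
    "R"
  )
  # Installed version of each dependency (NA if not installed)
  vers <- vapply(deps, function(p) {
    tryCatch(as.character(utils::packageVersion(p)),
             error = function(e) NA_character_)
  }, character(1))
  paste0(local_tree_hash(pkg_path), "-",
         paste(deps, vers, sep = "=", collapse = ";"))
}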

But I understand if you want to focus on #14 instead of this.

Well, both would solve my current "problem" and I think that both should be done ultimately, so I am happy to contribute. Caching local files is probably the simpler thing (all the tricky corner cases aside).

tzakharko avatar Dec 01 '21 10:12 tzakharko

Avoiding the install is less important, I think. Installing the binary package is fast, basically an unzip/untargz.

I don't want to be paranoid, hence my opinion re checksum. I would not even worry about build time dependencies yet. The existing ?nocache or ?reinstall options already force a rebuild + reinstall.

FWIW, a quick experiment on the BH package, which is relatively big and has many files:

❯ system.time(d <- dir(recursive=TRUE))
   user  system elapsed
  0.059   0.039   0.097

❯ system.time(mt <- file.mtime(d))
   user  system elapsed
  0.003   0.017   0.020

❯ system.time(fs <- file.size(d))
   user  system elapsed
  0.003   0.017   0.020

❯ system.time(md5 <- tools::md5sum(d))
   user  system elapsed
  0.254   0.124   0.378

❯ length(d)
[1] 11691

❯ system("du -hs .")
144M	.

This is a pretty fast machine, so it is probably slower anywhere else, especially on Windows. But this is encouraging nevertheless.

gaborcsardi avatar Dec 01 '21 11:12 gaborcsardi

Looks great! So even with paranoid checksumming we can expect the check to complete well under a second on an average machine with an average package size.

I will probably have some time next week to look into implementing this. Are there any docs on what satisfy() and installedok() are supposed to do?
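In the meantime, here is how I picture the installedok side. This is purely illustrative of the idea (not the real internal signature), and LocalTreeHash is a made-up DESCRIPTION field:

# Compare a hash recorded at install time against a fresh hash of the
# source tree; assumes local_tree_hash() from the earlier sketch
installedok_local_sketch <- function(lib, pkg, pkg_path) {
  dfile <- file.path(lib, pkg, "DESCRIPTION")
  if (!file.exists(dfile)) return(FALSE)
  recorded <- read.dcf(dfile, fields = "LocalTreeHash")[1, 1]
  !is.na(recorded) && identical(unname(recorded), local_tree_hash(pkg_path))
}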

tzakharko avatar Dec 01 '21 12:12 tzakharko