
Set to remove old media directory before exporting it again to clean unused files

evandrocoan opened this issue

evandrocoan, Aug 02 '24

Thanks very much for the contribution (as always!)!

Clearing out all unused media files is indeed what we should be doing!

I'll need to check two things (in a couple of days, since I won't have time in the immediate future):

  1. whether there's any risk of undesired removal of files elsewhere due to undefined variables (I think not, but best to be sure). Maybe it's best to just remove all the files inside the media directory, rather than doing the equivalent of `rm -r media/`, to limit the potential damage scope in case we modify something elsewhere?
  2. whether there's a performance slowdown (almost certainly not, since we're simply copying everything anyway in the next step).
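For illustration, the "remove only the files inside the directory" idea could look roughly like this. This is a minimal sketch, not CrowdAnki's actual code; `clear_media_directory` is a hypothetical helper name:

```python
import os

def clear_media_directory(media_directory):
    """Delete only the regular files directly inside media_directory,
    instead of recursively removing the directory tree itself.

    Compared to the equivalent of `rm -r media/`, this limits the blast
    radius if media_directory ever ends up pointing somewhere unexpected:
    subdirectories and anything nested inside them are left untouched."""
    for entry in os.scandir(media_directory):
        if entry.is_file(follow_symlinks=False):
            os.remove(entry.path)
```

Since Anki media collections are flat (no subdirectories), skipping directories loses nothing in the normal case while capping the damage in the abnormal one.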

aplaice, Aug 02 '24

Thanks again! Sorry for the delay!

Testing with a synthetic deck with 10000 randomly generated PNGs, it seems that, with your PR applied:

  1. for a fresh export, there's no performance regression
  2. for an export over an existing collection, there's a performance gain
  3. for snapshotting, there doesn't seem to be much of a difference because everything is dwarfed by the git/dulwich operations...

AFAICT there's currently no safety issue with the code: the deck_directory is ultimately set by the user (both for export and snapshotting), so it could technically be anything, but we'll only be affecting /media.

However, I'm still uncomfortable, given the various famous `rm -rf $ACCIDENTALLY_UNSET_VAR/*` bugs. I'm not quite sure what would trigger the issue in our case, and Python doesn't have footguns as huge as bash's, but it's not impossible that some otherwise innocuous change elsewhere could cause the media_directory variable to be set to something undesirable in some edge cases. I'll try to reduce the potential fallout in these cases, if I can do it without significant performance regressions!
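One cheap defence against the `rm -rf $ACCIDENTALLY_UNSET_VAR/*` class of bugs is to refuse to delete anything unless the path actually looks like a media subdirectory. A hedged sketch (the `safe_remove_media` name and the basename check are assumptions for illustration, not what CrowdAnki does):

```python
import os
import shutil

def safe_remove_media(media_directory):
    """Recursively delete media_directory, but only if its final path
    component is literally "media".

    If an upstream change ever leaves the variable empty or pointing at,
    say, the user's home directory, this raises instead of deleting."""
    media_directory = os.path.abspath(media_directory)
    if os.path.basename(media_directory) != "media":
        raise ValueError(
            "refusing to delete non-media path: %r" % media_directory)
    shutil.rmtree(media_directory)
```

The check is crude, but it converts a silent data-loss scenario into a loud error, which is usually the right trade-off for a destructive operation.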

aplaice, Sep 15 '24

Thanks for looking into this. The next step would be not deleting the complete media directory, but selectively removing missing files and copying only new files. This optimization matters because my collection has several images and audio files for each note, and always copying all the media takes a minute of I/O.

evandrocoan, Sep 15 '24

> This optimization matters because my collection has several images and audio files for each note, and always copying all the media takes a minute of I/O.

Sounds pretty annoying! :(

> The next step would be not deleting the complete media directory, but selectively removing missing files and copying only new files.

One (relatively rare but still important) concern is that existing files may have changed in place. Checking the mtime of each file is probably good enough, but there might be edge cases where it isn't, so one might need to compare hashes instead (which could still be faster than blindly copying everything).
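A rough sketch of what such a selective one-way sync might look like, under the assumptions discussed above (size + mtime as the cheap check, with an optional hash comparison). `sync_media` is a hypothetical helper, not part of CrowdAnki:

```python
import hashlib
import os
import shutil

def _sha1(path):
    """Hash a file's contents for the expensive equality check."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).digest()

def sync_media(src, dst, check_hash=False):
    """One-way sync from src to dst:
      * delete files in dst that no longer exist in src,
      * copy files that are new or appear changed,
      * skip files whose size and mtime match (optionally confirmed
        by comparing SHA-1 hashes when check_hash is True)."""
    os.makedirs(dst, exist_ok=True)
    src_names = set(os.listdir(src))
    # Remove files that disappeared from the source.
    for name in set(os.listdir(dst)) - src_names:
        os.remove(os.path.join(dst, name))
    for name in src_names:
        s, d = os.path.join(src, name), os.path.join(dst, name)
        if os.path.exists(d):
            ss, ds = os.stat(s), os.stat(d)
            same = (ss.st_size == ds.st_size
                    and int(ss.st_mtime) == int(ds.st_mtime))
            if same and check_hash:
                same = _sha1(s) == _sha1(d)
            if same:
                continue  # unchanged: skip the copy entirely
        # copy2 preserves mtime, so the next run can skip this file too.
        shutil.copy2(s, d)
```

Whether the hash check pays off depends on the collection: for mostly-unchanged media it only reads files (no writes), which should still beat recopying everything, but for frequently-changing files it doubles the reads.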

aplaice, Sep 15 '24