Feature Request: Find duplicates on modern file systems

Open ghost opened this issue 2 years ago • 8 comments

This tool is great, thanks for your hard work! It saved me from a lot of anger when ISIS was down. I have one more suggestion:

Sometimes you want to download an old ISIS course because it contains useful material that the current course does not have yet (future exercises to work ahead on, or videos from online teaching that have since been replaced by in-person tutoring). Of course, content such as exercises or current information can change, which is why it's handy to have both courses.

However, the two courses often contain duplicates, e.g. videos. To save disk space, isisdl could benefit from the copy-on-write feature of modern file systems like Btrfs, XFS or APFS. Maybe you can use https://github.com/jbruchon/jdupes for this.

It would be cool to have this feature in isisdl, so that files are not downloaded multiple times but only cloned. For example, if an ISIS course from the summer semester 2022 gets a new video that is already included in the 2021 course, you should not have to download it separately. Instead, isisdl could use copy-on-write, or perhaps a symlink on older file systems.

Maybe this is just an edge case and not worth programming, but I know some people who access old ISIS courses because lecturers refuse to provide material in advance for some irrational reason.

ghost avatar May 13 '22 09:05 ghost

This is an excellent feature, and one which is already partially implemented.

First off: I find hard links to be a better alternative to copy-on-write, for the following reasons:

  • Any reasonably modern filesystem (except maybe FAT32) supports them.
  • If you modify a file in one "place", you modify it "everywhere". Depending on your use case this might be a drawback. I find it to be a feature.
  • Copy-on-write is not supported on ext4, which is the main file system used by isisdl users.

Considering these points, hard links are a better option than copy-on-write.
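
The "modify it in one place, modify it everywhere" behaviour can be seen in a few lines of Python. This is only an illustrative sketch; the function and file names are made up and have nothing to do with isisdl's code.

```python
from __future__ import annotations

import os
import tempfile


def demo_hard_link(directory: str) -> tuple[bool, str]:
    """Create a file, hard-link it, and write through the second name."""
    original = os.path.join(directory, "lecture_2021.txt")  # hypothetical name
    linked = os.path.join(directory, "lecture_2022.txt")    # hypothetical name

    with open(original, "w") as f:
        f.write("version 1")

    # Both names now refer to the same data on disk; nothing is copied.
    os.link(original, linked)
    same_file = os.path.samefile(original, linked)

    # Writing through one name is visible through the other name.
    with open(linked, "w") as f:
        f.write("version 2")
    with open(original) as f:
        content = f.read()

    return same_file, content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_hard_link(tmp))
```

With copy-on-write the second write would instead give the two names separate contents, which is the behavioural difference discussed above.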

Current heuristic

The current heuristic is as follows (source, source2):

  1. If files have the same course id, name and size, they are deemed to be equal.
  2. If files have the same download url they are deemed to be equal.

Luckily for us, the ISIS server does most of the heavy lifting. It implements de-duplication on its own and provides the same download URL even across different courses. Thus, most of the space used (95% with my courses / files) is already de-duplicated. Sometimes, however, this doesn't really work, or the same videos are compressed a little bit differently. But in this case no de-duplication could help us.

Now you might notice that this heuristic currently neglects identical documents across different courses. This is intended, since the only information available for comparing them is the file size.

Why set the size on a per course basis?

The assumption that every size is unique within a single course is usually valid; however, it is not if you consider all courses. Because of that, the detection in advance is limited to files within a single course. There simply isn't enough information at this point in time to further determine whether files are equal.

This, however, can be solved if duplicates are de-duplicated after downloading. This is a bit more inefficient, but considering that these files make up about 5% of the total download size, this should be fine. With the full information about each file and an almost infinite read speed, it is possible to de-duplicate them even more.
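
A minimal sketch of what such a post-download pass could look like, assuming byte-identical files should be replaced with hard links. This is neither isisdl's nor jdupes' implementation: the function names are hypothetical, and it trusts a SHA-256 digest instead of a byte-by-byte comparison.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos are never fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace identical files under root with hard links; return bytes saved."""
    # Cheap first pass: only files with the same size can be identical.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    saved = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        first_seen = {}  # digest -> first path seen with that content
        for path in sorted(paths):
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first occurrence, or already linked
            # Link under a temporary name, then rename over the duplicate.
            tmp = path + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            saved += size
    return saved
```

Grouping by size first means only files that share a size are ever read and hashed, so the expensive part of the work is limited to likely duplicates.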

I plan on using jdupes for the de-duplication after downloading. It will be part of the isisdl --compress routine and will probably only be available on Linux / if the binary is in the PATH.
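
Checking for the binary and calling it could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and the -r (recurse) and -L (hard link duplicates) flags are the ones used with jdupes elsewhere in this thread.

```python
import shutil
import subprocess


def build_jdupes_command(target_dir, binary="jdupes"):
    """Return the command line as a list, or None if the binary is not in PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    # -r: recurse into subdirectories, -L: replace duplicates with hard links
    return [path, "-r", "-L", target_dir]


def run_dedup(target_dir):
    """Run jdupes if it is available; return False when it was skipped."""
    cmd = build_jdupes_command(target_dir)
    if cmd is None:
        return False
    subprocess.run(cmd, check=True)
    return True
```

Because the availability check is just a PATH lookup, nothing in this sketch is specific to Linux.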

Emily3403 avatar May 14 '22 17:05 Emily3403

I've just re-downloaded my entire contents of ISIS (181GB videos, 6GB Documents) and threw jdupes at it:

du -sb isisdl
>>> 192456576558 isisdl

jdupes -r -L isisdl

du -sb isisdl
>>> 192410524062 isisdl

As you can see, the size decreased by about 44 MiB.
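
The figure can be checked from the two du outputs above. A small Python sketch (the function name is made up for illustration):

```python
def savings(before_bytes, after_bytes):
    """Return (bytes saved, MiB saved, percent of the original size)."""
    saved = before_bytes - after_bytes
    return saved, saved / 2**20, 100 * saved / before_bytes


if __name__ == "__main__":
    saved, mib, percent = savings(192_456_576_558, 192_410_524_062)
    print(saved)              # 46052496 bytes
    print(round(mib, 1))      # about 43.9 MiB
    print(round(percent, 3))  # about 0.024 percent of the total
```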

If you want, you can post your results as well. It would be interesting to see whether it works better for your use case.

Emily3403 avatar May 15 '22 07:05 Emily3403

Great technical write-up! This is definitely a reasonable approach. To be honest, I do not know a whole lot about filesystems (I'm a second-semester CS student anyway) and was just bold enough to request a feature without real in-depth knowledge. :D

I did not even know about hard links. I just discovered this deduplication feature of APFS and thought, hey, maybe this could save some disk space. But indeed hard links are better and more reliable across the board. I'm looking forward to seeing this feature in isisdl.

I also ran jdupes -r -L and my downloads went down from 114GB to 71GB, so an almost 40% decrease. And I haven't even downloaded all the courses I need. Also, isisdl is not confused by hard links. When I used deduplication with jdupes -B -r instead, it messed up some metadata, I guess: isisdl would not recognize all data anymore and started downloading again. If this becomes part of the --compress option, it would be nice to be able to choose between compression via ffmpeg and duplicate checking only, since the former requires quite a lot of CPU muscle and time.

HMU if you need some testing on macOS. So far jdupes is working fine, so I can imagine this does not have to be Linux-only functionality. The only two useful package managers on macOS also bring jdupes into the PATH, so this should not be a problem. A note about this on GitHub or during initial setup should be enough. Since this is nice-to-have and not a crucial requirement, users will also be fine without it and can reconfigure their setup if they install jdupes afterwards.

ghost avatar May 15 '22 10:05 ghost

Sorry for responding so late, I have a lot of work with Uni at the moment. Anyway - would you mind sharing the courses you are currently enrolled in? With this information I should be able to track down which files are not correctly hard-linked and where the space is saved, since this should be possible without jdupes.

isisdl itself doesn't use much metadata from the files - only the size and, when syncing, the entire content of the file. This is due to the fact that "interesting" metadata about files is not uniform across different filesystems. When executing isisdl normally, the only metadata queried is the size. I would assume that jdupes did not mess with this attribute, so isisdl should not be confused about what has been downloaded and what has not. You can of course try to execute isisdl --sync in order to synchronize the database, but it should not be necessary. In fact, in my testing I found that isisdl was never confused about files after executing jdupes. Does isisdl get confused consistently, or only sometimes?

I don't know exactly how or when I will implement this feature, but I want to keep the number of questions in the configuration assistant as low as possible. Maybe it will run as a first step in the compression process, since checking for duplicates does not require that much CPU power. Afterwards the user could cancel the compression. But I'll think about that in due time.

Thanks anyway for the feature suggestion ^^

Emily3403 avatar May 23 '22 14:05 Emily3403

So I don't quite know the status of development, but I hope my feedback still helps. I configured isisdl to download files for AlgoDat 2021 and 2022, Sysprog 2021 and 2022, and ForSA 2022. Also, I often have to run isisdl twice to download everything, because it misses new files in the first run.

ghost avatar Jun 27 '22 21:06 ghost

First of all thanks for the courses. I could find all of them but SysProg. Can you send me the course ID located in https://isis.tu-berlin.de/course/view.php?id={}?

I'll try jdupes myself on that dataset, and I'm interested in the results. Maybe the savings made by the jdupes algorithm could be natively integrated into isisdl's file size reduction algorithm?

As for the current state of development: I would love to implement a frontend for jdupes. It is a bit tricky to let isisdl know which files should be which, and thus it takes a bit of time and effort to make that work. Currently, I don't have the time to implement this feature. Maybe I'll get around to it in the semester break.

If, however, you are interested in coding a frontend for jdupes, I gladly accept pull requests ^^

For the multiple download bug: I could verify it. I don't know what causes it yet, but I think it's a course that is somehow broken.

Emily3403 avatar Jun 27 '22 22:06 Emily3403

The course id for Sysprog 2022 is 28476, and for 2021 it is 23037. I don't know if I can deliver satisfactory code quality for the jdupes frontend, but I would give it a try during the semester break (so in about a month). :-)

ghost avatar Jul 02 '22 21:07 ghost

Sounds good ^^ If you have any further questions regarding how isisdl works internally feel free to ask ^^

Emily3403 avatar Jul 03 '22 14:07 Emily3403