fslint
Multi-threaded
Original issue 56 created by pixelb on 2010-09-08T01:59:32.000Z:
Operations from the GUI should be multi-threaded. That is, once a list of duplicate files has been located, merging them should (potentially) be handled by as many threads as there are processor cores. I'm fairly certain there are other areas within the application that could benefit from multi-threading as well. As it currently stands, the merging process seems to use a single core and is consequently quite the bottleneck for the application.
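For reference, a minimal sketch of what the request describes, using Python's standard concurrent.futures to fan per-group work out across cores; merge_group here is a hypothetical placeholder for fslint-gui's per-duplicate-group merge routine, not its real API:

```python
# Sketch only: one worker per processor core, as the issue suggests.
# merge_group is a hypothetical placeholder, not fslint's actual code.
import os
from concurrent.futures import ThreadPoolExecutor

def merge_all(duplicate_groups, merge_group):
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        # map() yields results lazily; list() forces completion and
        # surfaces any exception raised inside a worker.
        list(pool.map(merge_group, duplicate_groups))
```

One caveat on the design: work that is CPU-bound in pure Python would largely serialise on the interpreter's GIL, so a ProcessPoolExecutor might be the more useful choice there, whereas link/unlink system calls block outside the GIL and thread reasonably well.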
Comment #1 originally posted by pixelb on 2010-09-08T10:04:13.000Z:
I've considered this, and it would be easy enough to add given the architecture (checksumming is already done in a coprocess). However, is the disk not the bottleneck? If you have multiple disks and multiple cores then there may be a case for spreading the tasks, but I'm not sure it's worth it.
Comment #2 originally posted by pixelb on 2010-09-08T14:40:55.000Z:
No, I don't believe the drive is the current bottleneck in any way. The fact that once I start a merge the iowait time is next to nothing while the CPU utilization for one core is pegged at 100% seems to clearly indicate a threading bottleneck.
Furthermore, I've noticed that the GUI is almost completely unresponsive during this time.
Comment #3 originally posted by pixelb on 2010-09-08T15:13:20.000Z:
Oh, the merge. That should be trivial. It just does unlink(dup); link(orig, dup); I guess some file systems could take a while to unlink(), but that time should be accounted to the system rather than to fslint.
How many files are you merging? Do you have many large files? Is all the CPU allocated to fslint-gui?
I'll do some testing later...
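For illustration, the two calls described above wrapped as a function (a sketch only, not fslint's actual source):

```python
import os

def merge_one(orig, dup):
    # Replace the duplicate with a hard link to the original,
    # i.e. unlink(dup) followed by link(orig, dup) as described above.
    os.unlink(dup)
    os.link(orig, dup)
```

A more cautious variant would link the original to a temporary name first and then rename() it over the duplicate, so the duplicate never disappears if the process is interrupted between the two calls.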
Comment #4 originally posted by pixelb on 2010-09-08T15:32:01.000Z:
It's a mixture of file sizes, some multiple gigabytes and some only a few kilobytes. During the entire merge, the Python process for fslint's GUI was pegging a single processor core. The scan itself took several hours (8 or more), and the merge took a few hours as well (2 or so).
Comment #5 originally posted by pixelb on 2010-09-08T16:13:13.000Z:
You must have many huge files that are the same size. Though even in that case, fslint is tweaked to not checksum the whole file if possible. So you must have had many huge duplicate files. Two hours to merge, though, is very surprising. I'll have a look.
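The "not checksum the whole file if possible" tweak can be pictured roughly as a cheap first pass over only the leading bytes of each same-size group (a sketch under that assumption; the chunk size and hash here are illustrative, not fslint's actual values):

```python
import hashlib
from collections import defaultdict

CHUNK = 64 * 1024  # leading bytes hashed in the cheap first pass (illustrative)

def partial_digest(path, size=CHUNK):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        h.update(f.read(size))
    return h.hexdigest()

def narrow_by_partial_hash(same_size_files):
    """Split a same-size group by a hash of the first CHUNK bytes.
    Only sub-groups that still contain more than one file need a
    full-file checksum afterwards."""
    groups = defaultdict(list)
    for path in same_size_files:
        groups[partial_digest(path)].append(path)
    return [g for g in groups.values() if len(g) > 1]
```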
Comment #6 originally posted by pixelb on 2010-09-08T17:54:15.000Z:
Yes, the volume is the storage location for my nightly backups made using "Back In Time" [1]. Normally, fslint would not be needed in such a situation, as the backups are supposed to start from hard-link copies of a previous backup and only replace changed files. However, due to a problem I've been having with the software, I needed to create a fresh backup, and in doing so duplicated a LOT of files and space. So I figured I'd use fslint to locate the duplication. I believe it did manage to locate the duplication, but I can't say that it seemed very efficient in doing so. I would have thought it'd examine the inodes of the files as a starting point, if they are on the same physical volume, which should be a very inexpensive operation. But as I said, the scan of the location easily took over 8 hours.
[1] - http://backintime.le-web.org/
Comment #7 originally posted by pixelb on 2010-09-08T21:38:31.000Z:
Ah, so you must have had very many already hardlinked files.
FSlint used to only checksum a particular inode once, but that meant it could misreport some duplicate groups if they had separate inodes. I.e. in the case below, the previous logic would have missed file2, whereas the current logic is inefficient.
inode  file
1      file1
1      file2
2      file3
So it's not a multithreaded issue, it's an algorithmic issue. Not reporting all hardlinked files as duplicates would also mean that the GUI would have less work to do, so the merge process would be quicker.
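To make the algorithmic point concrete, here is a sketch of per-inode bookkeeping that checksums each (device, inode) pair only once but still reports every path that shares it, so a hardlinked name such as file2 is not missed (full_digest is a hypothetical whole-file checksum helper; this is not the actual findup code):

```python
import os
from collections import defaultdict

def group_by_content(paths, full_digest):
    # Collect every path that points at the same (device, inode) pair,
    # so already-hardlinked names are remembered without being re-read.
    paths_by_inode = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        paths_by_inode[(st.st_dev, st.st_ino)].append(p)

    # Read each inode's data only once, then report all of its names.
    groups = defaultdict(list)
    for (dev, ino), names in paths_by_inode.items():
        groups[full_digest(names[0])].extend(names)

    # Only groups with more than one path are duplicates (or hardlinks).
    return [g for g in groups.values() if len(g) > 1]
```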
Comment #8 originally posted by pixelb on 2011-06-11T20:35:03.000Z:
One comment: a number of backup/archive programs either default to generating, or can be configured to generate, an archive that is split across multiple files, often of a fixed size. This is one case where you can end up with a very large number of files of identical size.