dduper
"insane fast" is not very fast at all
When run recursively on a 6.2GB directory containing 767 files, I expected the insanely fast mode to:
- compute a summary for each file from the csums of all the blocks that make up the file;
- do a sort / uniq over these summaries.
Since the csums are already computed, this shouldn't take more than a minute on a modern computer. Instead, the process has been running for 30 minutes now and is already showing 2020 non-matching results.
My guess is that instead of using a sorting algorithm to find groups of duplicated files, your code constructed an array of n²−n file pairs (where n is the number of files under the recursive directory) and checked every pair in that array, right?
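To make the contrast concrete, here is a minimal sketch (not dduper's actual code) of the grouping approach, where fingerprint_of stands in for whatever per-file summary is built from the block csums:

from collections import defaultdict

def group_by_fingerprint(paths, fingerprint_of):
    # fingerprint_of(path) is a hypothetical callable returning a per-file
    # summary derived from the file's block csums.
    buckets = defaultdict(list)
    for path in paths:
        buckets[fingerprint_of(path)].append(path)
    # Only buckets with more than one file are duplicate candidates.
    return {fp: files for fp, files in buckets.items() if len(files) > 1}

One pass over the files plus a hash table lookup per file, instead of one comparison per file pair.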
Yes, right. It compares two files at a time, since it doesn't store previously computed results on disk. "Insane" mode is fast for a real dedupe between two files (say, comparing two 100GB files), but it's slow with a large number of files. I'll update the tool to store previously computed results in a DB (sqlite). That should improve overall performance.
Hi, how about this real quick way to solve the problem?
First, bear with me while I explain how people deduplicated files before dedup tools were popular.
Deduplicating with existing Linux commands
Back in the 90s, when Bojack Horseman was horsing around, people used to do this:
$ find dir/ -type f -exec md5sum '{}' ';' | sort | uniq -w 32 -D
This computes the md5 checksum of every file under the directory, sorts the output (so duplicated files are clustered together), then runs uniq to find all lines whose first 32 characters are duplicated. The first 32 characters are the md5sum value; the rest is the file name.
Assumption
Let's assume that there is a quick way to get a fingerprint of some sort for a file by using btrfs.
New method
Now, given that sorting and duplicate detection are already covered by standard Linux commands, you only need to add a parameter such as --fingerprint, which outputs every file's fingerprint followed by the file's name. The fingerprint can be a 32-character value computed as a checksum over all the csums of the file's blocks. A user can then use a pipe to get the duplicated items.
$ dduper --fingerprint --recursive --dir /folder | sort | uniq -w 32 -D
And you can put that info in the EXAMPLE section of your man page.
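For illustration only, a rough sketch of what --fingerprint could do internally; get_block_csums is a hypothetical placeholder for however dduper reads a file's per-block checksums from the btrfs csum tree, and each csum is assumed to be a bytes object:

import hashlib

def file_fingerprint(path, get_block_csums):
    # Hash all per-block checksums into one 32-hex-character fingerprint.
    h = hashlib.md5()
    for csum in get_block_csums(path):
        h.update(csum)
    return h.hexdigest()  # 32 hex characters, matching `uniq -w 32`

def print_fingerprints(paths, get_block_csums):
    for path in paths:
        # Fingerprint first, then the file name, so the output can be piped
        # through `sort | uniq -w 32 -D` to report duplicates.
        print("%s  %s" % (file_fingerprint(path, get_block_csums), path))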
The insanely fast but inaccurate method already working today
The following method doesn't take advantage of btrfs, so btrfs should be able to outperform it.
There is a shortcut to find potential duplicates insanely fast by simply comparing file sizes (in the case of video mp4 files larger than 1GB, files of equal size are usually duplicates).
$ du -ab /videos/ | sort -n | awk -F $'\t' '{printf("%16s\t%s\n", $1, $2)}' | uniq -w 16 -D
The additional awk in the above pipeline merely pads the file size into a fixed 16-character field (16 characters is more than enough for any plausible video file size).
But there are video files of equal size with different content, so a hash should be computed to be sure.
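As a sketch of that two-step idea in Python (group by size first, then hash only the colliding sizes to confirm), independent of btrfs:

import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    # Pass 1: group files by size; only equal-size files can be duplicates.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files whose sizes collide, to confirm real duplicates.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
    return {d: p for d, p in by_digest.items() if len(p) > 1}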
Off-topic: one of my other projects, a server with more than 225,000 users, burned down :sob: http://community.webminal.org/t/webminal-org-down-status-update-thread/1481 I will get back to this issue soon.
Would it be possible to perform block scanning and hashing in parallel, using a process pool or even Numba?
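For example, a minimal sketch of parallel hashing with a process pool (concurrent.futures); this is a general illustration, not dduper's implementation:

import hashlib
from concurrent.futures import ProcessPoolExecutor

def hash_file(path, chunk_size=1 << 20):
    # Hash one file in chunks so large files don't have to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_files_parallel(paths, workers=4):
    # Each file is hashed in a separate worker process; results come back
    # as (path, digest) pairs.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(hash_file, paths))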
When run recursively on a 6.2GB directory containing 767 files, I expected the insanely fast mode to: 1. compute a summary for each file from the csums of all the blocks that make up the file; 2. do a sort / uniq over these summaries.
A missing point is the os.walk() step. If you have many more than 767 files you will run into problems with the current implementation here (a streaming sketch follows this list): https://github.com/Lakshmipathi/dduper/blob/87450782721beb7efc585b0c6697dabd82c423fc/dduper#L427-L443
- Collecting all the files may take a really long time before the real dedupe part can start.
- RAM usage will become a problem at some point, because the file list just gets too big.
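A minimal sketch of the streaming alternative, assuming a hypothetical process_file() callback for the per-file work:

import os

def iter_files(root):
    # Yield file paths one at a time instead of materialising the full list.
    for dirpath, _, names in os.walk(root):
        for name in names:
            yield os.path.join(dirpath, name)

def dedupe_stream(root, process_file):
    # process_file(path) is a placeholder for the per-file work (csum fetch,
    # DB insert, ...). Memory use stays flat regardless of the file count.
    for path in iter_files(root):
        process_file(path)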
I added an sqlite DB to keep track of the file csums fetched from the btrfs csum tree. It should improve performance. Could you please check the latest code from this repo? Thanks!
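For reference, a minimal sketch of the caching idea, assuming a filehash table with filename and hash columns (roughly what is queried later in this thread); the actual schema in dduper may differ:

import sqlite3

def open_db(path="dduper.db"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS filehash "
               "(filename TEXT PRIMARY KEY, hash TEXT)")
    return db

def cached_hash(db, filename, compute_hash):
    # Return the stored hash if this file was seen before; otherwise compute
    # it once, store it, and return it.
    row = db.execute("SELECT hash FROM filehash WHERE filename = ?",
                     (filename,)).fetchone()
    if row:
        return row[0]
    digest = compute_hash(filename)
    db.execute("INSERT INTO filehash (filename, hash) VALUES (?, ?)",
               (filename, digest))
    db.commit()
    return digest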
I believe this still did not solve the problem. After 18h24m8s of waiting I quit the process:
sudo ./dduper --device /dev/mapper/adama-docs --dir /home/Adama-docs --recurse
bedup scans this filesystem in a few minutes (after purging its cache).
$ sudo btrfs fi usage /home/Adama-docs/
Overall:
    Device size:         744.01GiB
    Device allocated:    744.01GiB
    Device unallocated:    1.00MiB
    Device missing:          0.00B
    Used:                624.19GiB
    Free (estimated):    118.17GiB  (min: 118.17GiB)
    Data ratio:               1.00
    Metadata ratio:           1.00
    Global reserve:      512.00MiB  (used: 0.00B)

Data,single: Size:740.00GiB, Used:621.83GiB (84.03%)
   /dev/mapper/adama-docs  740.00GiB

Metadata,single: Size:4.01GiB, Used:2.36GiB (58.95%)
   /dev/mapper/adama-docs    4.01GiB

System,single: Size:4.00MiB, Used:96.00KiB (2.34%)
   /dev/mapper/adama-docs    4.00MiB

Unallocated:
   /dev/mapper/adama-docs    1.00MiB
$ sudo btrfs fi du -s /home/Adama-docs/
     Total   Exclusive  Set shared  Filename
 660.36GiB   646.95GiB     6.70GiB  /home/Adama-docs/
@adamryczkowski could you please share the dduper.log file? It should be available in the directory where you ran this command.
It is very short. Here you go:
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
dduper.db is 986MB. That may be a problem. Do you want me to upload it?
Ah, okay. Looks like it never completed the DB-population phase. The dduper.db file doesn't have much info other than your filenames and sha256sums, so uploading it is not necessary. Do you have a large number of small files under /home/Adama-docs/?
Can you share the output of the command below:
sqlite3 dduper.db 'select count(*) from filehash;'
It should display the number of records (file entries) in the DB.
$ sqlite3 dduper.db 'select count(*) from filehash;'
341865
$ sudo find /home/Adama-docs -type f | wc -l
1501608
$ df -h /home/Adama-docs
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/adama-docs 745G 626G 118G 85% /home/Adama-docs