
insane fast is not very fast at all

Open SmartLayer opened this issue 4 years ago • 14 comments

When recursing into a directory of 6.2GB with 767 files in it, I thought the insanely fast mode would:

  1. compute a summary for each file from the csums of each block in the file;
  2. do a sort/uniq over these summaries.

Since the csums are already computed, this shouldn't take more than a minute on a modern computer. Instead, the process has been running for 30 minutes now and is already showing 2020 non-matching results.

SmartLayer avatar Mar 05 '21 09:03 SmartLayer

My guess is that instead of using a sorting algorithm to find rows of duplicated files, your code might have constructed an array of n²−n file pairs (where n is the number of files in the recursive directory) and checked every pair in the array, right?
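
For illustration, a rough sketch of the grouping approach I have in mind; file_csum() is a hypothetical callable standing in for however dduper derives a per-file checksum. Duplicates fall out of a single pass over a dict, with no pairwise comparisons:

import collections

def find_duplicate_groups(paths, file_csum):
    # Group files by checksum in one O(n) pass.
    # file_csum(path) is a hypothetical callable returning a per-file
    # checksum (e.g. derived from the btrfs block csums).
    groups = collections.defaultdict(list)
    for path in paths:
        groups[file_csum(path)].append(path)
    # Only groups with more than one member contain duplicates.
    return {csum: files for csum, files in groups.items() if len(files) > 1}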

SmartLayer avatar Mar 07 '21 04:03 SmartLayer

Yes, that's right. It compares files two at a time, since it doesn't store previously computed results on disk. "Insane" mode is fast for a real dedupe between two files (say, comparing two 100GB files), but it's slow with a large number of files. I'll update the tool to store previously computed results in a DB (sqlite); that should improve overall performance.
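
Roughly something like the sketch below; the table/column names and compute_csum are illustrative only, not dduper's actual schema or code:

import sqlite3

def open_cache(db_path="dduper_cache.db"):
    # One row per file: its path plus the checksum computed for it.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS filehash (path TEXT PRIMARY KEY, csum TEXT)")
    return con

def cached_csum(con, path, compute_csum):
    # Reuse a previously stored checksum; compute and store it otherwise.
    row = con.execute("SELECT csum FROM filehash WHERE path = ?", (path,)).fetchone()
    if row is not None:
        return row[0]
    csum = compute_csum(path)
    con.execute("INSERT INTO filehash (path, csum) VALUES (?, ?)", (path, csum))
    con.commit()
    return csum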

Lakshmipathi avatar Mar 08 '21 02:03 Lakshmipathi

Hi, how about this quick way to solve the problem?

First, bear with me while I explain how people deduplicated files before dedup tools were popular.

Deduplicating with existing Linux commands

Back in the 90s when Bojack Horseman was horsing around, people used to do this:

$ find dir/ -type f -exec md5sum '{}' ';' | sort | uniq -w 32 -D

This computes the md5 checksum of every file under the directory, sorts the output (so that duplicated files are clustered together), then runs uniq to find all lines whose first 32 characters are duplicated. The first 32 characters are the md5sum value; the rest is the file name.

Assumption

Let's assume that there is a quick way to get a file's fingerprint of some sort using btrfs.

New method

Now, given that sorting and duplicate detection are already available as Linux command-line tools, you only need to add a parameter such as --fingerprint, which outputs every file's fingerprint followed by the file's name. The fingerprint can be a 32-character value computed as a checksum over all csums of the file's blocks. A user can then use a pipe to get the duplicated items:

$ dduper --fingerprint --recursive --dir /folder | sort | uniq -w 32 -D

And you can put that info in the EXAMPLE section of your man page.
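
To make the idea concrete, here is a sketch of how such a fingerprint could be produced; get_block_csums(path) is a hypothetical helper standing in for however dduper fetches per-block csums from the btrfs csum tree:

import hashlib
import sys

def fingerprint(path, get_block_csums):
    # Hash the concatenation of all block csums into one 32-character value.
    # Each csum is assumed to be a bytes object.
    h = hashlib.md5()
    for csum in get_block_csums(path):
        h.update(csum)
    return h.hexdigest()  # 32 hex characters, ready for `uniq -w 32`

def print_fingerprints(paths, get_block_csums):
    for path in paths:
        # Same layout as md5sum output: fingerprint first, then the file name.
        sys.stdout.write(f"{fingerprint(path, get_block_csums)}  {path}\n")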

The insanely fast but inaccurate method already working today

The following method doesn't take advantage of btrfs at all, so a btrfs-aware tool should be able to outperform it.

There is a shortcut to get potential duplicates insanely fast by simply comparing file sizes (for mp4 video files larger than 1GB, files of equal size are usually duplicates):

$ du -ab /videos/ | sort -n | awk -F $'\t' '{printf("%16s\t%s\n", $1, $2)}' | uniq -w 16 -D

The additional awk in the command above merely pads the file size to a fixed width of 16 characters (16 characters is large enough for any possible video file size).

But there are video files of equal size with different content, so a hash should be computed to be sure.

SmartLayer avatar Mar 09 '21 04:03 SmartLayer

Off-topic: the server for one of my other projects, with more than 225,000 users, burned down :sob: http://community.webminal.org/t/webminal-org-down-status-update-thread/1481 I will get back to this issue soon.

Lakshmipathi avatar Mar 10 '21 18:03 Lakshmipathi

Would it be possible to perform block scanning and hashing in parallel, using a process pool or even Numba?
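
Something along these lines with just the stdlib, as a sketch; compute_csum stands in for whatever per-file work dduper actually does:

from concurrent.futures import ProcessPoolExecutor

def hash_in_parallel(paths, compute_csum, workers=4):
    # Compute per-file checksums in parallel worker processes.
    # compute_csum(path) is a placeholder for dduper's per-file work and
    # must be a module-level function so it can be pickled.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as paths.
        return dict(zip(paths, pool.map(compute_csum, paths)))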

ytrezq avatar Mar 11 '21 09:03 ytrezq

When recursing into a directory of 6.2GB with 767 files in it, I thought the insanely fast mode would:

1. compute a summary for each file from the csums of each block in the file;
2. do a sort/uniq over these summaries.

A missing point is the os.walk() step. If you have many more than 767 files, you will run into problems with the current implementation here: https://github.com/Lakshmipathi/dduper/blob/87450782721beb7efc585b0c6697dabd82c423fc/dduper#L427-L443

  1. Collecting all files may take a really long time before the actual dedupe work can start
  2. RAM usage will become a problem at some point, because the file list simply gets too big (a streaming alternative is sketched below)
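
A rough sketch of that streaming alternative: yield files one at a time instead of collecting them all up front, so memory use stays flat however many files there are:

import os

def iter_files(top):
    # Yield file paths lazily instead of building one huge list.
    for root, _dirs, names in os.walk(top):
        for name in names:
            yield os.path.join(root, name)

# Files can then be processed (or written to the DB in batches) as they
# are discovered, instead of waiting for the whole directory scan to finish.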

jedie avatar Apr 23 '21 17:04 jedie

Added an sqlite DB to keep track of file csums fetched from the btrfs csum-tree. It should improve performance. Could you please check the latest code from this repo? Thanks!

Lakshmipathi avatar May 19 '21 03:05 Lakshmipathi

I believe this still did not solve the problem. After 18h24m8s of waiting, I quit the process:

sudo ./dduper --device /dev/mapper/adama-docs --dir /home/Adama-docs --recurse

bedup scans this filesystem in a few minutes (after purging its cache).

$ sudo btrfs fi usage /home/Adama-docs/
Overall:
    Device size:		 744.01GiB
    Device allocated:		 744.01GiB
    Device unallocated:		   1.00MiB
    Device missing:		     0.00B
    Used:			 624.19GiB
    Free (estimated):		 118.17GiB	(min: 118.17GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,single: Size:740.00GiB, Used:621.83GiB (84.03%)
   /dev/mapper/adama-docs	 740.00GiB

Metadata,single: Size:4.01GiB, Used:2.36GiB (58.95%)
   /dev/mapper/adama-docs	   4.01GiB

System,single: Size:4.00MiB, Used:96.00KiB (2.34%)
   /dev/mapper/adama-docs	   4.00MiB

Unallocated:
   /dev/mapper/adama-docs	   1.00MiB
$ sudo btrfs fi du -s /home/Adama-docs/ 
     Total   Exclusive  Set shared  Filename
 660.36GiB   646.95GiB     6.70GiB  /home/Adama-docs/

adamryczkowski avatar Jun 02 '21 09:06 adamryczkowski

@adamryczkowski could you please share the dduper.log file? It should be in the directory where you ran the command.

Lakshmipathi avatar Jun 03 '21 03:06 Lakshmipathi

It is very short. Here you go:

DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads

dduper.db is 986MB. That may be a problem. Do you want me to upload it?

adamryczkowski avatar Jun 03 '21 05:06 adamryczkowski

Ah, okay. Looks like it never completed the DB-population phase. The dduper.db file doesn't contain much beyond file names and sha256sums, so uploading it isn't necessary. Do you have a large number of small files under /home/Adama-docs/?

Can you share the output of the command below:

sqlite3 dduper.db 'select count(*) from filehash;'

It should display the number of records (file entries) in the DB.

Lakshmipathi avatar Jun 03 '21 06:06 Lakshmipathi

$ sqlite3 dduper.db 'select count(*) from filehash;'
341865

adamryczkowski avatar Jun 04 '21 09:06 adamryczkowski

$ sudo find /home/Adama-docs -type f | wc -l
1501608

adamryczkowski avatar Jun 04 '21 09:06 adamryczkowski

$ df -h /home/Adama-docs
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/adama-docs  745G  626G  118G  85% /home/Adama-docs

adamryczkowski avatar Jun 04 '21 09:06 adamryczkowski