wrong duplicate images

Open lolongcovas opened this issue 1 year ago • 17 comments

I ran this tool over 999_999 images, and it found these duplicates: [image]. Visually they are not duplicates. Is something wrong in the mapping from filename to index?

lolongcovas avatar Aug 08 '22 07:08 lolongcovas

Hi @lolongcovas, we most likely have a bug in the latest released version, 0.125. Which version are you using? The bug happens when numerical errors occur for a few images and the image order then gets messed up. Did you get any printouts of the form "bug on XX actual XX pos XX dist"? Also, which OS are you on?

dbickson avatar Aug 08 '22 07:08 dbickson

p.s. Can you try to install version 0.123 and let me know if it works for you?

dbickson avatar Aug 08 '22 07:08 dbickson

Ubuntu 20.04,

Fastdup version:

In [16]: fastdup.__version__
Out[16]: '0.125'

lolongcovas avatar Aug 08 '22 08:08 lolongcovas

Hi @lolongcovas, I have just released v0.126 for Ubuntu 20.04. Can you try it out and let me know if it works? You can upgrade using: python3.8 -m pip install -U --force-reinstall fastdup

dbickson avatar Aug 08 '22 09:08 dbickson

With 0.123 it still gives these results: [image]

I am going to try v0.126.

lolongcovas avatar Aug 08 '22 09:08 lolongcovas

> Hi @lolongcovas, I have just released v0.126 for Ubuntu 20.04. Can you try it out and let me know if it works? You can upgrade using: python3.8 -m pip install -U --force-reinstall fastdup

Sorry, could you compile it for Ubuntu 18.04?

Thanks

lolongcovas avatar Aug 08 '22 09:08 lolongcovas

I just released v0.126 for Ubuntu 18.04 as well. Please try it out, and if it does not work, send me the output of your run with verbose=1. Also, let me know whether there are identical images further down in the HTML report, or whether everything is messed up.

dbickson avatar Aug 08 '22 10:08 dbickson

Still seeing the issue. Does fastdup guard against invalid images?

Replacing lower threshold 0.05 with position1899997 top_k.size() 1999996 loc pos: 0.8098 last pos: 00.95 1.9e+06
Total time took 5192278 ms
Found a total of 108835 fully identical images (d>0.990), which are 3.63 %
Found a total of 48956 nearly identical images(d>0.980), which are 1.63 %
Found a total of 1161555 above threshold images (d>0.900), which are 38.72 %
Found a total of 99997 outlier images         (d<0.050), which are 3.33 %
Min distance found 0.000 max distance 1.000

About the 999_998: might we have missed 1 image? And could that make everything fail?
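To make the concern concrete, here is a toy sketch of how one skipped image could shift every later index if results were mapped back against the wrong file list. It is only an illustration of the hypothesis, not a description of how fastdup stores its indices; the file names and the `label_pairs` helper are made up.

```python
def label_pairs(pairs, filenames):
    """Translate (index, index) pairs into (filename, filename) pairs."""
    return [(filenames[i], filenames[j]) for i, j in pairs]

# Hypothetical input list; one file fails to load and is skipped.
all_files = ["a.png", "b.png", "broken.png", "c.png", "d.png"]
processed = [f for f in all_files if f != "broken.png"]

# Assumption: the indices refer to positions in the processed list.
pairs = [(2, 3)]

print(label_pairs(pairs, processed))  # [('c.png', 'd.png')]
print(label_pairs(pairs, all_files))  # [('broken.png', 'c.png')]
```

Looking up the same pair in the unfiltered list returns different files, which is the kind of symptom that would show unrelated images as duplicates.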

lolongcovas avatar Aug 08 '22 13:08 lolongcovas

If there was a bad image, you should see a file named atrain_features.bad.csv under the work_dir listing the filenames of the bad images. I have no clue what is happening; I think you may have a numerical error. I need the full output. Did you see anything printed like "bug on...."? I will be happy to set up a Zoom session tomorrow to debug what is going on.
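A minimal sketch for checking that file from Python. The file name comes from the comment above; its column layout is not described in this thread, so the hypothetical helper `read_bad_images` simply returns the raw non-empty lines and does not parse them.

```python
from pathlib import Path

def read_bad_images(work_dir, name="atrain_features.bad.csv"):
    """Return the non-empty lines of the bad-image report, or [] if it is absent."""
    report = Path(work_dir) / name
    if not report.exists():
        return []
    # The column layout is unknown here, so lines are returned unparsed.
    with report.open(encoding="utf-8", errors="replace") as fh:
        return [line.strip() for line in fh if line.strip()]
```

An empty list means either that the file was not written or that it contains no entries.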

dbickson avatar Aug 08 '22 15:08 dbickson

The log is huge, and I didn't store it as a file. But the bug you mention is here:

KNN results                                                                                                                    
    0 : 1.00000    14 : 1.00000 814470 : 0.93696                                                                              
    1 : 1.00000 606001 : 0.90209 502470 : 0.88668                                                                             
    2 : 1.00000 955562 : 0.87797 271322 : 0.87653                                                                            
    3 : 1.00000 733775 : 0.97630 351683 : 0.95423                                                                             
    4 : 1.00000 139433 : 0.86121 586396 : 0.85914                                                                             
    5 : 1.00000 131656 : 0.91445 814949 : 0.91100                                                                             
    6 : 1.00000 408403 : 0.88667 948122 : 0.88571                                                                             
    7 : 1.00000 75745 : 0.94520 266015 : 0.94113
    8 : 1.00000 388340 : 0.88984 806809 : 0.88518
    9 : 1.00000 241051 : 0.83276 788561 : 0.82800
Bug on 104675 1 I 441151 actual 999999 pos 314026 k 3 dist -0.023188
Bug on 104675 2 I 677748 actual 999999 pos 314027 k 3 dist -0.029119
Bug on 214965 1 I 638478 actual 999999 pos 644896 k 3 dist -0.002828
Bug on 214965 2 I 939452 actual 999999 pos 644897 k 3 dist -0.003723
Bug on 249287 1 I 948374 actual 999999 pos 747862 k 3 dist -0.004687
Bug on 249287 2 I 750071 actual 999999 pos 747863 k 3 dist -0.006296
Bug on 394579 1 I 53821 actual 999999 pos 1183738 k 3 dist -0.001735
Bug on 394579 2 I 273531 actual 999999 pos 1183739 k 3 dist -0.002979
Bug on 397983 1 I 929964 actual 999999 pos 1193950 k 3 dist -0.064297
Bug on 397983 2 I 32925 actual 999999 pos 1193951 k 3 dist -0.064344
Bug on 936462 1 I 538900 actual 999999 pos 2809387 k 3 dist -0.048341
Bug on 936462 2 I 62374 actual 999999 pos 2809388 k 3 dist -0.058699

I also found that 1 image failed. Now I am recomputing for the rest.

lolongcovas avatar Aug 08 '22 16:08 lolongcovas

The "bug on" messages concern 6 images out of 1M which hit a numerical error and thus got distance = 0; you should ignore those. I am still not sure about the duplicates which got a value of 1 but are not identical. Can you send me the top 3 rows of images that are shown as duplicates but are not? I want to run the computation on my side. All the tests pass on my machines (Ubuntu 18 + 20 + Mac M1), and I have not been able to reproduce this error yet.
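To count how many distinct images such messages refer to in a saved log, a small parser along these lines may help. The group names in the pattern are guesses taken from the labels printed in the line itself ("actual", "pos", "k", "dist"); the thread does not document what each field means, and `affected_images` is a hypothetical helper.

```python
import re

# Group names are guesses based on the labels in the log line, not documented fields.
BUG_LINE = re.compile(
    r"Bug on (?P<image>\d+) (?P<slot>\d+) I (?P<neighbor>\d+) "
    r"actual (?P<total>\d+) pos (?P<pos>\d+) k (?P<k>\d+) dist (?P<dist>-?\d+\.\d+)"
)

def affected_images(log_text):
    """Return the sorted, de-duplicated image indices named in 'Bug on' lines."""
    return sorted({int(m.group("image")) for m in BUG_LINE.finditer(log_text)})

sample = """\
Bug on 104675 1 I 441151 actual 999999 pos 314026 k 3 dist -0.023188
Bug on 104675 2 I 677748 actual 999999 pos 314027 k 3 dist -0.029119
Bug on 214965 1 I 638478 actual 999999 pos 644896 k 3 dist -0.002828
"""
print(affected_images(sample))  # [104675, 214965]
```

Each image can appear on more than one line, so the count of distinct indices is smaller than the number of "Bug on" lines.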

dbickson avatar Aug 08 '22 16:08 dbickson

After removing that bad image, I got these duplicates: [image]. I still have the issue. The log:

FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
Going to loop over dir /tmp/zara.txt                                  
Found total 999998 images to run on                                   
libpng warning: sRGB: out of place                 ] 31% Estimated: 66 Minutes
libpng warning: sRGB: out of place                 ] 42% Estimated: 54 Minutes
Wrote total of 999998 features , found 0 bad images] 100% Estimated: 0 Minutes
Found total 999998 images to run on                                   
                                                                      
111037) Finished write_index() NN model                               
Stored nn model index file ../fastdup/out/nnf.index                   
Bug on 388602 1 I 169647 actual 999998 pos 1165807 k 3 dist -0.132486 
Bug on 388602 2 I 255365 actual 999998 pos 1165808 k 3 dist -0.138604 
Bug on 390441 1 I 774903 actual 999998 pos 1171324 k 3 dist -0.017260 
Bug on 390441 2 I 161661 actual 999998 pos 1171325 k 3 dist -0.017907 
Bug on 568423 1 I 659105 actual 999998 pos 1705270 k 3 dist -0.044372 
Bug on 568423 2 I 755145 actual 999998 pos 1705271 k 3 dist -0.044996 
Bug on 596245 2 I 150585 actual 999998 pos 1788737 k 3 dist -0.000455 
Bug on 779143 1 I 2287 actual 999998 pos 2337430 k 3 dist -0.058276
Bug on 779143 2 I 62638 actual 999998 pos 2337431 k 3 dist -0.058675
Bug on 946843 1 I 197592 actual 999998 pos 2840530 k 3 dist -0.047367    
Bug on 946843 2 I 474090 actual 999998 pos 2840531 k 3 dist -0.049710    
1659980440 : INFO:     (add_vertices:460): Num vertices for group 0: 999998 
1659980440 : INFO:     (commit_edge_buffer:609): In commit edge buffer (0,0)
1659980440 : INFO:     (commit_edge_buffer:680): Shuffling edges ...
1659980440 : INFO:     (commit_edge_buffer:688): Done shuffling edges in 0.04235 secs
1659980440 : INFO:     (commit_edge_buffer:692): Aggregating unique vertices...
1659980440 : INFO:     (commit_edge_buffer:705): Done aggregating unique vertex in 0.022127 secs
1659980440 : INFO:     (commit_edge_buffer:713): Combine vertex data
1659980440 : INFO:     (commit_edge_buffer:779): Done phase 2 in 0.062082 secs                         
1659980440 : INFO:     (commit_edge_buffer:787): Rename id columns 
1659980440 : INFO:     (commit_edge_buffer:890): Done in 0.131735 secs                                 
1659980440 : INFO:     (commit_edge_buffer:892): Finish committing edge in 0.258471 secs
1659980440 : INFO:     (add_edges:584): Num vertices for group 0: 999998
Num vertices for group 0: 999998
Num edges 0 -> 0: 117374
1659980441 : PROGRESS: (_p:516): +-----------------------------+
1659980441 : PROGRESS: (_p:516): | Number of components merged |
1659980441 : PROGRESS: (_p:516): +-----------------------------+
1659980441 : PROGRESS: (_p:516): | 72425                       |
1659980441 : PROGRESS: (_p:516): | 0                           |
1659980441 : PROGRESS: (_p:516): +-----------------------------+
1659980442 : PROGRESS: (triple_apply_pagerank:69): Counting out degree
1659980442 : PROGRESS: (triple_apply_pagerank:78): Done counting out degree
1659980442 : PROGRESS: (_p:516): +-----------+-----------------------+
1659980442 : PROGRESS: (_p:516): | Iteration | L1 change in pagerank |
1659980442 : PROGRESS: (_p:516): +-----------+-----------------------+
1659980442 : PROGRESS: (_p:516): | 1         | 771632                |
1659980442 : PROGRESS: (_p:516): | 2         | 4258.05               |
1659980442 : PROGRESS: (_p:516): | 3         | 2664.38               |
1659980442 : PROGRESS: (_p:516): | 4         | 1918.96               |
1659980442 : PROGRESS: (_p:516): | 5         | 1438.16               |
1659980442 : PROGRESS: (_p:516): | 6         | 1133.33               |
1659980442 : PROGRESS: (_p:516): | 7         | 907.308               |
1659980442 : PROGRESS: (_p:516): | 8         | 737.427               |
1659980442 : PROGRESS: (_p:516): | 9         | 603.873               |
1659980442 : PROGRESS: (_p:516): | 10        | 499.224               |
1659980442 : PROGRESS: (_p:516): | 11        | 414.308               |
1659980442 : PROGRESS: (_p:516): | 12        | 346.034               |
1659980442 : PROGRESS: (_p:516): | 13        | 289.312               |
1659980442 : PROGRESS: (_p:516): | 14        | 242.876               |
1659980442 : PROGRESS: (_p:516): | 15        | 203.985               |
1659980442 : PROGRESS: (_p:516): | 16        | 171.77                |
1659980443 : PROGRESS: (_p:516): | 17        | 144.69                |
1659980443 : PROGRESS: (_p:516): | 18        | 122.08                |
1659980443 : PROGRESS: (_p:516): | 19        | 103.042               |
1659980443 : PROGRESS: (_p:516): | 20        | 87.0725               |
1659980443 : PROGRESS: (_p:516): +-----------+-----------------------+
Wrote total of 999998 components
Total time took 5673428 ms
Found a total of 66513 fully identical images (d>0.990), which are 2.22 %
Found a total of 49974 nearly identical images(d>0.980), which are 1.67 %
Found a total of 1162877 above threshold images (d>0.900), which are 38.76 %
Found a total of 99999 outlier images         (d<0.050), which are 3.33 %
Min distance found 0.001 max distance 1.000
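A side note on reading the summary: the percentages appear to be taken relative to image-neighbor pairs (images times k) rather than to the number of images. This is an inference from the arithmetic only, with k = 3 taken from the "k 3" field in the "Bug on" lines; it is not confirmed anywhere in this thread. A quick check, using a hypothetical helper:

```python
def share_of_pairs(count, num_images, k):
    """Percentage of `count` relative to num_images * k pairs, rounded to 2 places."""
    return round(100.0 * count / (num_images * k), 2)

num_images, k = 999998, 3  # k = 3 is an assumption read off the "Bug on" lines
summary = [
    ("fully identical", 66513),
    ("nearly identical", 49974),
    ("above threshold", 1162877),
    ("outlier", 99999),
]
for label, count in summary:
    print(label, share_of_pairs(count, num_images, k))
```

Under that assumption the four values come out as 2.22, 1.67, 38.76 and 3.33, which matches the log.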

lolongcovas avatar Aug 08 '22 17:08 lolongcovas

Hi @lolongcovas, we just released another version, 0.127, for Ubuntu 18. Please try again and let us know if this works!

dbickson avatar Aug 08 '22 19:08 dbickson

Now it worked with 0.127. Thanks!

lolongcovas avatar Aug 09 '22 07:08 lolongcovas

Hi, sorry again. At position 187 I found wrong duplicates: [image]. However, 184 and 189 are OK.

lolongcovas avatar Aug 09 '22 07:08 lolongcovas

This is totally strange and I can't reproduce it on my side. Are you open to setting up a Zoom meeting or communicating in the Slack channel so we can try to reproduce it together? Once I reproduce the problem, I am sure I can solve it. But it is hard to reproduce without having the data.

dbickson avatar Aug 09 '22 17:08 dbickson

Hi @lolongcovas, we have released version 0.130, which tries to fix the issue observed. Please try it out.

dbickson avatar Aug 12 '22 04:08 dbickson

Solved.

dbickson avatar Aug 30 '22 11:08 dbickson