fastdup
wrong duplicate images
I ran this tool over 999_999 images, and it found these duplicates:
Visually they are not duplicates. Is something wrong in the mapping from filename to index?
Hi @lolongcovas, we most likely have a bug in the latest released version, 0.125. Which version are you using? The bug happens when there are numerical errors for a few images, after which the image order gets messed up. Did you get any printouts of the type "bug on XX actual XX pos XX dist"? Also, which OS are you on?
P.S. Can you try installing version 0.123 and let me know if it works for you?
Ubuntu 20.04,
Fastdup version:
In [16]: fastdup.__version__
Out[16]: '0.125'
Hi @lolongcovas, I have just released v0.126 for Ubuntu 20.04. Can you try it out and let me know if it works? You can upgrade using python3.8 -m pip install -U --force-reinstall fastdup
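Before re-running, it can help to confirm which version actually got installed. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `installed_version`, `needs_upgrade`) are made up for illustration, and it assumes version strings are purely numeric like '0.125'.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.125' into (0, 125) so versions compare numerically.

    Assumes purely numeric components; a suffix such as '.post1' would raise.
    """
    return tuple(int(part) for part in text.split("."))


def installed_version(package="fastdup"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_upgrade(current, minimum="0.126"):
    """True if nothing is installed or the installed version is older than minimum."""
    return current is None or parse_version(current) < parse_version(minimum)


if __name__ == "__main__":
    current = installed_version()
    print("installed:", current, "needs upgrade:", needs_upgrade(current))
```

Comparing tuples of integers rather than raw strings avoids ordering surprises when the number of digits differs between releases.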
With 0.123 I still get these results.
I am going to try v0.126.
Sorry, could you compile it for Ubuntu 18.04?
Thanks
I just released v0.126 for Ubuntu 18.04 as well. Please try it out, and if it does not work, send me the output of a run with verbose=1. Also, let me know whether there are identical images lower down in the HTML, or whether everything is messed up.
Still seeing the issue. Does fastdup filter out invalid images?
Replacing lower threshold 0.05 with position1899997 top_k.size() 1999996 loc pos: 0.8098 last pos: 00.95 1.9e+06
Total time took 5192278 ms
Found a total of 108835 fully identical images (d>0.990), which are 3.63 %
Found a total of 48956 nearly identical images(d>0.980), which are 1.63 %
Found a total of 1161555 above threshold images (d>0.900), which are 38.72 %
Found a total of 99997 outlier images (d<0.050), which are 3.33 %
Min distance found 0.000 max distance 1.000
About the 999_998: might we have missed 1 image? And would that make everything fail?
If there was a bad image, you should see a file named atrain_features.bad.csv under the work_dir listing the bad image filenames. I have no clue what is happening. I think you may have a numerical error, but I need the full output. Did you see something printed like "bug on...."? I will be happy to set up a Zoom session tomorrow to debug what is going on.
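A quick way to check for that file is sketched below. The filename comes from the reply above; the column layout of the file is not described in this thread, so the sketch just returns the raw CSV rows. The function name `read_bad_images` is made up for illustration.

```python
import csv
from pathlib import Path


def read_bad_images(work_dir, name="atrain_features.bad.csv"):
    """Return the non-empty rows of the bad-images file, or [] if it is absent.

    The column layout is an unknown here, so rows are returned unparsed.
    """
    path = Path(work_dir) / name
    if not path.exists():
        return []
    with path.open(newline="") as handle:
        return [row for row in csv.reader(handle) if row]


if __name__ == "__main__":
    # Hypothetical work_dir; replace with the directory passed to fastdup.
    rows = read_bad_images("out")
    print(f"{len(rows)} bad image row(s) found")
```

An empty result means either that no image was rejected or that the file was written somewhere else, so it is worth double-checking the work_dir path.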
The log is huge and I didn't store it as a file. But the bug you mention is here:
KNN results
0 : 1.00000 14 : 1.00000 814470 : 0.93696
1 : 1.00000 606001 : 0.90209 502470 : 0.88668
2 : 1.00000 955562 : 0.87797 271322 : 0.87653
3 : 1.00000 733775 : 0.97630 351683 : 0.95423
4 : 1.00000 139433 : 0.86121 586396 : 0.85914
5 : 1.00000 131656 : 0.91445 814949 : 0.91100
6 : 1.00000 408403 : 0.88667 948122 : 0.88571
7 : 1.00000 75745 : 0.94520 266015 : 0.94113
8 : 1.00000 388340 : 0.88984 806809 : 0.88518
9 : 1.00000 241051 : 0.83276 788561 : 0.82800
Bug on 104675 1 I 441151 actual 999999 pos 314026 k 3 dist -0.023188
Bug on 104675 2 I 677748 actual 999999 pos 314027 k 3 dist -0.029119
Bug on 214965 1 I 638478 actual 999999 pos 644896 k 3 dist -0.002828
Bug on 214965 2 I 939452 actual 999999 pos 644897 k 3 dist -0.003723
Bug on 249287 1 I 948374 actual 999999 pos 747862 k 3 dist -0.004687
Bug on 249287 2 I 750071 actual 999999 pos 747863 k 3 dist -0.006296
Bug on 394579 1 I 53821 actual 999999 pos 1183738 k 3 dist -0.001735
Bug on 394579 2 I 273531 actual 999999 pos 1183739 k 3 dist -0.002979
Bug on 397983 1 I 929964 actual 999999 pos 1193950 k 3 dist -0.064297
Bug on 397983 2 I 32925 actual 999999 pos 1193951 k 3 dist -0.064344
Bug on 936462 1 I 538900 actual 999999 pos 2809387 k 3 dist -0.048341
Bug on 936462 2 I 62374 actual 999999 pos 2809388 k 3 dist -0.058699
Also, I found that 1 image failed. Now I am recomputing on the rest.
The "bug on" messages concern 6 images out of 1M that hit a numerical error and thus got distance = 0; those you should ignore. I am still not sure about the duplicates that got distance 1 but are not identical. Can you send me the top 3 rows of images that are shown as duplicates but are not? I want to run the computation on my side. All the tests pass on my machines (Ubuntu 18 + 20 + Mac M1), and I have not been able to reproduce this error yet.
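To count the affected images without reading the log by eye, the "Bug on" lines can be pulled out with a regular expression. This is a rough sketch: the field names (`image`, `rank`, `neighbor`, `total`) are guesses based on the pasted output, not documented names. In the sample lines, `pos` happens to equal `image * k + rank`, which is an observation from this log only.

```python
import re

# Field names are assumptions inferred from the pasted log, not fastdup's own.
BUG_LINE = re.compile(
    r"Bug on (?P<image>\d+) (?P<rank>\d+) I (?P<neighbor>\d+) "
    r"actual (?P<total>\d+) pos (?P<pos>\d+) k (?P<k>\d+) "
    r"dist (?P<dist>-?\d+\.\d+)"
)


def parse_bug_lines(log_text):
    """Return one dict per 'Bug on' line found in the log text."""
    records = []
    for match in BUG_LINE.finditer(log_text):
        record = {
            name: int(value)
            for name, value in match.groupdict().items()
            if name != "dist"
        }
        record["dist"] = float(match.group("dist"))
        records.append(record)
    return records


def affected_images(records):
    """Sorted, de-duplicated list of the first index on each 'Bug on' line."""
    return sorted({record["image"] for record in records})


# Four lines copied from the log above.
SAMPLE = """\
Bug on 104675 1 I 441151 actual 999999 pos 314026 k 3 dist -0.023188
Bug on 104675 2 I 677748 actual 999999 pos 314027 k 3 dist -0.029119
Bug on 214965 1 I 638478 actual 999999 pos 644896 k 3 dist -0.002828
Bug on 214965 2 I 939452 actual 999999 pos 644897 k 3 dist -0.003723
"""

if __name__ == "__main__":
    print(affected_images(parse_bug_lines(SAMPLE)))
```

Running the same function over the full log should give the short list of indices to exclude when inspecting the duplicate report.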
After removing that bad image, I got these duplicates:
Still have the issue.
The log:
FastDup Software, (C) copyright 2022 Dr. Amir Alush and Dr. Danny Bickson.
Going to loop over dir /tmp/zara.txt
Found total 999998 images to run on
libpng warning: sRGB: out of place ] 31% Estimated: 66 Minutes
libpng warning: sRGB: out of place ] 42% Estimated: 54 Minutes
Wrote total of 999998 features , found 0 bad images] 100% Estimated: 0 Minutes
Found total 999998 images to run on
111037) Finished write_index() NN model
Stored nn model index file ../fastdup/out/nnf.index
Bug on 388602 1 I 169647 actual 999998 pos 1165807 k 3 dist -0.132486
Bug on 388602 2 I 255365 actual 999998 pos 1165808 k 3 dist -0.138604
Bug on 390441 1 I 774903 actual 999998 pos 1171324 k 3 dist -0.017260
Bug on 390441 2 I 161661 actual 999998 pos 1171325 k 3 dist -0.017907
Bug on 568423 1 I 659105 actual 999998 pos 1705270 k 3 dist -0.044372
Bug on 568423 2 I 755145 actual 999998 pos 1705271 k 3 dist -0.044996
Bug on 596245 2 I 150585 actual 999998 pos 1788737 k 3 dist -0.000455
Bug on 779143 1 I 2287 actual 999998 pos 2337430 k 3 dist -0.058276
Bug on 779143 2 I 62638 actual 999998 pos 2337431 k 3 dist -0.058675
Bug on 946843 1 I 197592 actual 999998 pos 2840530 k 3 dist -0.047367
Bug on 946843 2 I 474090 actual 999998 pos 2840531 k 3 dist -0.049710
1659980440 : INFO: (add_vertices:460): Num vertices for group 0: 999998
1659980440 : INFO: (commit_edge_buffer:609): In commit edge buffer (0,0)
1659980440 : INFO: (commit_edge_buffer:680): Shuffling edges ...
1659980440 : INFO: (commit_edge_buffer:688): Done shuffling edges in 0.04235 secs
1659980440 : INFO: (commit_edge_buffer:692): Aggregating unique vertices...
1659980440 : INFO: (commit_edge_buffer:705): Done aggregating unique vertex in 0.022127 secs
1659980440 : INFO: (commit_edge_buffer:713): Combine vertex data
1659980440 : INFO: (commit_edge_buffer:779): Done phase 2 in 0.062082 secs
1659980440 : INFO: (commit_edge_buffer:787): Rename id columns
1659980440 : INFO: (commit_edge_buffer:890): Done in 0.131735 secs
1659980440 : INFO: (commit_edge_buffer:892): Finish committing edge in 0.258471 secs
1659980440 : INFO: (add_edges:584): Num vertices for group 0: 999998
Num vertices for group 0: 999998
Num edges 0 -> 0: 117374
1659980441 : PROGRESS: (_p:516): +-----------------------------+
1659980441 : PROGRESS: (_p:516): | Number of components merged |
1659980441 : PROGRESS: (_p:516): +-----------------------------+
1659980441 : PROGRESS: (_p:516): | 72425 |
1659980441 : PROGRESS: (_p:516): | 0 |
1659980441 : PROGRESS: (_p:516): +-----------------------------+
1659980442 : PROGRESS: (triple_apply_pagerank:69): Counting out degree
1659980442 : PROGRESS: (triple_apply_pagerank:78): Done counting out degree
1659980442 : PROGRESS: (_p:516): +-----------+-----------------------+
1659980442 : PROGRESS: (_p:516): | Iteration | L1 change in pagerank |
1659980442 : PROGRESS: (_p:516): +-----------+-----------------------+
1659980442 : PROGRESS: (_p:516): | 1 | 771632 |
1659980442 : PROGRESS: (_p:516): | 2 | 4258.05 |
1659980442 : PROGRESS: (_p:516): | 3 | 2664.38 |
1659980442 : PROGRESS: (_p:516): | 4 | 1918.96 |
1659980442 : PROGRESS: (_p:516): | 5 | 1438.16 |
1659980442 : PROGRESS: (_p:516): | 6 | 1133.33 |
1659980442 : PROGRESS: (_p:516): | 7 | 907.308 |
1659980442 : PROGRESS: (_p:516): | 8 | 737.427 |
1659980442 : PROGRESS: (_p:516): | 9 | 603.873 |
1659980442 : PROGRESS: (_p:516): | 10 | 499.224 |
1659980442 : PROGRESS: (_p:516): | 11 | 414.308 |
1659980442 : PROGRESS: (_p:516): | 12 | 346.034 |
1659980442 : PROGRESS: (_p:516): | 13 | 289.312 |
1659980442 : PROGRESS: (_p:516): | 14 | 242.876 |
1659980442 : PROGRESS: (_p:516): | 15 | 203.985 |
1659980442 : PROGRESS: (_p:516): | 16 | 171.77 |
1659980443 : PROGRESS: (_p:516): | 17 | 144.69 |
1659980443 : PROGRESS: (_p:516): | 18 | 122.08 |
1659980443 : PROGRESS: (_p:516): | 19 | 103.042 |
1659980443 : PROGRESS: (_p:516): | 20 | 87.0725 |
1659980443 : PROGRESS: (_p:516): +-----------+-----------------------+
Wrote total of 999998 components
Total time took 5673428 ms
Found a total of 66513 fully identical images (d>0.990), which are 2.22 %
Found a total of 49974 nearly identical images(d>0.980), which are 1.67 %
Found a total of 1162877 above threshold images (d>0.900), which are 38.76 %
Found a total of 99999 outlier images (d<0.050), which are 3.33 %
Min distance found 0.001 max distance 1.000
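A side note on reading the summary lines: the percentages do not match a denominator of 999998 images. They do match 999998 × 3, taking 3 from the "k 3" field in the "Bug on" lines, so the counts appear to be nearest-neighbor pairs rather than distinct images. That reading is an inference from the numbers in this log, not something confirmed in the thread. A small check, with a made-up helper name:

```python
def percent_of_pairs(count, num_images, k):
    """Percentage of count relative to num_images * k neighbor pairs."""
    return round(100.0 * count / (num_images * k), 2)


# Counts and reported percentages copied from the log above.
REPORTED = {
    "fully identical": (66513, 2.22),
    "nearly identical": (49974, 1.67),
    "above threshold": (1162877, 38.76),
    "outliers": (99999, 3.33),
}

if __name__ == "__main__":
    for label, (count, reported) in REPORTED.items():
        computed = percent_of_pairs(count, num_images=999998, k=3)
        print(f"{label}: computed {computed} %, reported {reported} %")
```

If that reading is right, "66513 fully identical images" would be a count of image pairs at d>0.990, not 66513 distinct files.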
Hi @lolongcovas, we just released another version, 0.127, for Ubuntu 18. Please try again and let us know if it works!
Now with 0.127 it worked. Thanks!
Hi, sorry again. At position 187 I found wrong duplicates:
However, 184 and 189 are OK.
This is totally strange and I can't reproduce it on my side. Are you open to setting up a Zoom meeting or talking in the Slack channel so we can try to reproduce it together? Once I reproduce the problem I am sure I can solve it, but it is hard to reproduce without the data.
Hi @lolongcovas, we have released version 0.130, which tries to fix the issue observed. Please try it out.
Solved.