Meaning of the extra_constant value in the tlshCluster search

Open TheNha opened this issue 4 years ago • 1 comments

Hi. I'm reading the tlshCluster code that you published recently. I don't understand the extra_constant value used in the function VPTSearch in the file hac_lib.py. Could you explain what this value does? Thank you very much.

TheNha, Oct 15 '21 02:10

Might this be related to #130? I was looking into vantage point trees and trying to understand how they work [1] [2] [3]. When testing, I found that the tree sometimes did not return the nearest object if I lowered extra_constant. If I increased it instead, the search performed more comparisons. As I understand it, the value acts as an error margin on the pruning decision, and 20 might be an experimentally determined optimum. It could be related to the text length difference penalty that is also included in the distance score.

[1] http://stevehanov.ca/blog/index.php?id=130
[2] https://fribbels.github.io/vptree/writeup
[3] http://pnylab.com/papers/vptree/main.html

Here is some example code: https://gist.github.com/Querela/d34d76bf090863418168527bc5aba3ff (NOTE: I cleaned it up because it contained a lot of unrelated code, but I did not run it again afterwards. It might be missing some imports; just write to me if so. You can simply try out different values if you run it in an interactive shell.)
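To illustrate the error-margin reading described above, here is a minimal, self-contained sketch of a vantage point tree whose pruning test is widened by a margin. It is not the code from hac_lib.py or from the gist: the names `Node`, `build`, `nearest`, and `slack` are hypothetical, with `slack` standing in for extra_constant, and the toy distance table is made up so that it breaks the triangle inequality on purpose. Whether the real TLSH distance misbehaves in the same way is an assumption here, not something this sketch verifies.

```python
import statistics
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Node:
    point: Any
    threshold: float = 0.0
    inside: Optional["Node"] = None   # points with dist(vp, p) <= threshold
    outside: Optional["Node"] = None  # points with dist(vp, p) > threshold


def build(points, dist):
    """Build a VP tree; the first point of each list is the vantage point."""
    if not points:
        return None
    vp, rest = points[0], points[1:]
    node = Node(vp)
    if rest:
        dists = [dist(vp, p) for p in rest]
        node.threshold = statistics.median(dists)
        near = [p for p, d in zip(rest, dists) if d <= node.threshold]
        far = [p for p, d in zip(rest, dists) if d > node.threshold]
        node.inside = build(near, dist)
        node.outside = build(far, dist)
    return node


def nearest(root, query, dist, slack=0.0):
    """Return (point, distance, comparisons). `slack` widens the pruning test."""
    best = [float("inf"), None]
    comparisons = 0

    def visit(node):
        nonlocal comparisons
        if node is None:
            return
        d = dist(query, node.point)
        comparisons += 1
        if d < best[0]:
            best[0], best[1] = d, node.point
        if d <= node.threshold:
            visit(node.inside)
            # Skip the far side only if it cannot hold a closer point,
            # even after allowing `slack` of triangle-inequality error.
            if d + best[0] + slack >= node.threshold:
                visit(node.outside)
        else:
            visit(node.outside)
            if d - best[0] - slack <= node.threshold:
                visit(node.inside)

    visit(root)
    return best[1], best[0], comparisons


# Made-up distances that violate the triangle inequality:
# d(A, C) = 30, but d(A, q) + d(q, C) = 5 + 1 = 6.
TABLE = {frozenset("AB"): 10, frozenset("AC"): 30,
         frozenset("qA"): 5, frozenset("qB"): 4, frozenset("qC"): 1}


def toy_dist(x, y):
    return 0 if x == y else TABLE[frozenset((x, y))]


toy_tree = build(["A", "B", "C"], toy_dist)
print(nearest(toy_tree, "q", toy_dist, slack=0))   # ('B', 4, 2): C is pruned and missed
print(nearest(toy_tree, "q", toy_dist, slack=20))  # ('C', 1, 3): one more comparison
```

With a distance that obeys the triangle inequality, `slack=0` already returns the true nearest neighbour, and a larger value only costs extra comparisons. The margin matters only when the distance can violate the inequality by up to roughly that amount, which would match the observation that lowering the constant sometimes loses the nearest object.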

Querela, Jun 07 '23 00:06