dejavu
Ranking algorithm not working properly for files of different sizes
Let's say I have the following original tracks:
Track A: 5 hours
Track B: 3 minutes
Now I try to match 10 seconds of audio from Track B. The current ranking algorithm favours Track A:
{'song_id': 144, 'song_name': 'TrackA.wav', 'input_total_hashes': 406, 'fingerprinted_hashes_in_db': 1, 'hashes_matched_in_input': 1621, 'input_confidence': 3.99, 'fingerprinted_confidence': 1621.0, 'offset': 719479, 'offset_seconds': 33412.5395, 'file_sha1': 'A64696103620CAD306B320F64CED8749033B84F9', 'length': 11543}
As you can see, there is an input confidence of about 4, which means that on average each input hash matched 4 times here. Because the file is huge, it is very likely that any given fingerprint will match somewhere in it (at least once), which distorts the ranking.
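To make the number concrete, here is a minimal sketch of where the 3.99 comes from. It assumes input_confidence is simply hashes_matched_in_input divided by input_total_hashes, which is consistent with the values in the result above; the function name is only illustrative and not taken from dejavu.

```python
def input_confidence(hashes_matched_in_input, input_total_hashes):
    # Assumed definition: matched hashes per query hash, rounded to 2 decimals.
    return round(hashes_matched_in_input / input_total_hashes, 2)


# Values copied from the result for TrackA.wav shown above.
confidence = input_confidence(1621, 406)
print(confidence)
```

A ratio above 1 is only possible if the same query hash is counted several times, which is what you would expect when a 10-second clip is compared against a 5-hour file.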
Suggestion:
https://github.com/worldveil/dejavu/blob/e56a4a221ad204654a191d217f92aebf3f058b62/dejavu/init.py#L197
In this line there is an argument that is being completely ignored: aligned_matches. I think that aligned_matches should play a major role in the ranking.
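A rough sketch of what I mean, not dejavu's actual code: rank candidates by how many matches agree on the same time offset instead of by the raw match count. The helper names, the (db_offset, query_offset) pair format and the toy data are all assumptions made up for illustration.

```python
from collections import Counter


def aligned_count(matches):
    """Size of the largest group of matches sharing one offset difference.

    matches: list of (db_offset, query_offset) pairs for one song
    (hypothetical format, chosen for this example).
    """
    if not matches:
        return 0
    diffs = Counter(db_offset - query_offset for db_offset, query_offset in matches)
    return diffs.most_common(1)[0][1]


def rank_by_alignment(candidates):
    """candidates: dict mapping song name -> list of match pairs."""
    scores = {name: aligned_count(m) for name, m in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: the long track has many matches, but they are scattered over
# unrelated offsets; the short track has fewer matches that all line up.
track_a = [(i * 1000, i % 50) for i in range(200)]
track_b = [(100 + q, q) for q in range(40)]

ranking = rank_by_alignment({"TrackA.wav": track_a, "TrackB.wav": track_b})
print(ranking)
```

With a score like this, Track A's 200 scattered matches count for very little, while Track B's 40 consistent matches put it first, even though Track A has the higher raw match count.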