modAL icon indicating copy to clipboard operation
modAL copied to clipboard

Compute pairwise distances only once instead of each turn.

Open AlexandreAbraham opened this issue 5 years ago • 3 comments

This is a proposition of optimization for the ranked batch method. In the ranked batch methods, the distance between all unlabeled samples is computed at each iteration which is computationally prohibitive on large datasets. This modification simply update the minimum distance vector with each selected sample. Note that I do not remove the labelled sample from the vector to avoid memory reallocation and unnecessary complexity. However, one sample will not be selected twice as the distance to itself is 0.

I observe a difference in the performance on the ranked_batch example though.

Here is the performance history given master:

[0.3333333333333333, 0.8933333333333333, 0.84, 0.9266666666666666, 0.9333333333333333, 0.9466666666666667, 0.9533333333333334]

Here is the performance history on my branch: [0.3333333333333333, 0.8733333333333333, 0.8733333333333333, 0.9333333333333333, 0.9333333333333333, 0.9466666666666667, 0.9733333333333334]

I am still investigating this. Any insight is welcome.

AlexandreAbraham avatar Jan 29 '20 09:01 AlexandreAbraham

Hi! Thanks for the PR! I'll try to remove it soon and get back to you.

cosmic-cortex avatar Feb 02 '20 09:02 cosmic-cortex

Just a quick update, I am still working on reviewing it, not sure if I understand everything correctly, so I need to spend more time with it.

I have also noticed that in my previous comment, I wrote "I'll try to remove it soon" :D I actually meant to say I'll try to review it soon. Sorry for the potential confusion :)

cosmic-cortex avatar Feb 09 '20 14:02 cosmic-cortex

Hey, No problem, let me know of I can add comments to make it more readable.

AlexandreAbraham avatar Feb 10 '20 10:02 AlexandreAbraham