pynndescent icon indicating copy to clipboard operation
pynndescent copied to clipboard

pynndescent is changing the input data, is this expecte?

Open troilus-canva opened this issue 3 years ago • 7 comments

It seems if the metirc is dot, the input data will be normalized, is this expected? https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/pynndescent_.py#L743

troilus-canva avatar Dec 10 '21 03:12 troilus-canva

Can confirm–this happens if the metric is dot and the input data is np.float32. If it's different dtype, the check_array call on L684 has already made a copy of the input. This is definitely a bug.

jamestwebber avatar Jan 03 '22 01:01 jamestwebber

I have fixed the aspect that @jamestwebber pointed out -- we shouldn't alter the users data, that's definitely a bug.

With regard to "dot" as a metric -- it is essentially assuming you really want something like an angular or cosine distance, and, indeed, the actual cosine distance exploits this and uses dot under the hood. Is there a use case for dot product distance measures with non-normalized data? We can always add a different name to handle that.

lmcinnes avatar Jan 05 '22 17:01 lmcinnes

I have fixed the aspect that @jamestwebber pointed out -- we shouldn't alter the users data, that's definitely a bug.

With regard to "dot" as a metric -- it is essentially assuming you really want something like an angular or cosine distance, and, indeed, the actual cosine distance exploits this and uses dot under the hood. Is there a use case for dot product distance measures with non-normalized data? We can always add a different name to handle that.

I guess there are use cases of dot product with non-normalized data, and the "dot" name itself is not implicating there is a normalization process there.

troilus-canva avatar Jan 05 '22 23:01 troilus-canva

Is there a use case for dot product distance measures with non-normalized data?

Personally I have no idea, but this is probably a separate issue from the bug that you fixed.

jamestwebber avatar Jan 05 '22 23:01 jamestwebber

@troilus-canva : That makes sense I guess; could you suggest a naming convention you would like>

lmcinnes avatar Jan 06 '22 17:01 lmcinnes

@troilus-canva : That makes sense I guess; could you suggest a naming convention you would like>

normalize_dot?

troilus-canva avatar Jan 07 '22 00:01 troilus-canva

Was a true un-normalized dot product metric ever added? This would be useful in cases where the original embedding was created with this in mind (for example, a matrix factorization).

ravimody avatar Mar 22 '24 03:03 ravimody