
NGT performance

Open VarIr opened this issue 6 years ago • 7 comments

Approx. neighbor search with ngtpy can be accelerated:

  • [x] Enable AVX on macOS (temporarily disabled due to an upstream bug in NGT; it is already enabled on Linux).
  • [x] Use NGT's optimization step (until then, the method is effectively (P)ANNG, not ONNG, I assume). Currently, this seems to be possible only via the command-line tools, not via the Python API (see the sketch below this list).
  • [ ] Set good default parameters for ONNG
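
For reference, here is a minimal sketch of the plain ngtpy path that the points above refer to (toy data, untuned parameters, restricted to the documented `create`/`batch_insert`/`search` calls):

```python
import numpy as np
import ngtpy

X = np.random.rand(1000, 128).astype(np.float32)  # toy vectors

# Build a plain ANNG index -- this is effectively what we get today.
ngtpy.create(b"anng-index", 128)
index = ngtpy.Index(b"anng-index")
index.batch_insert(X)
index.save()

# Query: returns (object_id, distance) pairs.
print(index.search(X[0], 10)[:3])

# The ONNG optimization step (graph reconstruction) seems to be exposed
# only through the `ngt` command-line tools at the moment, so it would
# have to run as a separate step on the saved index directory.
```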

VarIr avatar Sep 09 '19 11:09 VarIr

It seems ONNG can be enabled in ngtpy, but it is currently not documented. However, there is an example here: https://github.com/yahoojapan/NGT/issues/30
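
A heavily hedged sketch of what that undocumented route appears to look like, following the pattern from the linked issue. The `Optimizer` parameter names below (`num_of_outgoings`, `num_of_incomings`) are assumptions on my part and may differ between NGT versions, so please double-check against the linked example:

```python
import numpy as np
import ngtpy

X = np.random.rand(1000, 128).astype(np.float32)

# Step 1: build an ANNG with a high out-degree. This is the slow part.
ngtpy.create(b"anng-100", 128, edge_size_for_creation=100)
index = ngtpy.Index(b"anng-100")
index.batch_insert(X)
index.save()
index.close()

# Step 2: reconstruct the ANNG into an ONNG via the (undocumented) optimizer.
optimizer = ngtpy.Optimizer()                             # assumed class, see linked issue
optimizer.set(num_of_outgoings=10, num_of_incomings=120)  # assumed parameter names
optimizer.execute(b"anng-100", b"onng-index")

onng = ngtpy.Index(b"onng-index")
print(onng.search(X[0], 10)[:3])
```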

VarIr avatar Sep 23 '19 08:09 VarIr

New NGT release 1.7.10 should fix this: https://github.com/yahoojapan/NGT/releases/tag/v1.7.10

VarIr avatar Sep 26 '19 09:09 VarIr

NGT 1.8.0 brought documentation for ONNG. It is already activated here, but index building is extremely slow due to the difficult parameterization. This needs further investigation.

VarIr avatar Nov 11 '19 09:11 VarIr

Hi, this seems like really good work.

I am using BERT to find semantic similarity with cosine distance, but the high dimensionality may cause problems. So can I use hubness reduction here? I mean, will it make the BERT embeddings any better?

Thank you!

jaytimbadia avatar Jan 24 '21 14:01 jaytimbadia

Thanks for your interest. That's something I've been thinking about, but never found time to actually check.

BERT embeddings are typically high-dimensional, so hubness might play a role. You could first estimate the intrinsic dimension of these embeddings (because this is what actually drives hubness), e.g. with this method. If it is much lower than the embedding dimension, it's unlikely that hubness reduction leads to improvements. Alternatively, you could directly compare performance on your tasks with and without hubness reduction. If there's a performance improvement, I'd be curious to know.
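
A rough sketch of both checks with scikit-hubness (class and parameter names as I remember them from the 0.21 API; `X`/`y` are placeholders for your BERT embeddings and task labels):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from skhubness import Hubness
from skhubness.neighbors import KNeighborsClassifier

X = np.random.rand(2000, 768)      # placeholder for BERT sentence embeddings
y = np.random.randint(0, 5, 2000)  # placeholder labels for a downstream task

# 1) Measure hubness: k-occurrence skewness much greater than zero means
#    a few points show up as nearest neighbors of many others.
hub = Hubness(k=10, metric='cosine')
hub.fit(X)
print(f"k-skewness: {hub.score():.2f}")

# 2) Compare the downstream task with and without hubness reduction.
plain   = KNeighborsClassifier(n_neighbors=10, metric='cosine')
reduced = KNeighborsClassifier(n_neighbors=10, metric='cosine',
                               hubness='mutual_proximity')
print("without reduction:", cross_val_score(plain,   X, y, cv=5).mean())
print("with reduction   :", cross_val_score(reduced, X, y, cv=5).mean())
```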

VarIr avatar Jan 24 '21 16:01 VarIr

Thank you so much for the reply. I calculated the intrinsic dimension for BERT and it comes out to 18, much lower than I expected. Anyway, one question: can we use intrinsic dimensionality to check the quality of the embeddings we generate? For example, BERT embeddings of shape (100, 768) have a pretty low intrinsic dimension, while a random matrix of shape (100, 768) I tried had around 155. Does that mean BERT is quite well trained?

If yes, we could use this as a check: whenever we generate embeddings, we check their intrinsic dimension, and if it is low, the embeddings are less constrained and easier to fine-tune further, right?

I would love to know your thoughts!!
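
For context, here is a hand-rolled TwoNN-style estimate of the kind of comparison described above. This is a generic estimator written from scratch for illustration, not necessarily the method linked earlier in the thread:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    """Estimate intrinsic dimension from ratios of 2nd to 1st neighbor distances."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]          # dist[:, 0] is the query point itself
    mu = mu[np.isfinite(mu) & (mu > 1)]   # guard against duplicate points
    return len(mu) / np.log(mu).sum()     # maximum-likelihood estimate

rng = np.random.default_rng(0)
# Placeholder "embeddings": 18-dimensional data linearly embedded in 768 dims.
low_dim = rng.standard_normal((1000, 18)) @ rng.standard_normal((18, 768))
random_mat = rng.standard_normal((1000, 768))   # no low-dimensional structure

print(f"embedded data ID: {twonn_id(low_dim):.1f}")     # close to 18
print(f"random matrix ID: {twonn_id(random_mat):.1f}")  # much higher
```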

jaytimbadia avatar Jan 24 '21 17:01 jaytimbadia

18 isn't particularly high, but we've seen datasets where this came with high hubness (see e.g. pp. 2885-2886 of this previous paper). I am not aware of research directly linking intrinsic dimension to the quality (however that would be defined) of embeddings. Interesting research questions you pose there :)

VarIr avatar Jan 24 '21 17:01 VarIr