
NGT performance

Open VarIr opened this issue 6 years ago • 7 comments

Approx. neighbor search with ngtpy can be accelerated:

  • [x] Enable AVX on macOS (temporarily disabled due to an upstream bug in NGT; it is already enabled on Linux).
  • [x] Use NGT's optimization step (until then, the method is effectively (P)ANNG, not ONNG, I assume). Currently, this seems to be possible only via the command-line tools, not via the Python API (see the sketch below this list).
  • [ ] Set good default parameters for ONNG
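
For reference, here is a minimal sketch of the plain ngtpy path that the points above refer to (toy data, untuned parameters, restricted to the documented `create`/`batch_insert`/`search` calls):

```python
import numpy as np
import ngtpy

X = np.random.rand(1000, 128).astype(np.float32)  # toy vectors

# Build a plain ANNG index -- this is effectively what we get today.
ngtpy.create(b"anng-index", 128)
index = ngtpy.Index(b"anng-index")
index.batch_insert(X)
index.save()

# Query: returns (object_id, distance) pairs.
print(index.search(X[0], 10)[:3])

# The ONNG optimization step (graph reconstruction) seems to be exposed
# only through the `ngt` command-line tools at the moment, so it would
# have to run as a separate step on the saved index directory.
```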

VarIr avatar Sep 09 '19 11:09 VarIr

It seems ONNG can be enabled in ngtpy, but it is currently not documented. However, there is an example here: https://github.com/yahoojapan/NGT/issues/30
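
A heavily hedged sketch of what that undocumented route appears to look like, following the pattern from the linked issue. The `Optimizer` parameter names below (`num_of_outgoings`, `num_of_incomings`) are assumptions on my part and may differ between NGT versions, so please double-check against the linked example:

```python
import numpy as np
import ngtpy

X = np.random.rand(1000, 128).astype(np.float32)

# Step 1: build an ANNG with a high out-degree. This is the slow part.
ngtpy.create(b"anng-100", 128, edge_size_for_creation=100)
index = ngtpy.Index(b"anng-100")
index.batch_insert(X)
index.save()
index.close()

# Step 2: reconstruct the ANNG into an ONNG via the (undocumented) optimizer.
optimizer = ngtpy.Optimizer()                             # assumed class, see linked issue
optimizer.set(num_of_outgoings=10, num_of_incomings=120)  # assumed parameter names
optimizer.execute(b"anng-100", b"onng-index")

onng = ngtpy.Index(b"onng-index")
print(onng.search(X[0], 10)[:3])
```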

VarIr avatar Sep 23 '19 08:09 VarIr

New NGT release 1.7.10 should fix this: https://github.com/yahoojapan/NGT/releases/tag/v1.7.10

VarIr avatar Sep 26 '19 09:09 VarIr

NGT 1.8.0 brought documentation for ONNG. It is already activated here, but index building is extremely slow due to the difficult parameterization. This needs further investigation.

VarIr avatar Nov 11 '19 09:11 VarIr

Hi, this seems like really good work.

I am using BERT to find semantic similarity with cosine distance, but the high dimensionality may cause problems. So can I use hubness reduction here? I mean, will it make the BERT embeddings any better?

Thank you!

jaytimbadia avatar Jan 24 '21 14:01 jaytimbadia

Thanks for your interest. That's something I've been thinking about, but never found time to actually check.

BERT embeddings are typically high-dimensional, so hubness might play a role. You could first estimate the intrinsic dimension of these embeddings (because this is what actually drives hubness), e.g. with this method. If it is much lower than the embedding dimension, it's unlikely that hubness reduction leads to improvements. Alternatively, you could directly compare performance on your tasks with and without hubness reduction. If there's a performance improvement, I'd be curious to know.
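
A rough sketch of both checks with scikit-hubness (class and parameter names as I remember them from the 0.21 API; `X`/`y` are placeholders for your BERT embeddings and task labels):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from skhubness import Hubness
from skhubness.neighbors import KNeighborsClassifier

X = np.random.rand(2000, 768)      # placeholder for BERT sentence embeddings
y = np.random.randint(0, 5, 2000)  # placeholder labels for a downstream task

# 1) Measure hubness: k-occurrence skewness much greater than zero means
#    a few points show up as nearest neighbors of many others.
hub = Hubness(k=10, metric='cosine')
hub.fit(X)
print(f"k-skewness: {hub.score():.2f}")

# 2) Compare the downstream task with and without hubness reduction.
plain   = KNeighborsClassifier(n_neighbors=10, metric='cosine')
reduced = KNeighborsClassifier(n_neighbors=10, metric='cosine',
                               hubness='mutual_proximity')
print("without reduction:", cross_val_score(plain,   X, y, cv=5).mean())
print("with reduction   :", cross_val_score(reduced, X, y, cv=5).mean())
```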

VarIr avatar Jan 24 '21 16:01 VarIr

Thank you so much for the reply. I calculated the intrinsic dimension for BERT and it comes out to 18, much lower than I expected. Anyway, one question: can we use intrinsic dimensionality to check the quality of the embeddings we generate? For example, BERT embeddings of shape (100, 768) have a pretty low intrinsic dimension, while a random matrix of shape (100, 768) I tried had around 155. Does that mean BERT is quite well trained?

If yes, we could use this as a check: whenever we generate embeddings, we check their intrinsic dimension, and if it is low, the embeddings are less constrained and easier to fine-tune further, right?

I would love to know your thoughts!!
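
For context, here is a hand-rolled TwoNN-style estimate of the kind of comparison described above. This is a generic estimator written from scratch for illustration, not necessarily the method linked earlier in the thread:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    """Estimate intrinsic dimension from ratios of 2nd to 1st neighbor distances."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]          # dist[:, 0] is the query point itself
    mu = mu[np.isfinite(mu) & (mu > 1)]   # guard against duplicate points
    return len(mu) / np.log(mu).sum()     # maximum-likelihood estimate

rng = np.random.default_rng(0)
# Placeholder "embeddings": 18-dimensional data linearly embedded in 768 dims.
low_dim = rng.standard_normal((1000, 18)) @ rng.standard_normal((18, 768))
random_mat = rng.standard_normal((1000, 768))   # no low-dimensional structure

print(f"embedded data ID: {twonn_id(low_dim):.1f}")     # close to 18
print(f"random matrix ID: {twonn_id(random_mat):.1f}")  # much higher
```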

jaytimbadia avatar Jan 24 '21 17:01 jaytimbadia

18 isn't particularly high, but we've seen datasets where this came with high hubness (see e.g. pp. 2885-2886 of this previous paper). I am not aware of research directly linking intrinsic dimension to the quality (however that would be defined) of embeddings. Interesting research questions you pose there :)

VarIr avatar Jan 24 '21 17:01 VarIr