
Revert the compressed vectors to gensim format

HossamAmer12 opened this issue 2 years ago • 10 comments

I am using this pre-trained model: ft_cc.en.300_freqprune_50K_5K_pq_100.bin

That's my code:

import compress_fasttext
# import path for DecomposedMatrix may vary by version
from compress_fasttext.decomposition import DecomposedMatrix

ft_gensim = compress_fasttext.models.CompressedFastTextKeyedVectors.load(org_model_path)
new_vocab = ft_gensim.key_to_index
new_vectors = ft_gensim.vectors
new_ngrams = ft_gensim.vectors_ngrams

print(type(new_vectors)) # <class 'compress_fasttext.navec_like.PQ'>
print(type(new_ngrams)) # <class 'compress_fasttext.prune.RowSparseMatrix'>
new_vectors = DecomposedMatrix.compress(new_vectors, n_components=100, fp16=True)
new_ngrams = DecomposedMatrix.compress(new_ngrams, n_components=100, fp16=True)

I get this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.

Is there a way to convert the vectors and n-grams back to gensim format so that I can run this compress operation?

HossamAmer12 avatar Oct 10 '23 23:10 HossamAmer12
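A likely cause of the ValueError: DecomposedMatrix.compress expects a plain 2-D numpy array, while PQ and RowSparseMatrix are custom compressed containers, so numpy fails when it tries to coerce them into a homogeneous array. A minimal repro of the same class of error (this stands in for the custom objects; it is not the library's code):

```python
import numpy as np

# numpy refuses to build a regular float array from rows of unequal
# length, which is the same failure mode as passing a non-array custom
# object where a dense 2-D matrix is expected.
ragged = [np.zeros(3), np.zeros(5)]
try:
    np.array(ragged, dtype=float)
    raised = False
except ValueError as e:
    raised = True
    print("ValueError:", e)
print("raised:", raised)  # raised: True
```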

I believe this should be the solution:

new_vocab = ft_gensim.key_to_index
new_vectors = ft_gensim.vectors.unpack()
new_ngrams = ft_gensim.vectors_ngrams.unpack()

That being said, this code increases the size relative to the original model, because the final model after SVD stores the unpacked matrices. :(

HossamAmer12 avatar Oct 10 '23 23:10 HossamAmer12
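For reference, DecomposedMatrix-style compression is essentially a truncated SVD, which can be sketched in plain numpy on the unpacked matrices (an illustration of the idea, not the library's exact code; the function name and signature here are made up):

```python
import numpy as np

def svd_reduce(matrix, n_components, fp16=True):
    """Approximate `matrix` by a rank-`n_components` factorization,
    analogous in spirit to DecomposedMatrix.compress (illustrative only)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    comp = u[:, :n_components] * s[:n_components]   # (rows, k)
    dirs = vt[:n_components]                        # (k, cols)
    if fp16:  # halve storage at some precision cost
        comp, dirs = comp.astype(np.float16), dirs.astype(np.float16)
    return comp, dirs  # reconstruct with comp @ dirs

rng = np.random.default_rng(0)
m = rng.standard_normal((200, 300))
comp, dirs = svd_reduce(m, 100)
# storage: 200*100 + 100*300 halves vs 200*300 full floats (before fp16)
approx = comp.astype(np.float32) @ dirs.astype(np.float32)
print(approx.shape)  # (200, 300)
```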

Can you please explain once more what final goal you want to achieve, and what you would want the solution to look like?

avidale avatar Oct 11 '23 09:10 avidale

My original goal is: (1) take any language model from here and compress it down to 2-3 MB using prune_ft_freq; (2) use this model and implement the word/sentence look-up without external dependencies.

Since compress-fasttext is not building for me [PQ dependency], I am trying to use the posted ft_cc.en.300_freqprune_50K_5K_pq_100.bin and decrease the dimensions to 100 (or 150) so that I end up with a 2-3 MB fastText model.

Then I can worry about (2) above later. Of course, if you have pointers, that'd be great. For example, the hashing function used in the compress-fasttext lookup is not clear to me.

HossamAmer12 avatar Oct 11 '23 10:10 HossamAmer12

implement the word/sentence look-up without external dependencies

What do you mean by "without external dependencies"? You want to do the lookup in pure numpy, without gensim and compress_fasttext packages?

avidale avatar Oct 11 '23 10:10 avidale

Of course, if you have pointers, that'd be great. For example, the hashing function used in the compress-fasttext lookup is not clear to me.

What kind of pointers do you need? And why do you want to mess with the hashing function?

avidale avatar Oct 11 '23 10:10 avidale

implement the word/sentence look-up without external dependencies

What do you mean by "without external dependencies"? You want to do the lookup in pure numpy, without gensim and compress_fasttext packages?

Yes, that's right: in pure numpy. Similar to what's already done here -- if you could point out the differences, that'd be appreciated.

HossamAmer12 avatar Oct 11 '23 11:10 HossamAmer12
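The pure-numpy lookup being discussed can be sketched roughly as follows. Everything here is a placeholder (vocab, vectors, ngram_matrix, num_buckets), and Python's builtin hash() stands in for the model's real n-gram hash; a real port must reuse the exact hash and bucket count the saved model was built with, or the n-gram rows will not line up:

```python
import numpy as np

def word_vector(word, vocab, vectors, ngram_matrix, num_buckets,
                minn=3, maxn=6):
    """fastText-style lookup sketch: average the word's own vector (if in
    vocab) with the vectors of its character n-grams, hashed to rows of
    ngram_matrix. Illustrative only; hash() is NOT the model's hash."""
    extended = "<" + word + ">"
    ngrams = [extended[i:i + n]
              for n in range(minn, maxn + 1)
              for i in range(len(extended) - n + 1)]
    rows = [ngram_matrix[hash(ng) % num_buckets] for ng in ngrams]
    if word in vocab:
        rows.append(vectors[vocab[word]])
    if not rows:
        return np.zeros(vectors.shape[1])
    return np.mean(rows, axis=0)

# Toy data: one known word, 4-dimensional vectors, 10 n-gram buckets.
vocab = {"cat": 0}
vectors = np.ones((1, 4))
ngram_matrix = np.zeros((10, 4))
v = word_vector("cat", vocab, vectors, ngram_matrix, 10)
print(v.shape)  # (4,)
```

For sentence vectors, the usual approach is to average (possibly normalized) word vectors over the tokens.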

Of course, if you have pointers, that'd be great. For example, the hashing function is not clear in compress fastttext lookup.

What kind of pointers do you need? And why do you want to mess with the hashing function?

Requested pointers:
1- Can you help me narrow the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model down from 300 dimensions to 100?
2- Can you help me reproduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model? What are the steps to compress it? compress-fasttext is not working for me.

For the hash function: I do not wish to mess with it, but I want to know which hash function you are using. Can you provide a pointer to its code?

Appreciate your responses :))

HossamAmer12 avatar Oct 11 '23 11:10 HossamAmer12

1- Can you help me narrow the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model down from 300 dimensions to 100?

This model already has internal product-quantized vectors in 100 dimensions, just as its name indicates.

avidale avatar Oct 14 '23 20:10 avidale
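For context, product quantization splits each row into sub-vectors and replaces every sub-vector with the index of its nearest centroid in a small per-chunk codebook, so only small integer codes plus the codebooks are stored. A toy numpy sketch of the encode/decode round trip (illustrative only; compress-fasttext's PQ class has its own layout and a codebook-training step, and all names here are made up):

```python
import numpy as np

def pq_encode(matrix, codebooks):
    """Replace each row chunk with the index of its nearest centroid."""
    n_sub = len(codebooks)
    chunks = np.split(matrix, n_sub, axis=1)
    codes = np.stack([
        np.argmin(((chunk[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
        for chunk, cb in zip(chunks, codebooks)
    ], axis=1)
    return codes  # (rows, n_subvectors) small integers

def pq_decode(codes, codebooks):
    """Reassemble approximate rows by concatenating looked-up centroids."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])

rng = np.random.default_rng(0)
m = rng.standard_normal((20, 12))
# toy codebooks: 4 centroids per 4-dim chunk, 3 chunks
codebooks = [rng.standard_normal((4, 4)) for _ in range(3)]
codes = pq_encode(m, codebooks)
print(codes.shape, pq_decode(codes, codebooks).shape)  # (20, 3) (20, 12)
```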

2- Can you help me reproduce the ft_cc.en.300_freqprune_50K_5K_pq_100.bin model? What are the steps to compress it? compress-fasttext is not working for me.

I produced it with compress-fasttext. If it is not working for you, please describe exactly how to reproduce your problem, and I will fix it.

avidale avatar Oct 14 '23 20:10 avidale

I do not wish to mess with the function, but I want to know which hash function you are using. Can you provide a pointer to its code?

The function is called ft_ngram_hashes, and here I import it from gensim (I try two different import paths, because in different versions of gensim the hash function lives in different places).

avidale avatar Oct 14 '23 20:10 avidale
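That hash is, to the best of my reading, a 32-bit FNV-1a over the UTF-8 bytes of each n-gram, with fastText's signed-char cast before the XOR. A hedged pure-Python sketch of it (verify against your gensim version's ft_hash_bytes before relying on exact values):

```python
def ft_hash_bytes(bytez: bytes) -> int:
    """Sketch of the fastText/gensim n-gram hash: 32-bit FNV-1a with a
    signed-char cast, as used to map n-grams to bucket indices."""
    h = 2166136261                            # FNV-1a 32-bit offset basis
    for b in bytez:
        signed = b - 256 if b > 127 else b    # emulate C's int8_t cast
        h ^= signed & 0xFFFFFFFF              # sign-extended to uint32
        h = (h * 16777619) & 0xFFFFFFFF       # FNV prime, mod 2**32
    return h

# An n-gram's row in the n-gram matrix is then hash % num_buckets:
bucket = ft_hash_bytes("<ca".encode("utf-8")) % 2_000_000
print(0 <= bucket < 2_000_000)  # True
```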