Radim Řehůřek
No, I mean a dictionary where the key is a particular model name string (year?) and the value is the relevant Python object (Word2Vec or whatever). If, as you say, the models...
Aha, I see. Yes, that is a possibility -- if the models are sufficiently small, we could pickle everything as a single `dict` (no separate .npy files etc).
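For illustration, something like this, assuming gensim 4.x (the year keys and toy corpora below are made up):

```python
import pickle
from gensim.models import Word2Vec

# Toy corpora, one per "year" key -- stand-ins for the real per-year training data.
corpora = {
    "2010": [["gene", "protein", "cell"], ["protein", "binding", "site"]],
    "2011": [["cell", "membrane"], ["gene", "expression", "cell"]],
}

# Key: model name string (the year), value: the trained Word2Vec object.
models = {
    year: Word2Vec(sentences, vector_size=50, min_count=1)
    for year, sentences in corpora.items()
}

# Pickle everything as a single dict -- unlike model.save(), this produces
# no separate .npy files, so it only makes sense while the models stay small.
with open("models.pkl", "wb") as f:
    pickle.dump(models, f)

with open("models.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded["2011"].wv.most_similar("cell", topn=2))
```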
Thanks @gbrokos! That's definitely useful. What we also need is a clear description of the preprocessing (especially since this is the biomedical domain, where good tokenization / phrase detection is...
CC @mpenkov
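For context, here is roughly the kind of baseline we'd want documented (and improved upon), using gensim's own `simple_preprocess` and `Phrases`. The sample abstracts and thresholds below are made up, and real biomedical text needs more care (hyphenated terms, gene symbols, Greek letters, etc.):

```python
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases

# Hypothetical raw abstracts standing in for the real corpus.
raw_docs = [
    "Epidermal growth factor receptor signaling in lung cancer.",
    "Overexpression of the epidermal growth factor receptor gene.",
]

# Baseline tokenization: lowercase, strip punctuation, keep 2-15 char tokens.
tokenized = [simple_preprocess(doc) for doc in raw_docs]

# Phrase detection: frequent collocations like "growth factor" become single
# "growth_factor" tokens; min_count / threshold here are illustrative only.
bigrams = Phrases(tokenized, min_count=1, threshold=1)
processed = [bigrams[doc] for doc in tokenized]
print(processed[0])
```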
What does "super-large" mean? Can you be more specific? *EDIT*: If I'm reading the article correctly, we seem to need 8.97 TiB for the 57800 files in WET (plaintext) format....
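For scale, the average file size those figures imply:

```python
# Sanity check on the WET numbers above: ~163 MiB per file on average.
total_tib = 8.97
n_files = 57800
print(f"{total_tib * 1024 * 1024 / n_files:.0f} MiB/file")  # -> 163 MiB/file
```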
OK, this one seems to be a challenge :-) Maybe subsample?
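For example, a naive random subsample over the file listing (the file names below are placeholders for the real Common Crawl listing):

```python
import random

# One way to subsample: keep a random subset of the 57800 WET files.
all_files = [f"wet-{i:05d}.warc.wet.gz" for i in range(57800)]

random.seed(42)  # fix the seed so a published subsample is reproducible
subsample = random.sample(all_files, k=500)  # ~80 GiB at ~163 MiB/file
```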
Yes. Size: probably a few GBs of bz2 plaintext or JSON.
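At that size, users could stream it directly without ever decompressing to disk, e.g. (the file name is a placeholder):

```python
import bz2
import json

# Stream a bz2-compressed JSON-lines dump, one record at a time.
with bz2.open("dataset.json.bz2", "rt", encoding="utf-8") as fin:
    for line in fin:
        record = json.loads(line)
        print(record.get("id"))  # process each record here
```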
Nice find!
Thanks guys. What we want is for users who download this dataset to be able to use it easily. If the dataset requires users to jump through hoops, it's not...
No problem, as long as the process is clearly described to users, and the dataset is ready to use out of the box.
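Roughly, the bar we're aiming for via gensim's downloader API -- one call, no manual steps (the dataset name below is a placeholder, not a published dataset):

```python
import gensim.downloader as api

# "Out of the box": download once, then iterate straight over the data.
corpus = api.load("some-biomedical-corpus")
for doc in corpus:
    pass  # ready-to-use documents, no extra preprocessing required
```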