Radim Řehůřek comments

Results: 318 comments by Radim Řehůřek

Data release torrents

We haven't, but that's an interesting idea. Somebody would need to pick it up and execute it, though, with strong software engineering skills. It's not trivial because we definitely want to keep...

Data release torrents

> I don't download that much actually and I actually have very fast internet.

Interesting. What motivated you to open this ticket then, how did you think of it?

Add GloVe pretrained models from CommonCrawl corpus

Sure, why not. I'm +1 on including those. Please check https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model; we'll need: a) Text that motivates adding each model (should be easy), including any links to its original research...

New Corpus - Semantic Scholar

Sure, why not. Can you train it & open a PR? In the PR, please include **clear motivation** and **all scripts you used**, so the result is reproducible and its...

Pretrained FastText doesn't handle OOV words

Thanks for reporting. That does sound like a bug to me. CC @mpenkov can you please have a look?

Pretrained FastText doesn't handle OOV words

Ping @mpenkov -- this is the same issue as on that mailing list (I knew I already saw it somewhere!). Really confusing behaviour.

Pretrained FastText doesn't handle OOV words

AFAIR, it's this code in `__init__.py` inside the `fasttext-wiki-news-subwords-300` release: https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300 @mpenkov can you confirm?

Pretrained FastText doesn't handle OOV words

I'm curious too. @mpenkov can you please have a look? I know you reworked and clarified our FastText recently, thanks.

Add medical corpora + pretrained models

Another related resource, the PubMed Central dataset: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ (found via http://deepdive.stanford.edu/opendata/) Unlike the metadata above, this (smaller) dataset also contains the **article full texts**. Around 360,000 medical articles with full...

Add medical corpora + pretrained models

Another related free (non-commercial use) biomedical corpus, **including full text**: https://old.biomedcentral.com/about/datamining