Radim Řehůřek
We haven't, but that's an interesting idea. Somebody would need to pick it up and execute it though, with strong SW engineering. It's not trivial because we definitely want to keep...
> I don't download that much actually and I actually have very fast internet.

Interesting. What motivated you to open this ticket then? How did you think of it?
Sure, why not. I'm +1 on including those. Please check https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model; we'll need: a) Text that motivates adding each model (should be easy), including any links to its original research...
Sure, why not. Can you train it & open a PR? In the PR, please include **clear motivation** and **all scripts you used**, so the result is reproducible and its...
Thanks for reporting. That does sound like a bug to me. CC @mpenkov can you please have a look?
Ping @mpenkov -- this is the same issue as on that mailing list (I knew I already saw it somewhere!). Really confusing behaviour.
AFAIR, it's this code in `__init__.py` inside the `fasttext-wiki-news-subwords-300` release: https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300 @mpenkov can you confirm?
I'm curious too. @mpenkov can you please have a look? I know you reworked and clarified our FastText recently, thanks.
Another related resource, the PubMed Central dataset: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ (found via http://deepdive.stanford.edu/opendata/) Unlike the metadata above, this (smaller) dataset also contains the **article full texts**. Around 360,000 medical articles with full...
Another related free (non-commercial use) biomedical corpus, **including full text**: https://old.biomedcentral.com/about/datamining