Radim Řehůřek
No, I mean a dictionary where the key is a particular model name string (year?) and the value is the relevant Python object (Word2Vec or whatever). If, as you say, the models...
Aha, I see. Yes, that is a possibility -- if the models are sufficiently small, we could pickle everything as a single `dict` (no separate .npy files etc).
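For illustration, something like this, assuming gensim 4.x (the year keys and toy corpora below are made up):

```python
import pickle
from gensim.models import Word2Vec

# Toy corpora, one per "year" key -- stand-ins for the real per-year training data.
corpora = {
    "2010": [["gene", "protein", "cell"], ["protein", "binding", "site"]],
    "2011": [["cell", "membrane"], ["gene", "expression", "cell"]],
}

# Key: model name string (the year), value: the trained Word2Vec object.
models = {
    year: Word2Vec(sentences, vector_size=50, min_count=1)
    for year, sentences in corpora.items()
}

# Pickle everything as a single dict -- unlike model.save(), this produces
# no separate .npy files, so it only makes sense while the models stay small.
with open("models.pkl", "wb") as f:
    pickle.dump(models, f)

with open("models.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded["2011"].wv.most_similar("cell", topn=2))
```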
Thanks @gbrokos! That's definitely useful. What we also need is a clear description of the preprocessing (especially since this is the biomedical domain, where good tokenization / phrase detection is...
CC @mpenkov
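For context, here is roughly the kind of baseline we'd want documented (and improved upon), using gensim's own `simple_preprocess` and `Phrases`. The sample abstracts and thresholds below are made up, and real biomedical text needs more care (hyphenated terms, gene symbols, Greek letters, etc.):

```python
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases

# Hypothetical raw abstracts standing in for the real corpus.
raw_docs = [
    "Epidermal growth factor receptor signaling in lung cancer.",
    "Overexpression of the epidermal growth factor receptor gene.",
]

# Baseline tokenization: lowercase, strip punctuation, keep 2-15 char tokens.
tokenized = [simple_preprocess(doc) for doc in raw_docs]

# Phrase detection: frequent collocations like "growth factor" become single
# "growth_factor" tokens; min_count / threshold here are illustrative only.
bigrams = Phrases(tokenized, min_count=1, threshold=1)
processed = [bigrams[doc] for doc in tokenized]
print(processed[0])
```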
What does "super-large" mean? Can you be more specific? *EDIT*: If I'm reading the article correctly, we seem to need 8.97 TiB for the 57800 files in WET (plaintext) format....
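For scale, the average file size those figures imply:

```python
# Sanity check on the WET numbers above: ~163 MiB per file on average.
total_tib = 8.97
n_files = 57800
print(f"{total_tib * 1024 * 1024 / n_files:.0f} MiB/file")  # -> 163 MiB/file
```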
OK, this one seems to be a challenge :-) Maybe subsample?
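For example, a naive random subsample over the file listing (the file names below are placeholders for the real Common Crawl listing):

```python
import random

# One way to subsample: keep a random subset of the 57800 WET files.
all_files = [f"wet-{i:05d}.warc.wet.gz" for i in range(57800)]

random.seed(42)  # fix the seed so a published subsample is reproducible
subsample = random.sample(all_files, k=500)  # ~80 GiB at ~163 MiB/file
```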
Yes. Size: probably a few GBs of bz2 plaintext or JSON.
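At that size, users could stream it directly without ever decompressing to disk, e.g. (the file name is a placeholder):

```python
import bz2
import json

# Stream a bz2-compressed JSON-lines dump, one record at a time.
with bz2.open("dataset.json.bz2", "rt", encoding="utf-8") as fin:
    for line in fin:
        record = json.loads(line)
        print(record.get("id"))  # process each record here
```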
Nice find!
Thanks guys. What we want is for users who download this dataset to be able to use it easily. If the dataset requires users to jump through hoops, it's not...
No problem, as long as the process is clearly described to users, and the dataset is ready to use out of the box.
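Roughly, the bar we're aiming for via gensim's downloader API -- one call, no manual steps (the dataset name below is a placeholder, not a published dataset):

```python
import gensim.downloader as api

# "Out of the box": download once, then iterate straight over the data.
corpus = api.load("some-biomedical-corpus")
for doc in corpus:
    pass  # ready-to-use documents, no extra preprocessing required
```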