Radim Řehůřek
Thanks for trying the beta and reporting! I don't think we can do much about sparsetools (scipy provides no alternative AFAIK), but we can definitely fix the `float`.
@raffaem can you run whatever steps you used before, but on the current `develop` branch of Gensim? We removed and fixed a bunch of code, so maybe this is not...
Removing from 4.0.0. Will revisit when @raffaem follows up.
Getting rid of (not showing) the sparsetools warning makes sense. But I don't think try-except will help here – it's not an exception. As far as I'm aware scipy doesn't...
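To illustrate the point about warnings versus exceptions, here is a minimal sketch using only the standard `warnings` module. `noisy_call` is a hypothetical stand-in for whatever library call emits the warning; it is not the actual scipy code path, and the warning category is an assumption.

```python
import warnings


def noisy_call():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("hypothetical deprecation notice", DeprecationWarning)
    return 42


def call_with_try_except():
    """Unless a filter turns warnings into errors, nothing is raised here,
    so the except branch is never taken and the warning is still emitted."""
    try:
        return noisy_call(), False
    except DeprecationWarning:
        return None, True


def call_silenced():
    """Hide the warning with a filter that only applies inside this block."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return noisy_call()
```

The filter set inside `catch_warnings()` is undone when the block exits, so the rest of the program keeps its normal warning behaviour.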
And in https://github.com/scipy/scipy/issues/5348 by the scipy team. Scipy (and the whole pydata ecosystem) is much easier to deploy and manage than it was 7 years ago. Plus gensim now compiles...
> In this way, `Phrases` will treat `European Comission` the same way it will treat `Comission There`.

No – you pass in sentences (lists of tokens) to Phrases, not strings...
I don't know about non-word tokens. But definitely on full stops, to avoid that example of `Commission There` cross-sentence overlap.
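A rough sketch of why splitting on full stops matters, in plain Python. This is not Gensim's implementation: `split_into_sentences` and `count_bigrams` are hypothetical helpers, and the split on `"."` is deliberately naive (a real pipeline would use a proper sentence tokenizer).

```python
from collections import Counter


def split_into_sentences(text):
    """Naive split on full stops; returns a list of token lists."""
    sentences = []
    for chunk in text.split("."):
        tokens = chunk.split()
        if tokens:
            sentences.append(tokens)
    return sentences


def count_bigrams(sentences):
    """Count adjacent token pairs within each sentence only."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts


text = "We met the European Commission. There was no agreement."
sentences = split_into_sentences(text)
bigrams = count_bigrams(sentences)

# Pairs are only formed inside a sentence, so the cross-sentence pair is absent.
print(("European", "Commission") in bigrams)
print(("Commission", "There") in bigrams)
```

Token lists of this shape are the kind of input `Phrases` expects, one list per sentence.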
Does running your processes with `OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1` fix the problem?
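If setting the variables on the command line is inconvenient, one way to do the same from Python is to launch the worker with a modified environment. A minimal sketch; `single_threaded_env` and `run_single_threaded` are hypothetical helper names, and the child command is only a placeholder that prints the two variables back.

```python
import os
import subprocess
import sys


def single_threaded_env(base_env=None):
    """Copy the environment and pin both thread-count variables to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_NUM_THREADS"] = "1"
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_single_threaded(args):
    """Run a command in a child process with the pinned environment."""
    return subprocess.run(
        args, env=single_threaded_env(), capture_output=True, text=True, check=True
    )


# Placeholder child process: it only echoes the two variables it received.
child_code = (
    "import os; "
    "print(os.environ['OPENBLAS_NUM_THREADS'], os.environ['OMP_NUM_THREADS'])"
)
result = run_single_threaded([sys.executable, "-c", child_code])
print(result.stdout.strip())
```

The variables are set before the child process starts, which matters because these libraries typically read them once, when they are loaded.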
Thanks for reporting. Are you interested in figuring out the cause? All code lives in the [phrases](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py) module, and is fairly straightforward.
Thanks for looking into this. IIRC we went for strings to save on RAM; tuples introduce a lot of memory overhead. These "phrases" models are memory-hungry, by the nature of what...