Radim Řehůřek
Looks like your `title_tokens.txt.gz` file contains invalid UTF-8 -- can you check? That is, ignore gensim and any machine learning. Just iterate through the gzip file, check that every line...
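A minimal sketch of such a check, assuming the file is a gzip-compressed text file with one record per line. The helper name `find_invalid_utf8_lines` and its `max_report` parameter are made up for illustration and are not part of gensim or smart_open:

```python
import gzip


def find_invalid_utf8_lines(path, max_report=10):
    """Return (line_number, error_message) pairs for lines that are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    bad = []
    # Open in binary mode so decoding happens explicitly, one line at a time.
    with gzip.open(path, "rb") as fin:
        for lineno, raw in enumerate(fin, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                bad.append((lineno, str(err)))
                # Stop early once enough bad lines have been collected.
                if len(bad) >= max_report:
                    break
    return bad
```

Calling `find_invalid_utf8_lines("title_tokens.txt.gz")` would return an empty list if every line decodes cleanly; otherwise it gives the 1-based line numbers and decoder messages for the first few offending lines.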
Thanks. Closing until there's a clearly demonstrated proof-of-concept or attack vector. Ideally with a mitigation PR where relevant.
@mxmlnkn @Rogdham @pauldmccarthy seeing the conversation above: is there any appetite to unify the [indexed_gzip](https://github.com/pauldmccarthy/indexed_gzip) + [indexed_bzip2](https://github.com/mxmlnkn/indexed_bzip2/tree/master/python/indexed_bzip2) + [python-xz](https://github.com/Rogdham/python-xz) approach? I mean in the sense of a unified API solution...
Sorry @julianpollmann – I ran the workflow now! @mpenkov mergeable or no?
@mpenkov how about a quick call to review this & other "maintenance" PRs?
Thanks @julianpollmann !
> setup.py now uses the project README text for long_description

IIRC setup.py (PyPI) used to require the RST format, whereas our README uses the Markdown format. That's why we kept...