python-goose
Fixed unicode handling, Python 3 support, requests as network backend, better content root extraction and other awesome features
- STOP_WORDS are stored as unicode, so word candidates must also be kept as unicode to compare them against the dictionary correctly
- In some languages (Russian, for example), short stop words are joined to the next word in the sentence with a non-breaking space. To detect those stop words, the tokenizer must treat nbsp as a separator. This fixes #223
- Python 3 support (#220 merged)
- Move to the requests library for the HTTP backend. This makes #244, #237, #64 obsolete and fixes some issues in the tracker
- Analyze all possible text root nodes and select the best one instead of stopping at the first text root node candidate
- Improve text selection filters
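
The tokenizing and root-selection points above can be illustrated with a minimal sketch. It is not the code from this PR: `STOP_WORDS` here is a tiny made-up set, and `tokenize`, `count_stop_words` and `pick_best_root` are hypothetical names. It assumes Python 3, where every `str` is unicode, and it scores candidates purely by stop-word count, which is a simplification.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the real lists are much larger and language-specific.
STOP_WORDS = {"в", "и", "на", "the", "of"}

# Split on any run of whitespace. In Python 3, \s already matches the
# non-breaking space (U+00A0) for str patterns; it is listed explicitly
# so the intent stays visible.
_SPLIT_RE = re.compile(r"[\s\u00a0]+")


def tokenize(text):
    """Lowercase the text and split it into unicode word candidates."""
    return [token for token in _SPLIT_RE.split(text.lower()) if token]


def count_stop_words(text):
    """Count tokens that appear in the stop-word set."""
    return sum(1 for token in tokenize(text) if token in STOP_WORDS)


def pick_best_root(candidates):
    """Score every candidate text and return the highest-scoring one,
    rather than returning the first candidate that looks acceptable."""
    return max(candidates, key=count_stop_words)
```

With this sketch, `tokenize("в\u00a0Москве")` yields two tokens, so the stop word `в` is recognized even though it is glued to the next word by a non-breaking space.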
@grangier please merge this, Python 3 compatibility would be great to have
@grangier +1 on merging this PR. Python3 support is really needed.
@grangier Please merge, we are no longer using Python 2.x
FYI, I've produced a PyPI package, goose3, that can be found at https://github.com/goose3/goose3
I appreciate all the work that @grangier has done, but I really needed goose to work on Python 3. If you'd like to fix any bugs, tests, etc., I'm more than happy to put in time to look at pull requests and merge them. Thank you.