python-goose

Fixed unicode handling, Python 3 support, Requests as network backend, better content root extraction and other awesome features

Open Lol4t0 opened this issue 9 years ago • 4 comments

  • Because STOP_WORDS are stored as unicode, word candidates should also be kept as unicode so they can be compared against the dictionary correctly
  • In some languages, short stop words are joined to the next word in the sentence by a non-breaking space. To detect those stop words, the tokenizer should treat nbsp as a separator. Russian is an example. This fixes #223
  • Python 3 support (#220 merged)
  • Move to the requests library for the HTTP backend. This makes #244, #237, #64 obsolete and fixes some issues in the tracker
  • Analyze all possible text root nodes and select the best one; do not stop at the first text root node candidate
  • Improve text selection filters
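
To illustrate the first, second and fifth points, here is a minimal sketch, not the code from this PR. It assumes a tiny hypothetical Russian stop-word set and hypothetical helper names (`tokenize`, `count_stop_words`, `best_candidate`); the library's real stop-word lists and node scoring are more involved. It decodes bytes to unicode before comparing, splits on the no-break space (U+00A0) as well as ordinary whitespace, and picks the highest-scoring candidate instead of stopping at the first one.

```python
import re

# Hypothetical stop-word set for illustration only; stored as unicode text.
STOP_WORDS = {"в", "на", "и", "с"}

# Split on runs of whitespace. U+00A0 (no-break space) is listed explicitly
# so the intent is clear even where \s would not cover it.
_TOKEN_SPLIT = re.compile(r"[\s\u00a0]+")


def tokenize(text):
    """Return lowercase unicode tokens, decoding UTF-8 bytes first."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return [token for token in _TOKEN_SPLIT.split(text.lower()) if token]


def count_stop_words(text):
    """Count tokens that appear in the unicode stop-word set."""
    return sum(1 for token in tokenize(text) if token in STOP_WORDS)


def best_candidate(candidates):
    """Score every candidate text and return the one with most stop words."""
    return max(candidates, key=count_stop_words)
```

Scoring every candidate with `max` is the simplest way to express "do not stop at the first candidate"; a real implementation would score DOM nodes rather than plain strings.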

Lol4t0 avatar Nov 13 '15 15:11 Lol4t0

@grangier please merge this, Python 3 compatibility would be great to have

andreis avatar Mar 15 '16 15:03 andreis

@grangier +1 on merging this PR. Python3 support is really needed.

adityarustgi avatar May 02 '16 19:05 adityarustgi

@grangier Please merge, we are no longer using Python 2.x

sandeepsayone avatar May 30 '16 09:05 sandeepsayone

FYI, I've produced a PyPI package, goose3, that can be found at https://github.com/goose3/goose3

I appreciate all the work that @grangier has done, but I really needed goose to work on Python 3. If you'd like to fix any bugs, tests, etc., I'm more than happy to put in time to look at pull requests and merge them. Thank you.

lababidi avatar Mar 28 '17 20:03 lababidi