Goose is non-functional in Python 3
The title is largely self-explanatory.
The primary limitation seems to be the reliance on BeautifulSoup 3, which has been EOL for quite a while now and really should be migrated away from.
~~Actually, where is beautifulsoup used at all? I can't find any reference in the codebase to it at all~~ It's being used in lxml somewhere, somehow, despite no explicit mention of it anywhere.
Also, unittest sucks: it doesn't report anything informative when a test module hits an `ImportError`. You can apparently use nosetests to run the same tests with sane output.
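For anyone else stuck on the unhelpful import failures, here is a minimal sketch of one way to surface them without switching test runners: import each test module by hand and keep the full traceback. The helper name `find_import_errors` and the module names in the usage comment are hypothetical, not part of goose.

```python
import importlib
import traceback


def find_import_errors(module_names):
    """Try importing each module; return {name: traceback text} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Keep the whole traceback so the missing dependency is visible.
            failures[name] = traceback.format_exc()
    return failures


# Usage (hypothetical module names, substitute the real test modules):
#   for name, tb in find_import_errors(["tests.some_module"]).items():
#       print(name)
#       print(tb)
```

This only catches `ImportError` (and its subclass `ModuleNotFoundError`); other exceptions raised at import time would still propagate.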
jieba can be replaced with jieba3k.
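A rough sketch of how the dependency could be picked per interpreter, e.g. from a `setup.py`. The function name `chinese_segmenter_requirement` is made up for illustration, and I have not checked how goose actually declares its requirements.

```python
import sys


def chinese_segmenter_requirement(version_info=sys.version_info):
    """Return the package name to depend on for Chinese word segmentation.

    Assumption: jieba3k is used on Python 3 and jieba on Python 2,
    as suggested in this thread.
    """
    return "jieba3k" if version_info[0] >= 3 else "jieba"


# Usage (hypothetical): install_requires=[chinese_segmenter_requirement(), ...]
```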
Going through everything, it appears that the heavy dependency on soupparser is the problem. Patching in bs4 in place of bs3 at runtime is not workable, since lxml passes arguments to `__init__` that bs4 does not accept.
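Since BeautifulSoup is never imported explicitly in the codebase, it can be unclear which version is even present. A small diagnostic sketch, assuming only that BeautifulSoup 3 installs as the top-level module `BeautifulSoup` and version 4 as `bs4` (the helper name is hypothetical):

```python
import importlib.util


def available_soup_packages():
    """Report which BeautifulSoup generations are importable here."""
    candidates = {"bs3": "BeautifulSoup", "bs4": "bs4"}
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in candidates.items()
    }


# Usage: print(available_soup_packages())
```

This only tells you what is installed; it does not make soupparser work with bs4.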
I have the unit tests running:
Ran 126 tests in 10.607s
FAILED (errors=54, failures=49)
Welp! Time to look at other text extractors.
Is there any timeline for Python 3 compatibility?
+1 for Python 3 support... Is there any schedule? Or do you not care at all?
@kotrfa - It's not a direct equivalent, but I wound up using python-readability for text extraction. It works well enough.
I prepared a PR to add py3 support: https://github.com/grangier/python-goose/pull/220
+1 for this. Why does this still use Python 2!?
Still waiting for Python 3 support :)
I believe this project is dead. Use https://github.com/codelucas/newspaper instead, which is inspired by goose and supports Python 3 flawlessly.
Yep, I already knew it but I just wanted to do some comparison of the available tools. Indeed, I will use it. Thanks!
Any plans to introduce Python 3 support to this project?
Hi everyone, this may come off as self-promotion, but I went ahead and forked goose to work with Python 3: http://github.com/goose3/goose3 Enjoy!