python-goose
Read article content using goose retrieving nothing
I am trying to use Goose to read from .html files (a URL is specified here for convenience in the examples)[1]. But at times it doesn't show any text. Please help me out with this issue.
Goose version used: https://github.com/agolo/python-goose/. The present version gives some errors.
    from goose import Goose
    from requests import get

    response = get('http://www.highbeam.com/doc/1P3-979471971.html')
    extractor = Goose()
    article = extractor.extract(raw_html=response.content)
    text = article.cleaned_text
    print text

Link to the same question asked on Stack Overflow: http://stackoverflow.com/questions/30381944/read-article-content-using-goose-retrieving-nothing
    >>> from goose import Goose
    >>> g = Goose()
    >>> url = 'http://www.highbeam.com/doc/1P3-979471971.html'
    >>> article = g.extract(url=url)
    >>> article.title
    u'Tamil Nadu appoints Commission to look into Chennai stampede deaths - Hindustan Times (New Delhi, India)'
@JoshYuJump the OP was talking about the cleaned_text field, which is the body of the article, not the title.
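To make that distinction easy to check, here is a minimal sketch that reports whether an extracted article has a title, a body, or both. `summarize_article` and `check_url` are hypothetical helpers, not part of Goose. They only rely on the `title` and `cleaned_text` attributes and the `extract(url=...)` call already used in the snippets above.

```python
def summarize_article(article):
    """Report which fields an extracted article actually contains.

    `article` can be any object exposing `title` and `cleaned_text`,
    such as the result of Goose().extract(...).
    """
    # Treat missing or None values the same as empty strings.
    title = (getattr(article, "title", "") or "").strip()
    body = (getattr(article, "cleaned_text", "") or "").strip()
    return {
        "has_title": bool(title),
        "has_body": bool(body),
        "body_length": len(body),
    }


def check_url(url):
    """Hypothetical convenience wrapper: extract a URL and summarize it."""
    # Imported lazily so summarize_article works without python-goose installed.
    from goose import Goose

    return summarize_article(Goose().extract(url=url))
```

Calling `check_url` on the URL from the question (this needs python-goose and network access) should show directly whether the title is found while `cleaned_text` comes back empty, which is the case the OP describes.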
I posted an answer on Stack Overflow. It's not a bulletproof solution, but Goose does use it as part of its algorithm.