python-goose Read article content using goose retrieving nothing


Open abhigenie92 opened this issue 10 years ago • 2 comments

I am trying to use Goose to read from .html files (a URL is specified here for the sake of convenience in the example)[1], but at times it doesn't show any text. Please help me out with this issue.

Goose version used: https://github.com/agolo/python-goose/ (the present version gives some errors).

from goose import Goose
from requests import get

response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text
print text

Link to same question asked at stackoverflow: http://stackoverflow.com/questions/30381944/read-article-content-using-goose-retrieving-nothing

May 21 '15 23:05 abhigenie92

from goose import Goose
g = Goose()
url = 'http://www.highbeam.com/doc/1P3-979471971.html'
article = g.extract(url=url)
article.title
u'Tamil Nadu appoints Commission to look into Chennai stampede deaths - Hindustan Times (New Delhi, India)'

May 25 '15 06:05 JoshYuJump

@JoshYuJump the OP was talking about the cleaned_text field, which is the body of the article, not the title.

I posted an answer on Stack Overflow. It's not a bulletproof solution, but Goose does use it as part of its algorithm.

Jun 08 '15 08:06 ThiemNguyen
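Since the thread turns on the difference between title and cleaned_text, a small check can make it explicit which of the two came back empty for a given page. This is only a sketch: FakeArticle and empty_fields are hypothetical names that are not part of Goose, and the stand-in models only the two attributes mentioned in the comments above. With a real Goose article object, the same helper would presumably work as long as those attributes are strings.

```python
from collections import namedtuple

# Hypothetical stand-in for the object returned by Goose().extract(...);
# only the two attributes discussed in this thread are modelled.
FakeArticle = namedtuple("FakeArticle", ["title", "cleaned_text"])


def empty_fields(article, fields=("title", "cleaned_text")):
    """Return the names of the given fields that are missing or blank."""
    missing = []
    for name in fields:
        value = getattr(article, name, None)
        # Treat None, '' and whitespace-only strings as "nothing extracted".
        if value is None or not value.strip():
            missing.append(name)
    return missing


# Mirrors the situation described in the thread: a title is found,
# but the article body (cleaned_text) is empty.
article = FakeArticle(
    title=u"Tamil Nadu appoints Commission to look into Chennai stampede deaths",
    cleaned_text=u"",
)
result = empty_fields(article)
```

Here result lists only cleaned_text, which separates "the page could not be read at all" from "the page was read but no body text was extracted".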
