python-goose

Forbes.com text extraction gives redundant date in some cases

Open ethan-hunt-007 opened this issue 9 years ago • 1 comment

When extracting text from Forbes.com, I am not getting the article content I need, and in many cases I get unrelated text instead. Here is the code:

>>> from goose import Goose
>>> g = Goose()
>>> art = g.extract(url='http://www.forbes.com/2009/03/18/federal-funds-commerce-ibm-markets-transcript-aig.html')
>>> art.title
u"Full Text: Edward Liddy's Testimony Before Congress"
>>> art.cleaned_text
u'Katy Perry earned $135 million this year--more than any other entertainer on Earth.'

I get this same text for many links of this type. What could be the issue, and how can I correct it?

ethan-hunt-007 Jul 19 '15 08:07
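
A quick way to confirm the symptom reported above (the same text coming back for many different links) is to group URLs by the text extracted from them. The following is only a minimal sketch: `find_repeated_texts` is a hypothetical helper, not part of goose, and it assumes you have already collected a dict mapping each URL to its `cleaned_text`.

```python
from collections import defaultdict


def find_repeated_texts(extracted):
    """Return {text: [urls]} for every text extracted from more than one URL.

    `extracted` maps URL -> extracted text (e.g. article.cleaned_text).
    Hypothetical helper for diagnosis only; not part of goose.
    """
    by_text = defaultdict(list)
    for url, text in extracted.items():
        # Collapse whitespace so trivially different copies compare equal.
        normalized = " ".join(text.split())
        if normalized:
            by_text[normalized].append(url)
    # Text shared by several URLs is likely page chrome, not the article.
    return {text: urls for text, urls in by_text.items() if len(urls) > 1}


if __name__ == "__main__":
    # Placeholder URLs and texts, for illustration only.
    sample = {
        "http://example.com/a": "Promo blurb shown on every page.",
        "http://example.com/b": "Promo  blurb shown on every page.",
        "http://example.com/c": "A genuinely different article body.",
    }
    for text, urls in find_repeated_texts(sample).items():
        print(len(urls), "URLs share:", text)
```

If many unrelated URLs end up in one group, the extractor is probably picking up a shared page element rather than the article body.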

@ethan-hunt-007 Forbes is largely incompatible with text extractors like goose, newspaper, etc., because its current site uses JavaScript to render most of the page. That means you'd have to load the page in a headless browser of some sort, let the JS run, and then extract the text/data from the rendered HTML. That's a lot more work and probably wouldn't perform well.

mhamann Jul 07 '16 20:07
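
The headless-browser route described in the reply above could look roughly like the sketch below. It is not something the thread's authors posted. It assumes Python 3 with Selenium 4 and a local Chrome install for rendering, and goose3 (the Python 3 fork of goose) for extraction via `extract(raw_html=...)`. The function names `render_html` and `extract_text` are hypothetical, and both accept an injected driver factory or extractor so the flow can be tried without either library installed.

```python
def _headless_chrome():
    # Assumption: Selenium 4 and a local Chrome are available.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def render_html(url, driver_factory=_headless_chrome):
    """Load `url` in a browser and return the page HTML after scripts ran."""
    driver = driver_factory()
    try:
        driver.get(url)
        # Pages that load content asynchronously may need an explicit
        # wait here before the article body is present.
        return driver.page_source
    finally:
        driver.quit()


def extract_text(html, extractor=None):
    """Run an article extractor over already-rendered HTML."""
    if extractor is None:
        # Assumption: goose3 is installed; imported lazily so the rest
        # of the sketch works without it.
        from goose3 import Goose

        extractor = Goose()
    article = extractor.extract(raw_html=html)
    return article.title, article.cleaned_text


if __name__ == "__main__":
    html = render_html(
        "http://www.forbes.com/2009/03/18/"
        "federal-funds-commerce-ibm-markets-transcript-aig.html"
    )
    title, text = extract_text(html)
    print(title)
    print(text[:200])
```

Starting a browser for every URL is slow, which matches the performance concern raised above. Reusing one driver across many URLs would reduce that cost, and whether the result is good enough for Forbes pages would still need to be checked by hand.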