python-goose
Forbes.com text extraction gives redundant data in some cases
When extracting from Forbes.com, I often do not get the data I need and get unrelated data instead. Here is the code:
>>> from goose import Goose
>>> g = Goose()
>>> art=g.extract(url='http://www.forbes.com/2009/03/18/federal-funds-commerce-ibm-markets-transcript-aig.html')
>>> art.title
u"Full Text: Edward Liddy's Testimony Before Congress"
>>> art.cleaned_text
u'Katy Perry earned $135 million this year--more than any other entertainer on Earth.'
I get this same text for many links of this type. What could be causing this, and how can I fix it?
@ethan-hunt-007 Forbes is largely incompatible with text extractors like goose and newspaper, because its current site uses JavaScript to render most of the page. That means you'd have to render the page in a headless browser of some sort, let the JS run, and then extract the text or data from the rendered HTML. That's a lot more work and probably wouldn't perform well.
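A rough sketch of that approach, in case it helps. It assumes Selenium with headless Chrome for the rendering step (my choice, not something goose ships with) and relies on `Goose.extract` accepting a `raw_html` argument. The helper names `render_html` and `extract_rendered` and the fixed `wait_seconds` delay are made up for illustration, and I haven't checked this against Forbes itself.

```python
import time


def render_html(url, wait_seconds=5):
    """Load `url` in headless Chrome and return the DOM after JavaScript has run.

    Assumes Selenium and a matching Chrome driver are installed.
    """
    # Imported lazily so this module can be loaded without Selenium present.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Crude fixed wait for the page's scripts to finish (an assumption;
        # an explicit wait on a known element would be more reliable).
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading fails.
        driver.quit()


def extract_rendered(url, render=render_html, extractor=None):
    """Render `url` first, then run the extractor on the rendered HTML."""
    html = render(url)
    if extractor is None:
        # Imported lazily; on Python 3 the goose3 fork would be used instead.
        from goose import Goose
        extractor = Goose()
    # Pass the already-rendered markup instead of letting goose fetch the URL.
    return extractor.extract(raw_html=html)
```

Usage would be something like `art = extract_rendered(url)` followed by `art.cleaned_text`. Starting a browser for every page is slow, which is the performance cost mentioned above.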