newspaper
Not working on "nytimes.com"
I tried a few articles from NYTimes.com, but it only parses half of each article and misses the first half. Example URLs: url 1 url2
Did you check the website to make sure that you haven't reached the max free articles that you are allowed to see for the month?
@dlundergreen I don't remember the last time I opened NYTimes, so I'm sure I haven't crossed the limit.
This also happens for other links. For example, on this URL only a part of the body is parsed. Is this because the individual <p> elements are in different parent <div>s?
NYTimes articles are split across two DIVs, and the second one is generally bigger, so newspaper picks it.
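To illustrate the effect described above, here is a toy sketch. It is not newspaper's actual code: the markup, the `best_parent` helper, and the "sum of paragraph text lengths" score are all assumptions made up for this example. It only shows that when each `<p>` credits its direct parent, the larger of two sibling DIVs wins and the other half of the article is dropped.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one article body split across two <div>s.
HTML = """
<article>
  <div id="first"><p>Short intro.</p></div>
  <div id="second"><p>A much longer body paragraph.</p><p>Another long body paragraph here.</p></div>
</article>
""".strip()


def best_parent(root):
    """Score each direct parent of a <p> by the total text length of its paragraphs."""
    scores = {}
    for parent in root.iter():
        for child in parent:
            if child.tag == "p":
                key = parent.get("id")
                scores[key] = scores.get(key, 0) + len(child.text or "")
    # The single highest-scoring parent is taken as "the article".
    return max(scores, key=scores.get), scores


winner, scores = best_parent(ET.fromstring(HTML))
print(winner, scores)
```

Here the `second` DIV outscores `first`, so an extractor that keeps only the top-scoring node would return the second half of the text.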
Was anyone able to solve this?
I found that changing PARENT_DECAY to 1.0 makes it work for NYT.
@Cabu I couldn't find a variable named PARENT_DECAY on the master branch, so where is it located?
@loaighoraba
paper = newspaper.build(source_url, PARENT_DECAY=1.0)
@Cabu It seems this has changed in the master branch; there is no such variable.
@loaighoraba Oh yes, I see. Now it seems to be hardcoded in extractor.py line 825 :/ Having it as a 'hidden' feature was practical for sources like the NYT.
@Cabu I see. However, this won't solve the issue if the common parent is more than two levels up. Thanks for this anyway.
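A rough sketch of why a decay of 1.0 helps, and why it stops helping for deeper nesting. This is a simplified model, not the code in extractor.py: the `propagate` function, the node names, the scores, and the assumption that scores are pushed exactly two levels up with the second level multiplied by the decay factor are all illustrative.

```python
def propagate(parent_of, paragraph_scores, decay, levels=2):
    """Add each paragraph's score to its ancestors, up to `levels` levels up.

    The direct parent receives the full score; each further level is
    multiplied by `decay` once more.
    """
    totals = {}
    for node, score in paragraph_scores.items():
        ancestor, weight = parent_of.get(node), 1.0
        for _ in range(levels):
            if ancestor is None:
                break
            totals[ancestor] = totals.get(ancestor, 0.0) + score * weight
            ancestor, weight = parent_of.get(ancestor), weight * decay
    return max(totals, key=totals.get), totals


scores = {"p1": 40, "p2": 60}

# Two DIVs directly under a common parent (the NYT-like case).
shallow = {"p1": "div_a", "p2": "div_b", "div_a": "article", "div_b": "article"}
half_winner, _ = propagate(shallow, scores, decay=0.5)
full_winner, _ = propagate(shallow, scores, decay=1.0)

# Common parent three levels above the paragraphs: it is never scored.
deep = {"p1": "div_a", "p2": "div_b", "div_a": "wrap_a", "div_b": "wrap_b",
        "wrap_a": "article", "wrap_b": "article"}
deep_winner, deep_totals = propagate(deep, scores, decay=1.0)
print(half_winner, full_winner, deep_winner)
```

With a decay of 0.5 the bigger DIV still wins; with 1.0 the common parent collects both halves and wins. In the deeper tree the common parent receives no score at all, whatever the decay.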
Not sure if anyone is watching for updates on this issue but my linked PR has been tested with both URLs here. Happy to hear feedback/suggestions on it 👍🏽