newspaper4k

implementing 3rd method of publish date extraction (issue # 521)

Open AndyTheFactory opened this issue 2 years ago • 0 comments

Issue by zachorban, Wed Apr 11 16:33:54 2018
Originally opened as https://github.com/codelucas/newspaper/pull/549


I tested the original version and this modified version on 120 different news articles across multiple sites. This version successfully parsed 119 of the publish dates, compared to 84 for the previous version. The average runtime increase is marginal relative to the gain in accuracy.

Final statistics from testing:


urls that failed both versions:

https://www.bbcgoodfood.com/howto/guide/top-10-retro-british-desserts

success rates of original/modified versions:

original version successes: 84 / 120
updated version successes: 119 / 120
accuracy improvement percentage: 0.41666666666666674
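
The figure above is a relative improvement expressed as a fraction, not a gain in percentage points. A minimal sketch of the arithmetic, using only the success counts reported above (the variable names are illustrative, not taken from the pull request):

```python
import math

total_urls = 120
original_successes = 84
updated_successes = 119

# Success rate of each version over the full test set.
original_rate = original_successes / total_urls
updated_rate = updated_successes / total_urls

# Relative improvement: extra successes as a fraction of the original successes.
relative_improvement = (updated_successes - original_successes) / original_successes

# Absolute gain in success rate (difference between the two rates).
rate_gain = updated_rate - original_rate

print(f"original success rate: {original_rate:.1%}")
print(f"updated success rate:  {updated_rate:.1%}")
print(f"relative improvement:  {relative_improvement:.1%}")
print(f"gain in success rate:  {rate_gain:.1%}")

# The reported value agrees with this calculation up to floating-point rounding.
assert math.isclose(relative_improvement, 0.41666666666666674)
```

The relative figure (about 42%) compares the two success counts with each other, while the success rate itself rises from 84/120 to 119/120.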

average absolute/relative runtime differences:

| dataset | absolute difference (s) | relative difference |
| --- | --- | --- |
| original | 0.004801957380203973 | 0.008619953698073361 |
| new | 0.01785543305533273 | 0.1452353506990094 |
| fails | 0.0388028621673584 | 0.19184348967941411 |
| total | 0.00889256199200948 | 0.04999297395652421 |
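
As a consistency check on the rows above, the total row can be reproduced as a mean of the three groups weighted by their URL counts. The counts (84, 35, 1) are derived from the success figures reported earlier. That the total is a per-URL mean is an inference from the numbers, not something the issue states. A small sketch:

```python
# (URL count, mean absolute difference in seconds, mean relative difference)
# Counts are derived from the reported successes: 84 original, 119 - 84 = 35 new, 1 fail.
groups = {
    "original": (84, 0.004801957380203973, 0.008619953698073361),
    "new": (35, 0.01785543305533273, 0.1452353506990094),
    "fails": (1, 0.0388028621673584, 0.19184348967941411),
}

reported_total_abs = 0.00889256199200948
reported_total_rel = 0.04999297395652421

total_count = sum(count for count, _, _ in groups.values())

# Count-weighted mean of each column across the three groups.
weighted_abs = sum(count * abs_diff for count, abs_diff, _ in groups.values()) / total_count
weighted_rel = sum(count * rel_diff for count, _, rel_diff in groups.values()) / total_count

print(total_count)
print(weighted_abs, reported_total_abs)
print(weighted_rel, reported_total_rel)
```

Both weighted means agree with the reported total row to within floating-point rounding, which suggests each group row is an average over the URLs in that group.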


The one URL that failed both versions was linked from a link-aggregator news site. Inspecting the page source turned up no evidence of a publish date.

Statistic explanation:

Datasets:
- original: URLs that worked in the original version
- new: URLs that failed the original but passed in the new version
- fails: URLs that failed both versions
- total: all URLs used

Absolute difference:
- runtime differences from executing article.parse()
- calculated using Python's time.time() function
- difference = new runtime - original runtime

Relative difference:
- difference = absolute difference / original runtime
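
The two difference definitions above can be sketched as follows. This is not the script used for the test: `time_call` and `runtime_differences` are hypothetical helper names, and `time.sleep` stands in for the parse call so the example runs without network access.

```python
import time


def time_call(func, arg):
    """Return the wall-clock seconds that func(arg) takes, measured with time.time()."""
    start = time.time()
    func(arg)
    return time.time() - start


def runtime_differences(original_runtime, new_runtime):
    """Return (absolute, relative) differences as defined in the explanation above."""
    absolute = new_runtime - original_runtime
    # Relative difference is undefined when the original runtime is zero.
    relative = absolute / original_runtime if original_runtime > 0 else float("nan")
    return absolute, relative


# Stand-in measurement: time.sleep replaces the real parse call here.
measured = time_call(time.sleep, 0.01)
print(f"stand-in call took about {measured:.3f} s")

# Worked example with made-up runtimes of 0.500 s (original) and 0.525 s (new).
absolute, relative = runtime_differences(0.500, 0.525)
print(absolute, relative)
```

In the setup described above, the timed function would wrap article.parse() for each version, and the per-URL differences would then be averaged within each dataset.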

Summary: The modified version found publish dates for ~42% more articles and, on average, increased the runtime of article.parse() by only ~5%.


zachorban included the following code: https://github.com/codelucas/newspaper/pull/549/commits

AndyTheFactory, Oct 24 '23 12:10