htmldate ignore undateable domains more intentionally

Open rahulbot opened this issue 2 years ago • 7 comments

In our testing, the current code produces unreliable results on Wikipedia articles. Sometimes it returns a date, sometimes it doesn't. Wikipedia articles are constantly updated, so @coreydockser and I would like to propose changing it so that it returns no date if the URL is a wikipedia.org one. In our broader experience with Media Cloud this produces more useful results (for our open web news analysis context).

In terms of implementation, we could just copy the filter_url_for_undateable function from date_guesser and use it as is, which would also bring in the other checks it does for undateable domains. We'd call it early on in guess_date.
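To make the proposal concrete, here is a minimal sketch of the kind of check we have in mind. It is not the date_guesser code: the name `is_undateable_url` and the `UNDATEABLE_DOMAINS` set are hypothetical, and it only covers the wikipedia.org case discussed here.

```python
from urllib.parse import urlparse

# Hypothetical list for illustration; only wikipedia.org is discussed in this issue.
UNDATEABLE_DOMAINS = {"wikipedia.org"}


def is_undateable_url(url: str) -> bool:
    """Return True if the URL's host is an undateable domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in UNDATEABLE_DOMAINS
    )
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives for pages that merely mention wikipedia.org in their path or query.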

Aug 02 '21 15:08 rahulbot

Hi @rahulbot, that would be OK, but I'd prefer to get the chance to tackle the problem first. There is certainly a field in the HTML from which the date can be extracted. Would you mind giving examples of pages where the result wasn't as expected?

Aug 02 '21 17:08 adbar

@coreydockser can you please provide an example of a wikipedia page that does return a publication date, and one that does not?

Aug 11 '21 14:08 rahulbot

Sorry for the delay, I ran into some odd issues of my own making. Anyways, here's a sample of four articles with different results.

https://en.wikipedia.org/wiki/Among_Us – returns None (this is the behavior we want)

https://en.wikipedia.org/wiki/January_1969 – returns 2018-06-19, this date appears as datePublished in the html

https://en.wikipedia.org/wiki/F-scale_(personality_test) - returns 2005-07-05. The datePublished on this page is 2005-07-25, though, so I'm unsure where the returned date came from.

https://en.wikipedia.org/wiki/2021_United_States_Capitol_attack - 2021-01-06, this is the date of the event, but it's also the datePublished.

Aug 23 '21 16:08 coreydockser

@coreydockser Thanks, I'll look at it and see if I can find a solution.

Aug 24 '21 13:08 adbar

Hi @coreydockser, I checked the cases and I don't agree with you at all:

A few results were different (maybe you didn't try the latest version).
Besides, None cannot possibly be the expected behavior since there is information to be found in the page.
Most importantly, htmldate extracts both modified and original dates correctly, which here are the last edit and page creation dates.

So I fail to grasp where the problem lies, could you please be more specific and/or provide further examples for other websites?

Sep 14 '21 11:09 adbar

The library version issue could explain some of those specific results. However, the second piece is more a question of your intentions. In our projects, "publication date" means the date a news article was listed as being published online. That is rooted in ideas from the historical news industry (despite edits and iterations of online stories becoming more commonplace). Wikipedia articles are meant to be living documents, so for us they don't have a "publication date" in that sense. This is important for our time-series based analysis of news attention.

So I guess one way to state the question is this: for this library, do you intend "publication date" to have a technology-informed definition, such as the date of last edit? Or do you want a more "news-ish" definition like the one we use?

It sounds like it is more the former, in which case there are no "undateable" domains. If that is what you intend, then we can close this issue as won't-fix and we can handle the idea of "undateable" domains based on our project definition in our own code before we pass content into htmldate.
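If we end up handling this on our side, the pre-filter could look roughly like the sketch below. Everything here is an assumption for illustration: `date_or_none` and `skip_domains` are hypothetical names, and `extract` stands in for whatever date extractor is used, passed in as a plain callable so the filter stays independent of the library.

```python
from typing import Callable, Optional, Sequence
from urllib.parse import urlparse


def date_or_none(
    url: str,
    html: str,
    extract: Callable[[str], Optional[str]],
    skip_domains: Sequence[str] = ("wikipedia.org",),
) -> Optional[str]:
    """Return None for undateable domains, otherwise defer to the extractor."""
    host = (urlparse(url).hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in skip_domains):
        # Project-level decision: living documents have no publication date.
        return None
    return extract(html)
```

In practice `extract` could be htmldate's `find_date`, which accepts an HTML string, so the library's own technology-informed behavior would stay unchanged.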

Thanks for any clarifications and your great work on this library!

Oct 07 '21 15:10 rahulbot

Thanks for the explanations, I get your point. Indeed, htmldate mostly provides a technology-informed concept of dating. It hopefully coincides with the news-ish definition in most cases, though the two may differ.

I guess it would be possible to focus on a "news-ish" understanding of publication date by setting an additional parameter prior to the extraction. What would be the formal requirements for it to happen?

I'm leaving this thread open to see if we can address the issue.

Oct 15 '21 18:10 adbar
