metascraper
metascraper copied to clipboard
[WIP] Add published and modified dates
With this pull request I propose adding two additional dates to the date package - datePublished
and dateModified
.
Some websites use both dates for their articles. There are use cases when you want to know both if they exist or exactly these two not just a date found in the html.
Thanks for this!
My only concern is to add too many date fields as part of the payload.
I understand date
is too simple, so what do you think if we make this a configurable thing?
require('metascraper-date')({
datePublished: true,
dateModified: true
})
By doing that we can keep the original date behavior, but also make it possible to be more specific in case you want. Also to make this in a progressive way to see the adoption.
I'm thinking into two user cases for that:
- if
{ datePublished: false, dateModified: false }
, thenconst date = dateModified || datePublished
- if
{ datePublished: false }
or{ dateModified: false }
, thendate
is not returned.
waiting for your opinion 🙂
Yeah that's a good idea. Adding these dates as options is good approach and won't change the current behavior. But I was thinking of a simpler approach. What about this.
-
if
{ datePublished: false, dateModified: false }
, thendate
is returned the same way as now. Default behavior. I didn't include all the rules in these two dates. Only these that imply modified or created date. So there will be times, lots of them, whenconst date = dateModified || datePublished
won't return anything. Unless we reorganize the rules of course. -
if
{ datePublished: true} or { dateModified: true}
or bothtrue
, then just return what was requested alongsidedate
. Yeah it will add two more fields (at most) in the payload but it will be deliberate and requested from the users. Also it is not unique to this package only. Others also return more than one. So they can do something like this:const date = metascraper.datePublished || metascraper.dateModified || metascraper.date;
LGTM!
I think it's worth it to sort the rules to avoid repetition and make them easy to read.
Doing that, the default behavior could be easy as add the rules as fallbacks
const date = dateRules() || dateModifiedRules() || datePublishedRules()
Yep, should be cleaner this way. I will work when have more time and update the pull request.
Is this still going to be merged? I need this feature, so if there is still work to be done, let me know so I can do it
@ghmendonca do you want to lead this PR?
I will be happy to merge if after it's updated with the current codebase
@Kikobeats yes, leave it to me, will work on this later today
@ghmendonca feel free to open a new PR picking the changes from here 🙂
Hey guys, @Kikobeats @ghmendonca
This was totally forgotten. I have merged the latest changes I made back in the day. The rules are reordered as we agreed on and config options added. This will lead to breaking changes (the date will be different in certain test cases) as the rules are ordered differently than the previous version. If you want 1:1 backward compatibility we should keep the order for the date
the same as before. Tests should be adjusted and new ones added, docs updated too. But I think this will help to finish the PR faster.
Probably closest behaviour to the old order will be:
date: dateModifiedRules().concat(datePublishedRules(), dateRules())
I think this pretty much wraps up the pull request. The snapshot should be updated as I have problems with my setup to do it. And probably if you want date rules reorganized.
@madwings Thanks for taking care; I can finish it 🙂
@madwings @Heheehd @ghmendonca Do you know any site we can use for creating an integration test using these new rules?
@madwings @Heheehd @ghmendonca Do you know any site we can use for creating an integration test using these new rules?
Some of the domains we are scraping data from (date) too. If you want I can provide you with a lot more or even better - specific articles. https://www.dropbox.com/s/r27v47jr14kxva7/article_domains.txt?dl=0
Thanks a lot! Working to add some sites as part of the integration tests 🙂
@Kikobeats I was testing with this article here -> https://www.mentalfloss.com/article/90967/14-facts-about-wheres-waldo The published date and modified date are very different
@ghmendonca that's actually expected since the new rules set is the latest timestamp vs. first timestamp
For example, in a markup like this:
<meta itemprop="datePublished" content="2016-05-24T09:42:00.000Z">
<meta itemprop="dateModified" content="2016-05-24T21:27:05.000Z">
The date
field will be dateModified
. That is actually okay, but I want to double-check manually all is under this expectation.
Another pretty clear example:
data:image/s3,"s3://crabby-images/b6a50/b6a5097efdfdc9a82cbb7742181939a4cc981798" alt="CleanShot 2023-03-10 at 16 56 28@2x"
The new date
field is the latest modified timestamp rather than the publication date
Coverage: 98.068% (-0.2%) from 98.235% when pulling af1691e8171137c9c043e2e0a9b1d088d9c517a3 on madwings:feature/add-date into ff0e5d8194d0ff9d43e52991611d5fddfc2400b4 on microlinkhq:master.