metascraper icon indicating copy to clipboard operation
metascraper copied to clipboard

[WIP] Add published and modified dates

Open madwings opened this issue 3 years ago • 4 comments

With this pull request I propose adding two additional dates to the date package - datePublished and dateModified. Some websites use both dates for their articles. There are use cases when you want to know both if they exist or exactly these two not just a date found in the html.

madwings avatar Feb 25 '21 20:02 madwings

Thanks for this!

My only concern is to add too many date fields as part of the payload.

I understand date is too simple, so what do you think if we make this a configurable thing?

require('metascraper-date')({
  datePublished: true,
  dateModified: true
})

By doing that we can keep the original date behavior, but also make it possible to be more specific in case you want. Also to make this in a progressive way to see the adoption.

I'm thinking into two user cases for that:

  1. if { datePublished: false, dateModified: false }, then const date = dateModified || datePublished
  2. if { datePublished: false } or { dateModified: false }, then date is not returned.

waiting for your opinion 🙂

Kikobeats avatar Feb 26 '21 08:02 Kikobeats

Yeah that's a good idea. Adding these dates as options is good approach and won't change the current behavior. But I was thinking of a simpler approach. What about this.

  1. if { datePublished: false, dateModified: false }, then date is returned the same way as now. Default behavior. I didn't include all the rules in these two dates. Only these that imply modified or created date. So there will be times, lots of them, when const date = dateModified || datePublished won't return anything. Unless we reorganize the rules of course.

  2. if { datePublished: true} or { dateModified: true} or both true, then just return what was requested alongside date. Yeah it will add two more fields (at most) in the payload but it will be deliberate and requested from the users. Also it is not unique to this package only. Others also return more than one. So they can do something like this: const date = metascraper.datePublished || metascraper.dateModified || metascraper.date;

madwings avatar Feb 26 '21 12:02 madwings

LGTM!

I think it's worth it to sort the rules to avoid repetition and make them easy to read.

Doing that, the default behavior could be easy as add the rules as fallbacks

const date = dateRules() || dateModifiedRules() || datePublishedRules()

Kikobeats avatar Feb 26 '21 16:02 Kikobeats

Yep, should be cleaner this way. I will work when have more time and update the pull request.

madwings avatar Feb 26 '21 17:02 madwings

Is this still going to be merged? I need this feature, so if there is still work to be done, let me know so I can do it

ghmendonca avatar Mar 08 '23 19:03 ghmendonca

@ghmendonca do you want to lead this PR?

I will be happy to merge if after it's updated with the current codebase

Kikobeats avatar Mar 08 '23 20:03 Kikobeats

@Kikobeats yes, leave it to me, will work on this later today

ghmendonca avatar Mar 08 '23 20:03 ghmendonca

@ghmendonca feel free to open a new PR picking the changes from here 🙂

Kikobeats avatar Mar 08 '23 22:03 Kikobeats

Hey guys, @Kikobeats @ghmendonca

This was totally forgotten. I have merged the latest changes I made back in the day. The rules are reordered as we agreed on and config options added. This will lead to breaking changes (the date will be different in certain test cases) as the rules are ordered differently than the previous version. If you want 1:1 backward compatibility we should keep the order for the date the same as before. Tests should be adjusted and new ones added, docs updated too. But I think this will help to finish the PR faster.

Probably closest behaviour to the old order will be:

date: dateModifiedRules().concat(datePublishedRules(), dateRules())

madwings avatar Mar 08 '23 23:03 madwings

I think this pretty much wraps up the pull request. The snapshot should be updated as I have problems with my setup to do it. And probably if you want date rules reorganized.

madwings avatar Mar 09 '23 15:03 madwings

@madwings Thanks for taking care; I can finish it 🙂

Kikobeats avatar Mar 09 '23 16:03 Kikobeats

@madwings @Heheehd @ghmendonca Do you know any site we can use for creating an integration test using these new rules?

Kikobeats avatar Mar 10 '23 09:03 Kikobeats

@madwings @Heheehd @ghmendonca Do you know any site we can use for creating an integration test using these new rules?

Some of the domains we are scraping data from (date) too. If you want I can provide you with a lot more or even better - specific articles. https://www.dropbox.com/s/r27v47jr14kxva7/article_domains.txt?dl=0

madwings avatar Mar 10 '23 12:03 madwings

Thanks a lot! Working to add some sites as part of the integration tests 🙂

Kikobeats avatar Mar 10 '23 12:03 Kikobeats

@Kikobeats I was testing with this article here -> https://www.mentalfloss.com/article/90967/14-facts-about-wheres-waldo The published date and modified date are very different

ghmendonca avatar Mar 10 '23 15:03 ghmendonca

@ghmendonca that's actually expected since the new rules set is the latest timestamp vs. first timestamp

For example, in a markup like this:

<meta itemprop="datePublished" content="2016-05-24T09:42:00.000Z">
<meta itemprop="dateModified" content="2016-05-24T21:27:05.000Z">

The date field will be dateModified. That is actually okay, but I want to double-check manually all is under this expectation.

Kikobeats avatar Mar 10 '23 15:03 Kikobeats

Another pretty clear example:

CleanShot 2023-03-10 at 16 56 28@2x

The new date field is the latest modified timestamp rather than the publication date

Kikobeats avatar Mar 10 '23 15:03 Kikobeats

Coverage Status

Coverage: 98.068% (-0.2%) from 98.235% when pulling af1691e8171137c9c043e2e0a9b1d088d9c517a3 on madwings:feature/add-date into ff0e5d8194d0ff9d43e52991611d5fddfc2400b4 on microlinkhq:master.

coveralls avatar Mar 10 '23 16:03 coveralls

shipped!

Successfully published:
 - [email protected]
 - [email protected]
 - [email protected]

🎉

Kikobeats avatar Mar 11 '23 17:03 Kikobeats