metascraper
[metascraper-date] differentiate between publish and update date
metascraper-date returns the date the article was last updated, not published. Could you add a package that would return the date the article was actually published?
Thanks in advance, Damian
Hello,
Can you explain in more detail how you are using metascraper?
In my experience, very few web pages include (correct) date metadata, which is why the created and updated rules are mixed in the same rule set.
If you need to separate them, you can write your own rule set: https://github.com/microlinkhq/metascraper#write-your-own-rules
I was just going to open the same issue. I think metascraper-date should prefer the published time over the updated time.
For example, the article at https://www.zdnet.com/article/3d-printing-with-light-scientists-create-3d-holograms/ clearly shows "January 25, 2018 -- 12:31 GMT (12:31 GMT)", and that date is also present in the metadata, but metascraper-date returns 2019-01-15T21:45:42.000Z because the metadata also contains an updated time.
Ideally these would have separate fields, but I personally would assume that 'date' is the original publish date, not an update date. (The preference could of course be an argument for the parser.)
Could this issue be re-opened?
Happy to accept a PR adding an argument to determine whether the date can be obtained from the update date rules 🙂
I might implement this at some point. My idea was that there would be an argument preference: ["create", "update", "generic"] which would filter and order the rules. Does this seem appropriate?
Does the first matching rule in the array apply (so I can order the rules based on the preference provided)?
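A rough sketch of what I have in mind, assuming rules are evaluated in order and the first one that returns a value wins. The names `ruleGroups`, `buildRules` and `resolveDate` are hypothetical, and the rules read from a plain object rather than a parsed page, so this is not metascraper's actual API:

```javascript
// Hypothetical rule groups. Each rule takes a plain object standing in
// for a parsed page and returns a date string or undefined.
const ruleGroups = {
  create: [page => page.publishedTime],
  update: [page => page.modifiedTime],
  generic: [page => page.genericDate]
}

// Filter and order the rules according to the preference array.
// Groups left out of the array are dropped; unknown names are ignored.
function buildRules (preference = ['create', 'update', 'generic']) {
  return preference.flatMap(kind => ruleGroups[kind] || [])
}

// Apply the ordered rules and return the first non-empty result.
function resolveDate (page, preference) {
  for (const rule of buildRules(preference)) {
    const value = rule(page)
    if (value) return value
  }
  return undefined
}
```

With `preference: ['create']` only the publish rules would run, so a page that carries just an updated time would yield no date at all instead of a misleading one.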
Just define a way for adding conditional rules that depend on an option argument.
The name of the argument is a thing we can discuss in the PR 🙂
This is how I see we could have both published and updated dates.
module.exports = () => ({
  published: [
    wrap($jsonld('datePublished')),
    wrap($jsonld('dateCreated')),
    wrap($ => $('meta[property*="published_time" i]').attr('content')),
    wrap($ => $('meta[property*="release_date" i]').attr('content')),
    wrap($ => $('meta[name="date" i]').attr('content')),
    wrap($ => $('[itemprop="datepublished" i]').attr('content')),
    wrap($ => $('[itemprop*="date" i]').attr('content')),
    wrap($ => $('time[itemprop*="date" i]').attr('datetime')),
    wrap($ => $('time[datetime]').attr('datetime')),
    wrap($ => $('time[datetime][pubdate]').attr('datetime')),
    wrap($ => $('meta[name*="dc.date" i]').attr('content')),
    wrap($ => $('meta[name*="dc.date.issued" i]').attr('content')),
    wrap($ => $('meta[name*="dc.date.created" i]').attr('content')),
    wrap($ => $('meta[name*="dcterms.date" i]').attr('content')),
    wrap($ => $('[property*="dc:date" i]').attr('content')),
    wrap($ => $('[property*="dc:created" i]').attr('content')),
    wrap($ => $filter($, $('[class*="byline" i]'))),
    wrap($ => $filter($, $('[class*="dateline" i]'))),
    wrap($ => $filter($, $('[id*="metadata" i]'))),
    wrap($ => $filter($, $('[class*="metadata" i]'))), // twitter, move into a bundle of rules
    wrap($ => $filter($, $('[id*="date" i]'))),
    wrap($ => $filter($, $('[class*="date" i]'))),
    wrap($ => $filter($, $('[id*="publish" i]'))),
    wrap($ => $filter($, $('[class*="publish" i]'))),
    wrap($ => $filter($, $('[id*="post-timestamp" i]'))),
    wrap($ => $filter($, $('[class*="post-timestamp" i]'))),
    wrap($ => $filter($, $('[id*="post-meta" i]'))),
    wrap($ => $filter($, $('[class*="post-meta" i]'))),
    wrap($ => $filter($, $('[id*="time" i]'))),
    wrap($ => $filter($, $('[class*="time" i]')))
  ],
  updated: [
    wrap($jsonld('dateModified')),
    wrap($ => $('meta[property*="updated_time" i]').attr('content')),
    wrap($ => $('meta[property*="modified_time" i]').attr('content')),
    wrap($ => $('[itemprop*="datemodified" i]').attr('content'))
  ]
})
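To illustrate how such a split rule set would behave, here is a minimal self-contained sketch. It assumes each field is resolved independently by the first rule that yields a valid date. The `fieldRules` map, the `firstValidDate` and `resolveDates` helpers, and the plain-object page are hypothetical stand-ins, not the real `wrap`/`$` helpers, and the sample values are only modeled on the ZDNet article mentioned above:

```javascript
// Hypothetical rules keyed by output field. Each rule reads from a plain
// metadata map instead of a parsed HTML document.
const fieldRules = {
  published: [
    meta => meta['article:published_time'],
    meta => meta.date
  ],
  updated: [
    meta => meta['article:modified_time']
  ]
}

// Return the first rule result that parses as a valid date, as an ISO string.
function firstValidDate (rules, meta) {
  for (const rule of rules) {
    const value = rule(meta)
    if (!value) continue
    const parsed = new Date(value)
    if (!Number.isNaN(parsed.getTime())) return parsed.toISOString()
  }
  return undefined
}

// Resolve every field on its own, so an updated time never shadows
// the published time.
function resolveDates (meta) {
  const result = {}
  for (const [field, rules] of Object.entries(fieldRules)) {
    result[field] = firstValidDate(rules, meta)
  }
  return result
}

// Illustrative input only.
const zdnetLikeMeta = {
  'article:published_time': '2018-01-25T12:31:00.000Z',
  'article:modified_time': '2019-01-15T21:45:42.000Z'
}
const dates = resolveDates(zdnetLikeMeta)
```

Because each field has its own rule list, a page that exposes both timestamps gets both values back, and a page that exposes only one leaves the other field undefined.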
This is now addressed! Just use metascraper-date v5.34.0 or above:
const date = require('metascraper-date')({ datePublished: true, dateModified: true })