date_published incorrectly uses current date when <abbr class="published"> contains valid datetime

Open julia2404 opened this issue 7 months ago • 0 comments

Bug: Incorrect date_published when parsing valid <abbr class="published">

Description:

When parsing this page: 👉 https://www.progressive-charlestown.com/2011/04/peeps-wrap-up-for-2011.html

The parser returns: "date_published": "2025-05-15T00:23:00.000Z"

However, the HTML clearly includes: <abbr class='published' title='2011-04-25T00:23:00-04:00'>12:23:00 AM</abbr>

This means the correct UTC datetime would be: "date_published": "2011-04-25T04:23:00.000Z"
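For reference, the UTC conversion can be checked directly in Node.js. This is only a quick sanity check using the built-in Date object, not anything from the parser itself:

```javascript
// The title attribute value is ISO 8601 with a -04:00 offset,
// so converting to UTC adds four hours to the local time.
const published = new Date("2011-04-25T00:23:00-04:00");
const utc = published.toISOString();

console.log(utc); // "2011-04-25T04:23:00.000Z"
```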

It seems the parser extracts the time from <abbr class="published"> but incorrectly replaces the date with the current system date.

Expected behavior

The parser should correctly parse both date and time from the title attribute in <abbr class="published">, not just the time part.
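A minimal sketch of the behavior I would expect, assuming the fix is to prefer the title attribute over the visible text. The function name extractPublishedDate is hypothetical and the regex-based matching is only for illustration; I have not checked how the parser actually selects or parses this element:

```javascript
// Hypothetical helper, NOT the parser's actual code: prefer the
// machine-readable title attribute of <abbr class="published">
// over its visible text, which here contains only a time of day.
function extractPublishedDate(html) {
  // Capture the attributes and inner text of the first <abbr> element.
  const match = html.match(/<abbr\b([^>]*)>([^<]*)<\/abbr>/i);
  if (!match) return null;
  const [, attrs, text] = match;

  // Only accept elements whose class list contains "published".
  if (!/class\s*=\s*['"][^'"]*\bpublished\b[^'"]*['"]/i.test(attrs)) return null;

  // Use the title attribute when present; fall back to the text otherwise.
  const title = attrs.match(/title\s*=\s*['"]([^'"]+)['"]/i);
  const candidate = title ? title[1] : text.trim();

  // Return null rather than guessing a date when the value cannot be parsed.
  const parsed = new Date(candidate);
  return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
}

const html =
  "<abbr class='published' title='2011-04-25T00:23:00-04:00'>12:23:00 AM</abbr>";
console.log(extractPublishedDate(html)); // "2011-04-25T04:23:00.000Z"
```

The point is that the full timestamp is already available in the title attribute, so there should be no need to combine the visible time with any other date.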

Steps to reproduce

Use the latest version of the parser (npm or hosted) and parse the provided URL.

Environment:

Parser version: latest (GitHub)

Runtime: Node.js

Used via: Node script / Web API

julia2404 · May 15 '25 11:05