Time zone difference causes tests to fail

Open • Buratinator opened this issue 5 years ago • 1 comment

  • Platform: Windows 10 Home x64
  • Mercury Parser Version: 2.2.0
  • Node Version (if a Node bug): v10.15.3
  • Browser Version (if a browser bug): Firefox 75.0 (64-bit)

Expected Behavior

I was expecting that date/time would be converted into the same value regardless of where the test is run.

Current Behavior

In local tests, the extracted date is interpreted in my local time zone and then converted to UTC. I'm in UTC+3, so 3 hours are subtracted from the date.

In the automated tests that run when I submit a PR, that same date is treated as a true UTC+0 date.

Thus, this HTML:

    <div itemprop="datePublished" class="publication-date">
      <span class="publication-day">Apr 6</span>
      <span class="publication-year">2020</span>
    </div>

...with this date extractor:

  date_published: {
    selectors: [
      // enter selectors
      ".publication-date[itemprop='datePublished']",
    ],
  },

generates this on my local machine: 2020-04-05T21:00:00.000Z

But the Postlight CircleCI automated tests that run at PR submission generate 2020-04-06T00:00:00.000Z

(https://circleci.com/gh/postlight/mercury-parser/4216?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link)
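The 3-hour gap between the two results can be reproduced in plain JavaScript. This is only a sketch of the arithmetic, not Mercury Parser's own code: `midnightAtOffset` is a hypothetical helper that treats a date with no time zone information as midnight at a fixed UTC offset and returns the resulting UTC timestamp.

```javascript
// Hypothetical helper (not part of Mercury Parser): interpret a calendar date
// as midnight at a fixed UTC offset and return the equivalent UTC timestamp.
function midnightAtOffset(year, month, day, offsetHours) {
  // Date.UTC expects a zero-based month, so April (4) becomes 3.
  const utcMidnightMs = Date.UTC(year, month - 1, day);
  // Midnight at UTC+3 happens 3 hours before midnight UTC.
  const instantMs = utcMidnightMs - offsetHours * 60 * 60 * 1000;
  return new Date(instantMs).toISOString();
}

// "Apr 6 2020" read on a machine in UTC+3 (the local result above)
const localResult = midnightAtOffset(2020, 4, 6, 3);
console.log(localResult); // 2020-04-05T21:00:00.000Z

// The same text read on a machine in UTC+0 (the CI result above)
const ciResult = midnightAtOffset(2020, 4, 6, 0);
console.log(ciResult); // 2020-04-06T00:00:00.000Z
```

Both outputs describe "midnight on Apr 6", just in different zones, which is why the same extractor cannot satisfy a single expected value on both machines.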

Being new to all this, I don't see a way to pass tests both on my local computer and upon PR submission. I don't want to have to fudge this. Is there a setting or environment variable to ensure uniform treatment of dates?

Steps to Reproduce

The page is at http://med.stanford.edu/news/all-news/2020/04/smart-toilet-monitors-for-signs-of-disease.html (I do apologize for the topic of that article lol).

Buratinator • Apr 14 '20 20:04

I ran into the same problem and found the following solution:

You can supply a timezone option to the date_published field in your extractor. In my case it looks like this:

  date_published: {
    selectors: [
      '.content__meta__date'
    ],
    timezone: 'Europe/Berlin'
  },

This should make the extractor return the same fixed result both on your machine and in the CI environment.

Originally I was hoping I could inject this option during the test somehow so that I could restrict it as a workaround for the test only. But then I realised that it actually makes sense to keep it in the extractor in my case: the extractor is for a German website, so the dates shown on the articles on the website should be interpreted according to Germany's timezone.

Now this might not make sense for all websites, especially not for international ones, and it's a bit annoying having to supply it in all extractors. So having a way to set a fixed timezone in tests (or for the test setup to do this for us) would still be nicer. Alternatively (or additionally) I'd like to see the documentation for custom extractors updated to describe the timezone and format options of the date_published extractor and potentially warn about such problems in tests.
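A possible way to get a fixed time zone in tests only, sketched under assumptions: the tests run in a Node version that honors the `TZ` environment variable, and the variable is set before any test creates a `Date`. This has not been checked against Mercury Parser's own test setup, and behavior may differ on Windows or on older Node versions. The setup file itself is hypothetical.

```javascript
// Hypothetical test setup file; it must run before any test creates a Date.
// Assumption: the Node runtime honors the TZ environment variable.
process.env.TZ = 'UTC';

// With the zone pinned, a date built from local-time components should land
// on UTC midnight regardless of the machine's own time zone setting.
const localMidnight = new Date(2020, 3, 6); // months are zero-based: 3 = April
console.log(localMidnight.toISOString());
```

Unlike the `timezone` option, this changes nothing in the extractor itself, so it only makes test results uniform across machines; it does not decide which zone a site's dates really belong to.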

Shepard • Jul 28 '22 13:07