
Normalize Date Formats?

Open jonathansampson opened this issue 2 years ago • 4 comments

Description

I'm noticing quite a few different date formats in the returned data. For example, YYYY.MM.DD, YYYYMMDD, YYYY-MM-DD, YYYY-MM-DDTHH:MM:SSZ, and more. Would it be worth considering a date-normalization step, aiming to deliver a single format? Or, is there a reason why somebody might wish to preserve the original structure?

I took a quick look at several thousand records' Date Created, and these were the formats I saw (note: every digit was replaced by 0, full month name by MMMM, short (3-letter) month name by MMM, full day name by DDDD, and short day (3-letter) names by DDD). The list is sorted by frequency of appearance, most common being at the top.

0.7625 — 0000-00-00T00:00:00Z 
0.0916 — 0000-00-00T00:00:00.00Z 
0.0233 — 0000-00-00T00:00:00Z 0000-00-00T00:00:00Z
0.0197 — 0000-00-00T00:00:00.0Z
0.0155 — 0000-00-00T00:00:00
0.0141 — 0000-00-00T00:00:00.000Z
0.0120 — 0000-00-00 00:00:00
0.0097 — 00-MMM-0000
0.0096 — 0000.00.00 00:00:00
0.0057 — 00000000 #00000000 00000000
0.0052 — 0000-00-00
0.0047 — 0000/00/00
0.0029 — DDD MMM 00 0000
0.0027 — 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00
0.0021 — 0000-00-00 00:00:00 0000-00-00 00:00:00
0.0016 — 0000-00-00T00:00:00+0000
0.0015 — 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00.000000
0.0015 — 00000000 #0000000 00000000
0.0014 — 00000000 #00000000 00000000 00000000
0.0013 — 00000000 #0000000 00000000 00000000
0.0013 — 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00
0.0013 — 0000-00-00 00:00:00 0000-00-00 00:00:00 0000-00-00 00:00:00
0.0009 — 00-MMMM-0000
0.0008 — 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00
0.0008 — 0000-00-00 00:00:00 CLST
0.0006 — DDD MMM 0 0000
0.0005 — 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00
0.0005 — 0000-00-00 00:00:00 +00:00
0.0005 — 0000-00-00 00:00:00.000000 0000-00-00 00:00:00.000000 0000-00-00 00:00:00.000000
0.0004 — 0000-00-00 00:00:00.000000 0000-00-00 00:00:00 0000-00-00 00:00:00.000000
0.0004 — DDD MMMM 00 0000
0.0004 — 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00
0.0003 — 0000-00-00 0000-00-00 0000-00-00
0.0003 — 00-MMM-0000 00:00:00
0.0003 — MMMM 00 0000 MMMM 00 0000 MMMM 00 0000
0.0002 — 0000-00-00T00:00:00+00:00
0.0002 — 00.00.0000 00:00:00
0.0002 — 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00 00.00.0000 00:00:00
0.0002 — 00/00/0000 00:00:00
0.0002 — 00000000 #000000 00000000 00000000
0.0002 — 00000000 #000000 00000000
0.0002 — 00000000 #00000 00000000 00000000
0.0001 — 0000-00-00T00:00:00.000-00:00
0.0001 — 00/00/0000
0.0001 — MMMM  0 0000 MMMM 00 0000 MMMM 00 0000
0.0001 — 0000-00-00 00:00:00+00 
0.0001 — 0000-00-00 00:00:00 0000-00-00 00:00:00.000000 0000-00-00 00:00:00.000000
0.0001 — 0000-00-00T00:00:00.000000Z 0000-00-00T00:00:00Z
0.0001 — 0000-00-00T00:00:00-0000
0.0001 — 0000-00-00 00:00:00.000
0.0001 — 0000-00-00T00:00:00+0000Z
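
For anyone who wants to reproduce a breakdown like the one above, here is a minimal sketch of the masking step as described (digits become 0, month and day names become placeholders). The names `maskDate` and `formatFrequencies` are hypothetical and not part of whoiser, and only English month and day names are assumed.

```javascript
const MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
  'August', 'September', 'October', 'November', 'December'];
const DAYS = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'];

// Build a case-insensitive, whole-word matcher for a list of names
const wordMatcher = (names) => new RegExp(`\\b(?:${names.join('|')})\\b`, 'gi');

const FULL_MONTH = wordMatcher(MONTHS);
const FULL_DAY = wordMatcher(DAYS);
const SHORT_MONTH = wordMatcher(MONTHS.map((name) => name.slice(0, 3)));
const SHORT_DAY = wordMatcher(DAYS.map((name) => name.slice(0, 3)));

// Reduce a date string to its "shape". Full names are replaced before short
// ones; note that "May" is ambiguous and is treated as a full month name here.
function maskDate(value) {
  return value
    .replace(FULL_MONTH, 'MMMM')
    .replace(FULL_DAY, 'DDDD')
    .replace(SHORT_MONTH, 'MMM')
    .replace(SHORT_DAY, 'DDD')
    .replace(/\d/g, '0');
}

// Count how often each shape occurs, returning [mask, share] pairs sorted by
// share, most common first
function formatFrequencies(values) {
  const counts = new Map();
  for (const value of values) {
    const mask = maskDate(value);
    counts.set(mask, (counts.get(mask) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([mask, count]) => [mask, count / values.length])
    .sort((a, b) => b[1] - a[1]);
}
```

Calling `formatFrequencies` with the `Date Created` values collected from a batch of lookups would give a list in the same shape as the one above.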

If there's an interest here, I might be able to contribute a PR. I would appreciate any pointers to previous commits which might offer guidance for how best to integrate a feature like this. Thanks!

jonathansampson avatar Feb 01 '23 14:02 jonathansampson

hey @jonathansampson

Date normalization is something that should/will be added eventually. I started the library with the intention of adding it, but along the way I saw that there are a few other steps to complete first:

  • parse WHOIS from more/all TLDs; I saw your previous issue, and it is probably part of this step 😄
  • normalize data labels; some are done here https://github.com/LayeredStudio/whoiser/blob/master/src/parsers.js#L178-L211 and more still need to be discovered and added, e.g. for .edu https://dmns.app/midlandstech.edu/whois
  • detect which date formats, and possibly timezones, are used by different TLDs, then convert them to a standard format

I'm still working on the first 2 steps, so I haven't even looked at date formats yet. If you want to tackle this, that would be great.
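
As a rough illustration of the third step (detect the format, then convert to one standard), here is a minimal sketch that maps a few of the shapes listed earlier to ISO 8601. It is not whoiser's implementation: `normalizeDate` and `RULES` are hypothetical names, values without a timezone are assumed to be UTC, `00.00.0000` is assumed to be day-first, and fractional seconds and explicit offsets such as `+00:00` are deliberately left out.

```javascript
const MONTH_NUMBER = { jan: 1, feb: 2, mar: 3, apr: 4, may: 5, jun: 6,
  jul: 7, aug: 8, sep: 9, oct: 10, nov: 11, dec: 12 };

// Each rule pairs a pattern with a function that returns
// [year, month (1-12), day, hour, minute, second]; missing parts are undefined
const RULES = [
  { // 2023-02-01, 2023.02.01 or 2023/02/01, optionally with a time and "Z"
    pattern: /^(\d{4})[-./](\d{2})[-./](\d{2})(?:[T ](\d{2}):(\d{2}):(\d{2})(?:\.\d+)?Z?)?$/,
    toParts: (m) => [m[1], m[2], m[3], m[4], m[5], m[6]],
  },
  { // 01.02.2023, optionally with a time (assumed day-first)
    pattern: /^(\d{2})\.(\d{2})\.(\d{4})(?: (\d{2}):(\d{2}):(\d{2}))?$/,
    toParts: (m) => [m[3], m[2], m[1], m[4], m[5], m[6]],
  },
  { // 01-Feb-2023, optionally with a time
    pattern: /^(\d{2})-([A-Za-z]{3})-(\d{4})(?: (\d{2}):(\d{2}):(\d{2}))?$/,
    toParts: (m) => [m[3], MONTH_NUMBER[m[2].toLowerCase()], m[1], m[4], m[5], m[6]],
  },
];

// Returns an ISO 8601 string in UTC, or null when no rule matches or the
// matched numbers do not form a real calendar date
function normalizeDate(raw) {
  const value = raw.trim();
  for (const { pattern, toParts } of RULES) {
    const match = pattern.exec(value);
    if (!match) continue;
    const [year, month, day, hour, minute, second] =
      toParts(match).map((part) => (part === undefined ? 0 : Number(part)));
    const date = new Date(Date.UTC(year, month - 1, day, hour, minute, second));
    // Reject NaN results and rolled-over values such as month 13 or day 32
    const isReal = date.getUTCFullYear() === year &&
      date.getUTCMonth() === month - 1 &&
      date.getUTCDate() === day;
    return isReal ? date.toISOString() : null;
  }
  return null;
}
```

Returning `null` for anything unrecognized, rather than guessing, keeps unparsed or multi-valued strings visible so they can be fixed at the parser level.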

This library is used for https://dmns.app, and if you check domains there you'll see that dates are not always displayed correctly. Dates are handled by simply calling new Date(domain['Date Created']), so it fails sometimes.
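
To make that failure mode concrete: `new Date()` does not throw on input it cannot parse, it returns an Invalid Date, and the parsing of non-ISO strings is implementation-defined in ECMAScript. A small guard (the helper name `toDateOrNull` is hypothetical) at least makes the failure explicit:

```javascript
// Wrap new Date() so that unparseable input yields null instead of an
// Invalid Date object that silently propagates into the UI
function toDateOrNull(value) {
  const date = new Date(value);
  return Number.isNaN(date.getTime()) ? null : date;
}

// Example with a made-up record shaped like the library's output
const domain = { 'Date Created': '2023-02-01T14:02:00Z' };
const created = toDateOrNull(domain['Date Created']);
console.log(created ? created.toISOString() : 'unknown');
```

This only detects failures; it does not make the non-standard formats parseable.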

Also other things I noticed:

  • dates have a common/good format for most gTLDs (.com, .app, .link, etc)
  • dates have different formats for ccTLDs (.it, .fr, .ly, etc) and these are the sources for DDD MMM 00 0000, 0000-00-00 00:00:00 CLST etc
  • the weirder dates, such as 0000-00-00 00:00:00.000000 0000-00-00 00:00:00 0000-00-00 00:00:00.000000, come from WHOIS data that is not properly parsed

AndreiIgna avatar Feb 01 '23 14:02 AndreiIgna

Thank you for the detailed response. Permit me a possibly silly question: is it safe to assume a TLD will only ever have a single structure style? For instance, I'm noticing the library doesn't presently parse .it results accurately (e.g. it blends the created property for the Registrant in with the created property for the domain; the registrant properties are preceded by white-space and follow the Registrant line). I suspect creating a parser for this style would be fairly simple, but I wondered if the parser would apply to all .it domains, or if some endpoints may return a different document structure.
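
A minimal sketch of the idea, assuming the layout described above (a bare `Registrant` line followed by indented `Key: value` lines). The sample text is made up for illustration, and `parseSections` is a hypothetical helper, not whoiser's parser.

```javascript
// Made-up sample in the described layout, not a real WHOIS response
const SAMPLE = [
  'Domain:             example.it',
  'Created:            2001-02-03 00:00:00',
  '',
  'Registrant',
  '  Organization:     Example S.r.l.',
  '  Created:          2010-05-06 12:34:56',
].join('\n');

// Keep indented properties under the section header that precedes them,
// so a nested "Created" never overwrites the top-level one
function parseSections(text) {
  const result = {};
  let section = null; // object for the current section, or null at top level
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === '') {
      section = null; // a blank line ends the current section
      continue;
    }
    const indented = /^\s/.test(line);
    const colon = line.indexOf(':');
    if (!indented) {
      if (colon === -1) {
        // A bare, unindented line such as "Registrant" opens a section
        section = result[line.trim()] = {};
        continue;
      }
      section = null; // an unindented "Key: value" belongs to the top level
    }
    if (colon === -1) continue; // indented lines without a key are skipped here
    const key = line.slice(0, colon).trim();
    const value = line.slice(colon + 1).trim();
    (indented && section ? section : result)[key] = value;
  }
  return result;
}

const parsed = parseSections(SAMPLE);
console.log(parsed.Created, '|', parsed.Registrant.Created);
```

Splitting on the first colon only keeps time values such as `00:00:00` intact.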

jonathansampson avatar Feb 01 '23 16:02 jonathansampson

Issued a PR to address this issue in part: https://github.com/LayeredStudio/whoiser/pull/92

jonathansampson avatar Feb 01 '23 19:02 jonathansampson

From what I've seen so far, a big chunk of TLDs return their data in only a single structure and format. There is a standard format that is shared by .com, .net and many other gTLDs.

The problem is with old TLDs, and most notably with country TLDs. After a parser is added, like for .it (👍 thanks), we can assume the data will stay in that format.

AndreiIgna avatar Feb 04 '23 10:02 AndreiIgna