dateparse icon indicating copy to clipboard operation
dateparse copied to clipboard

Comprehensive validation 🔎, 30+ fixes integrated/added 🔨🐛, optimized performance 🚀

Open klondikedragon opened this issue 1 year ago • 7 comments

This package is amazing and hugely popular, and has been the best package for automatic date parsing in go for years! ⭐

Thanks @araddon for crafting this package with love over the years!!

I've been using this while developing a new cloud-based log aggregation/search/visualization product, and I've found that there are three major opportunities for improvement for my particular use case:

  • The package does not strictly validate its input, leading to many false positives. This is OK if you know for sure the input matches one of the known formats, but cannot be trusted if the input could be anything and you only want a returned date/time if it definitely matches a known format.
  • While still being far more efficient than the "shotgun" parsing approach, the package currently allocates a relatively large amount of memory (several times the average input size), which can add up when parsing megabytes of date strings per second in a high-throughput microservice. It can also be relatively slow when parsing a string that doesn't match a known format and can allocate even more memory in this case, due to custom error messages that include contextual details.
  • There are a lot of unmerged one-off contributions that haven't been merged and need to be made coherent with each other.

This PR addresses all 3 opportunities:

  • Validation: it now comprehensively validates the input in the following ways: at each point in the state machine, it makes sure there are cases for all possibilities, and any invalid possibility will fail; additionally, it makes sure that the entire format string was specifically set (excluding any trailing punctuation, which can be safely ignored). False positives should be extremely rare now (hard to prove they don't exist).
  • Memory Efficiency: bytes allocated have been reduced by 90%, and parsing some formats are zero allocation. The SimpleErrorMessages option was added (off by default) that greatly speeds up the case where a string does not match a known format -- with the option on, this case is now 4x faster and produces almost no allocations.
  • Merge & Integrate Community Fixes: Fixes for nearly all of the pending issues from the community (including open pull requests) have been incorporated and adapted.

In the process of going through the state machine comprehensively for validation, redundant code/states were merged, and support was added for certain edge cases (for example, some date formats did not support being followed by times).

The example and README.md were updated to incorporate all of the newly supported formats and edge cases. More details on how to properly interpret returned location information with respect to abbreviated timezones was added.

BREAKING -- the package now requires go >= 1.20 to support memory optimizations converting from []byte to string in key places.

A huge thanks to all who posted issues and contributed PRs -- while the PRs were unable to be merged directly because the validation changes were so major, the ideas of all these contributions and the associated test cases were incorporated. Here's credit for all of the issues fixes and contributions in this PR as well as a summary of additional fixes added:

  • This PR started directly building from https://github.com/araddon/dateparse/pull/151 by @arran4 that fixes https://github.com/araddon/dateparse/issues/78 (support for GMT+offset, support additional whitespace in certain places, and other changes)
  • Fix https://github.com/araddon/dateparse/issues/145 - words falsely detected as a valid format - fixed by comprehensive validation
  • Fix https://github.com/araddon/dateparse/issues/108 - "1.jpg" incorrectly parsed as valid - fixed by comprehensive validation
  • Fix https://github.com/araddon/dateparse/issues/98 - names and addresses incorrectly parsed as valid - fixed by comprehensive validation
  • Fix https://github.com/araddon/dateparse/issues/150 - fix parsing "am" as "pm"
  • Fix https://github.com/araddon/dateparse/issues/157 - fix parsing of Thu Jan 28 2021 15:28:21 GMT+0000 format
  • Fix https://github.com/araddon/dateparse/issues/137 - variant of ':' as separator for fraction seconds - adapted https://github.com/araddon/dateparse/pull/138 by @dferstay
  • Fix https://github.com/araddon/dateparse/issues/139 - add support for dd-mm-yyyy (digit month) - adapt https://github.com/araddon/dateparse/pull/140 by @dferstay (also fixes https://github.com/araddon/dateparse/issues/155)
  • Fix https://github.com/araddon/dateparse/issues/141 - add support for yyyy mon dd format - adapt https://github.com/araddon/dateparse/pull/142 by @dferstay
  • Fix https://github.com/araddon/dateparse/issues/143 - add support for yyyymmddhhmmss.SSS format - adapt https://github.com/araddon/dateparse/pull/144 by @dferstay
  • Fix https://github.com/araddon/dateparse/issues/130 - fix timezone parsing after fractional seconds - adapted https://github.com/araddon/dateparse/pull/131 by @zifengyu
  • Fix https://github.com/araddon/dateparse/issues/123 - fix parsing of fractional seconds for certain formats (e.g., 2017-04-03 22:32:14.322 CET)
  • Fix https://github.com/araddon/dateparse/issues/109 - fix parsing of Sun, 07 Jun 2020 00:00:00 +0100 format
  • Fix bug in comment https://github.com/araddon/dateparse/issues/100#issuecomment-1118868154 - fix parsing of 1 April 2022 23:59 format (time after certain date formats)
  • Fix https://github.com/araddon/dateparse/issues/129 - fix ambiguous and PreferMonthFirst parsing for format mm.dd.yyyy (time) - adapted https://github.com/araddon/dateparse/pull/133 by @mehanizm (also fixes https://github.com/araddon/dateparse/issues/91 and https://github.com/araddon/dateparse/issues/28)
  • Fix https://github.com/araddon/dateparse/issues/149 - support PMDT and AMT time zones and validate that AM/PM indicators only appear at most once
  • Fix https://github.com/araddon/dateparse/issues/127 - add support for dd[th,nd,st,rd] Month yyyy format - adapt https://github.com/araddon/dateparse/pull/128 by @krhubert
  • Fix https://github.com/araddon/dateparse/issues/158 - fix parsing for format (time) UTC[+-]NNNN
  • Add support for mm/dd/yyyy, hh:mm:ss format - adapt https://github.com/araddon/dateparse/pull/156 by @BrianLeishman
  • Add support for yyyy.mm.dd (time) format - adapt https://github.com/araddon/dateparse/pull/134 by @jmdacruz, and add cases expected to fail to TestParseErrors unit test
  • Add support for git log format (e.g., Thu Apr 7 15:13:13 2005 -0700) - adapt commit https://github.com/araddon/dateparse/pull/92/commits/99d9682a1cbe7a14975b5b71704af8847c2684f9 from https://github.com/araddon/dateparse/pull/92 by @jiangxin (merge timeWsYearOffset case and add validation)
  • Add support for RabbitMQ log format (e.g., dd-mon-yyyy::hh:mm:ss) - adapt https://github.com/araddon/dateparse/pull/122 by @bizy01
  • Expand support for Chinese date formats, and allow times to follow - adapt https://github.com/araddon/dateparse/pull/132 by @xwjdsh
  • Add support for mon/dd/yyyy format, e.g., Oct/31/1970
  • Add support for dd-month-year format
  • Extend format yyyy-mon-dd to allow times to follow it. Also allow full month name instead of just abbreviated.
  • Fix the case for ambiguous date/time in the mm:dd:yyyy format
  • Allow full day name before month (e.g., Monday January 4th, 2017)
  • Additional fixes for mm.dd.yyyy (time) format
  • Fix ambiguous parsing for mm/dd formats that start with a weekday

Also adds tests to verify that the following stay fixed:

  • https://github.com/araddon/dateparse/issues/94 - mysql log format such as 190910 11:51:49
  • https://github.com/araddon/dateparse/issues/117 - fractional seconds after ':'

klondikedragon avatar Dec 21 '23 06:12 klondikedragon

Great work @klondikedragon

arran4 avatar Dec 21 '23 23:12 arran4

this is great work @klondikedragon! Now, this repo hasn't seen much movement in years, do you think we should start using a fork? should we use yours?

jmdacruz avatar Dec 23 '23 21:12 jmdacruz

I would vote that we use his, we should see if it qualifies for https://github.com/avelino/awesome-go

arran4 avatar Dec 24 '23 00:12 arran4

In some further testing, I found that weekday prefixes only worked for some date formats, but not for others. So that is fixed now. As a side effect (benefit?), leading whitespace is now allowed/ignored.

Let's see if @araddon has feedback and/or is interested in merging this PR (it's a pretty big change and changes the philosophy a bit to have validation, and also makes the code a little more complex in favor of performance). The changes are large enough now it could break backwards compatibility, so in the very least it should deserve a new major version IMO.

Although I don't want to fork something lightly, since we haven't heard any feedback from @araddon for a few years, it could definitely make sense to go ahead. If there is no comment after the holidays, I think it would make sense to go ahead and fork. This package is a key part of freeform date parsing in the "automatic structured field extraction" logic being built for IT Lightning (this new cloud-based log management platform I'm building). Given that's the case, the IT Lightning org would be willing to maintain the forked repo and work with community issues/contributions, since we're motivated to have best-in-class date recognition & parsing in the log ingestion pipeline. The license would remain the same of course.

The community contributions would help us improve our date parsing, we'd be motivated to put energy into it to keep our date parsing bug-free and comprehensive, and community use of the package might help us get a little exposure to devs/SREs who might become interested in our log management solution. So it should benefit everyone.

All feedback is welcome. What do ya'll think of this proposed plan?

klondikedragon avatar Dec 24 '23 03:12 klondikedragon

great work @klondikedragon . How can i start using this?

elliot40404 avatar Jan 03 '24 20:01 elliot40404

I'll go ahead and fork this package. I'm renaming the main branch as part of that.

klondikedragon avatar Jan 09 '24 02:01 klondikedragon

The fork is complete and published as v0.1.0 -- again, a huge thanks to @araddon for authoring and maintaining this package for so many years!

The fork is available using go get github.com/itlightning/dateparse -- issues and PRs are welcome.

@elliot40404 @arran4 @jmdacruz -- see what you think and how this updated package works! If this looks good and after incorporating feedback, I think I'll publish a v1.0.0 at some point soon. I'm also curious to get feedback on my log management project too, check out the site/discord if you're interested. Thanks!

klondikedragon avatar Jan 09 '24 06:01 klondikedragon