Improve parsing of natural language strings with punctuation

Open • systemcatch opened this issue 4 years ago • 1 comment

I don't think an arbitrary number would work in my implementation, but I can increase it to any finite number you want (say 2 or 3 punctuation marks). I think one is fine; more than two is probably overkill.

Edit: Maybe allow 3 or 4 due to the use of "...", although I'm not sure how often people use those after dates. I could see people using a date like this: He said, "The date is 1/2/13." So maybe increasing the constraint is actually a good idea, and I can increase it infinitely following the date, just not preceding it.

Originally posted by @andrewchouman in https://github.com/crsmithdev/arrow/pull/720

===========================================================

I tend to agree, but the only thing that concerns me is that this worked pre 0.15.0 (I chose 0.13.0 for example):

venv ❯ python3
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arrow
>>> arrow.__version__
'0.13.0'
>>> arrow.get("This date has too many punctuation marks following it 11.11.2011", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>
>>> arrow.get("This date has too many punctuation marks following it (11.11.2011)", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>
>>> arrow.get("This date has too many punctuation marks following it (11.11.2011).", "DD.MM.YYYY")
<Arrow [2011-11-11T00:00:00+00:00]>

This is definitely an improvement, but for full pre-0.15.0 behavior while still containing improvements, we probably need to add support for any number of punctuation marks. Curious, why would finite numbers work but not infinite (e.g. with the + quantifier in regex)?
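One plausible explanation, not confirmed anywhere in this thread: if the implementation checks the text before the date with a lookbehind, Python's `re` module only accepts fixed-width lookbehind patterns, while lookaheads may be any length. That would match the earlier remark that the limit can be unbounded after the date but not before it. The sketch below uses hypothetical patterns (`DATE`, `TRAILING_ANY`, `LEADING_ANY`, `LEADING_UP_TO_TWO`), not arrow's actual generated regex.

```python
import re

# Hypothetical pattern for the DD.MM.YYYY format; not arrow's real regex.
DATE = r"\d{2}\.\d{2}\.\d{4}"

# Lookahead: any number of punctuation marks may follow the date.
TRAILING_ANY = DATE + r"(?=[^\w\s]*(?:\s|$))"

# Lookbehind with an unbounded quantifier: rejected by Python's re module.
LEADING_ANY = r"(?<=\s[^\w\s]*)" + DATE

# Workaround: one fixed-width lookbehind per allowed count (0, 1 or 2 marks).
LEADING_UP_TO_TWO = r"(?:(?<=\s)|(?<=\s[^\w\s])|(?<=\s[^\w\s]{2}))" + DATE


def compiles(pattern):
    """Return True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


for name in ("TRAILING_ANY", "LEADING_ANY", "LEADING_UP_TO_TWO"):
    print(name, compiles(globals()[name]))
```

Under that assumption, a finite limit before the date has to be spelled out as separate fixed-width alternatives, which is why raising it makes the regex longer rather than more general.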

Originally posted by @jadchaar in https://github.com/crsmithdev/arrow/pull/720

systemcatch (Dec 03 '19 20:12)

We definitely need to figure out a way to make the regex simpler and more general. It would be nice to allow any number of punctuation marks rather than hardcoding a limit.

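A starting boundary of (?<![\S]) (a negative lookbehind: the date must not be preceded by a non-whitespace character) and an ending boundary of (?![\w]) (a negative lookahead: the date must not be followed by a word character) could be a possibility.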
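To see how those two boundaries behave, here is a minimal sketch using plain `re`. The `DATE` pattern and the `find_date` helper are made up for illustration and are not part of arrow.

```python
import re

# Hypothetical pattern for the DD.MM.YYYY format; not arrow's real regex.
DATE = r"\d{2}\.\d{2}\.\d{4}"

# (?<![\S]): match only at the start of the string or right after whitespace.
# (?![\w]):  match only if the next character is not a word character.
BOUNDED = re.compile(r"(?<![\S])(" + DATE + r")(?![\w])")


def find_date(text):
    """Return the first bounded date substring in text, or None."""
    match = BOUNDED.search(text)
    return match.group(1) if match else None


print(find_date("marks following it 11.11.2011"))
print(find_date("marks following it 11.11.2011..."))
print(find_date("marks following it (11.11.2011)."))
print(find_date("marks following it 11.11.20111"))
```

In this sketch the ending boundary accepts any number of trailing punctuation marks without a hardcoded count, but the starting boundary still rejects a date directly preceded by punctuation such as "(", so the parenthesized examples from the 0.13.0 session would need a looser start condition.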

jadchaar (Mar 03 '20 17:03)