pregex
Create classes Email and Date in pregex.meta.essentials
Hi, are you already working on it? I can try adding the Email and Date classes if you want.
@dylannalex I've actually finished the Email one, but Date is yet to be made. You can have a go at it, though it first needs some thought on the design. I'm thinking of it having a single string parameter "format", through which you define the format of the date that you want to match, e.g. "mm/dd/yyyy". You can find more formats here. The other thing I'm thinking is that we could have Date(*formats), so you can match many formats with one instance. So, instead of having to do:
from pregex import *
date1 = Date("mm/dd/yyyy")
date2 = Date("dd/mm/yyyy")
dates = op.Either(date1, date2)
they can just do:
from pregex import *
dates = Date("mm/dd/yyyy", "dd/mm/yyyy")
If no formats are provided, the default will be to match any date format. What do you think?
I'm gonna create a branch called v2.0.1, as well as another branch based on this issue. You can work on it there.
Hi @manoss96, I've been working on the Date class. Here's what I came up with:
Date formats
I added Date.date_formats, a Date class attribute that contains all valid date formats:
from pregex import *
Date.date_formats
>>> ('mm/dd/yyyy', 'dd/mm/yyyy', 'yyyy/mm/dd')
By default, Date matches any date format in Date.date_formats.
Date arguments
I followed your suggestion and let the user match many formats with a single Date instance.
from pregex import *
text = """
01/11/2001
12/09/1996
1875/11/02
"""
pre1 = Date()
pre1.get_matches(text)
>>> ['01/11/2001', '12/09/1996', '1875/11/02']
pre2 = Date("dd/mm/yyyy")
pre2.get_matches(text)
>>> ['01/11/2001', '12/09/1996']
pre3 = Date("dd/mm/yyyy", "yyyy/mm/dd")
pre3.get_matches(text)
>>> ['01/11/2001', '12/09/1996', '1875/11/02']
Note: Date converts all uppercase characters in a date format to lowercase (e.g. "DD/MM/YYYY" is converted to "dd/mm/yyyy").
Invalid formats
The given formats are compared to the date formats in Date.date_formats. When an invalid format is found, Date raises InvalidArgumentValueException.
pre = Date("dd/mm/yyy")
>>> pregex.core.exceptions.InvalidArgumentValueException: Provided date format "dd/mm/yyy" is not valid.
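The normalization and validation described above could look roughly like the sketch below. It is only an illustration, not the actual pregex code: the names VALID_FORMATS and normalize_formats are made up for this example, and ValueError stands in for InvalidArgumentValueException so that nothing outside the standard library is needed.

```python
# Hypothetical stand-in for the validation described above; not pregex code.
VALID_FORMATS = ("mm/dd/yyyy", "dd/mm/yyyy", "yyyy/mm/dd")


def normalize_formats(*formats: str) -> tuple[str, ...]:
    """Lowercase the given formats and reject any that are not supported."""
    if not formats:
        # No formats given: fall back to every supported format.
        return VALID_FORMATS
    normalized = tuple(f.lower() for f in formats)
    for f in normalized:
        if f not in VALID_FORMATS:
            # pregex raises InvalidArgumentValueException at this point;
            # ValueError keeps the sketch free of third-party imports.
            raise ValueError(f'Provided date format "{f}" is not valid.')
    return normalized
```

Calling normalize_formats("DD/MM/YYYY") returns the lowercase format, while normalize_formats("dd/mm/yyy") raises.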
Let me know your thoughts. I'm up for adding more features or improving any aspect you consider!
I think this should also consider shorthand notations for years, such as 02 for 2002. It might also make sense to add notations for time, something like the strptime function in the datetime module.
For example, you could have D/M/y to match things like 01/03/02, but D/M/Y to match things like 01/03/2002.
I feel like this format makes sense because it's already familiar from other Python libraries and won't be a hassle for users to learn.
@dylannalex Looks great, good job! As for the formats, I suggest that we follow this notation. That way we can have all lowercase while at the same time we can differentiate between 2002 and 02 like @alansun17904 said. For now, I'd say that implementing any valid combination of "d/dd", "m/mm", and "yy/yyyy" along with separators "/" and "-", is good enough. In the future, more formats might follow.
To wrap up, I suggest the following list of formats:
- d/m/yy
- dd/m/yy
- d/mm/yy
- dd/mm/yy
- d/m/yyyy
- dd/m/yyyy
- d/mm/yyyy
- dd/mm/yyyy
- m/d/yy
- mm/d/yy
- m/dd/yy
- mm/dd/yy
- m/d/yyyy
- mm/d/yyyy
- m/dd/yyyy
- mm/dd/yyyy
- yy/m/d
- yyyy/m/d
- yy/mm/d
- yyyy/mm/d
- yy/m/dd
- yyyy/m/dd
- yy/mm/dd
- yyyy/mm/dd
Plus all of the above using the "-" separator, summing to a total of 24 + 24 = 48 different formats.
I don't know about your current implementation, but I suggest having a dictionary of 6 different keys, namely "d", "dd", "m", "mm", "yy" and "yyyy", each mapping to a different pre-defined "Pregex" instance for matching each possible part of the date. Then it's just a matter of combining these instances together, separated by either "/" or "-". How's that sound?
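To make the dictionary idea concrete, here is a minimal sketch that uses the standard re module in place of Pregex instances, so it makes no claims about the pregex API. PARTS and date_pattern are hypothetical names, and the fragments assume that "d" and "m" mean a single digit from 1 to 9.

```python
import re

# Regex fragments standing in for the six pre-defined Pregex instances.
PARTS = {
    "d": r"[1-9]",
    "dd": r"(?:0[1-9]|[12]\d|3[01])",
    "m": r"[1-9]",
    "mm": r"(?:0[1-9]|1[0-2])",
    "yy": r"\d{2}",
    "yyyy": r"\d{4}",
}


def date_pattern(fmt: str) -> re.Pattern:
    """Turn a format such as 'dd/mm/yyyy' or 'yyyy-m-d' into a compiled regex."""
    # Each format uses exactly one of the two separators.
    sep = "/" if "/" in fmt else "-"
    # Look up every token and glue the fragments together with the separator.
    body = re.escape(sep).join(PARTS[token] for token in fmt.split(sep))
    return re.compile(body)
```

With this, date_pattern("dd/mm/yyyy").fullmatch("01/11/2001") succeeds, and an unknown token surfaces as a KeyError.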
Sounds great, @manoss96! Thank you and @alansun17904 for the help!
About what @alansun17904 said, I'd avoid date time values for now, since I consider it would be better to have a Date class for matching only dates and a Time class for matching time values. Once we have these two classes working, implementing a DateTime class should be as easy as merging Date and Time.
Yeah, I agree with @dylannalex. As for the implementation that we discussed, feel free to use other classes from pregex.meta as they might help you. For example, you can use Integer(1, 12) for "m".
About default formats, it is impractical to hardcode all 48 different combinations. What about adding a static method Date.date_formats() to compute all the format combinations? I think itertools.permutations from the standard library would be a great tool for this task. Let me know if I can import this function!
Oh, and one last thing:
you can use Integer(1, 12) for "m"
Do you mean Integer(1, 10)?
About default formats, it is impractical to hardcode all 48 different combinations. What about adding a static method Date.date_formats() to compute all the format combinations? I think itertools.permutations from the standard library would be a great tool for this task. Let me know if I can import this function!
Sure, you can use it. Just make sure that you import it with a different name starting with a "_" so it isn't directly imported every time pregex.meta is imported. Better yet, import it within the "Date" class itself.
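For illustration, one way to compute the 48 formats is sketched below. Note that it swaps in itertools.product for itertools.permutations, since only three of the six possible field orders are wanted. The constant names and the module-level date_formats function are assumptions made for this example, not the code that ended up in pregex.

```python
# Underscore alias: such names are skipped by `from module import *`
# when the module defines no __all__.
from itertools import product as _product

DAYS, MONTHS, YEARS = ("d", "dd"), ("m", "mm"), ("yy", "yyyy")
# The three field orders: day-month-year, month-day-year, year-month-day.
ORDERS = ((DAYS, MONTHS, YEARS), (MONTHS, DAYS, YEARS), (YEARS, MONTHS, DAYS))
SEPARATORS = ("/", "-")


def date_formats() -> list[str]:
    """Return every format string: 2 separators x 3 orders x 8 width choices."""
    return [
        sep.join(fields)
        for sep in SEPARATORS
        for order in ORDERS
        for fields in _product(*order)
    ]
```

The result holds 48 distinct strings, including "dd/mm/yyyy" and "yyyy-m-d" but not orders such as "dd/yyyy/mm".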
Oh, and one last thing:
you can use Integer(1, 12) for "m"
Do you mean Integer(1, 10)?
Yeah, I'm sorry, you're right. I was under the impression that "m" matched "11" and "12" too, and that it only indicated that a single-digit month must not have a leading zero, e.g. "3" would be okay but "03" would not. In that case, I'm guessing that using "Integer" would be overkill, so you can go with something simpler. However, if you find that some class in pregex.meta could help you, don't hesitate to use it!
I've finished the Date class implementation. I've implemented all 48 different combinations dynamically, so adding new date formats should be straightforward.
Features:
- If no format is provided, Date considers all possible formats.
- All provided formats are converted to lowercase (e.g. dD/mM/yyYY is converted to dd/mm/yyyy).
- Raises InvalidArgumentValueException when an invalid format is provided.
I also didn't use itertools.permutations, so no extra import needed!
I'm now working on the documentation, which isn't my strong point. I'd really appreciate some help with it 😄
In a nutshell, the Date class has the following structure:
class Date(_pre.Pregex):
    '''
    Matches any date.
    :param str \*formats: Strings that determine which date formats are considered a match.
        A date can either be dd/mm/yy, mm/dd/yy or yy/mm/dd (separated by '/' or '-'), where:
        yy – two-digit year, e.g. 21
        yyyy – four-digit year, e.g. 2021
        m – one-digit month for months below 10, e.g. 3
        mm – two-digit month, e.g. 03
        d – one-digit day of the month for days below 10, e.g. 2
        dd – two-digit day of the month, e.g. 02
        By default, all date formats are considered.
    :raises InvalidArgumentValueException: Invalid date format provided.
    '''
    __date_separators: tuple[str, str] = ("-", "/")
    __date_value_pre: dict[str, _pre.Pregex] = {
        "d": _cl.AnyDigit() - "0",
        "dd": _op.Either("0" + _cl.AnyDigit(), PositiveInteger(10, 31)),
        "m": _cl.AnyDigit() - "0",
        "mm": _op.Either("0" + _cl.AnyDigit(), PositiveInteger(10, 12)),
        "yy": _cl.AnyDigit() * 2,
        "yyyy": _cl.AnyDigit() * 4,
    }
    def __init__(self, *formats: str):
        '''
        Matches any date.
        :param str \*formats: Strings that determine which date formats are considered a match.
            A date can either be dd/mm/yy, mm/dd/yy or yy/mm/dd (separated by '/' or '-'), where:
            yy – two-digit year, e.g. 21
            yyyy – four-digit year, e.g. 2021
            m – one-digit month for months below 10, e.g. 3
            mm – two-digit month, e.g. 03
            d – one-digit day of the month for days below 10, e.g. 2
            dd – two-digit day of the month, e.g. 02
            By default, all date formats are considered.
        :raises InvalidArgumentValueException: Invalid date format provided.
        '''
    def __date_pre(format: str) -> _pre.Pregex:
        """
        Converts a date format into a ``Pregex`` instance.
        :param str format: The date format to be converted.
        """
    def __date_formats() -> list[str]:
        '''
        Returns a list containing all possible date format combinations.
        '''
Looks good! Don't worry about documentation, I can do this later. A few points:
- Make sure you do (cl.AnyDigit() - "0") in "mm" and "dd" so a match with "00" isn't possible.
- Replace "PositiveInteger" with "Integer" as the former will try to match the sign "+" too.
- Add some tests in "tests/test_meta_essentials.py" if it's easy for you. Nothing crazy, just trying to match some valid/invalid dates. You can copy the testing structure of classes like HttpUrl, IPv4 and IPv6.
After doing these I think that you're good to go, so open a PR whenever you're ready.
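The "00" point is easy to see with plain regular expressions. The sketch below uses the standard re module as a stand-in for the Pregex expressions, and the names loose_dd and strict_dd are made up for the comparison.

```python
import re

# "dd" as first drafted: "0" followed by any digit, or 10-31.
loose_dd = re.compile(r"0\d|[12]\d|3[01]")
# "dd" after the fix: the digit following "0" must be 1-9.
strict_dd = re.compile(r"0[1-9]|[12]\d|3[01]")

for candidate in ("00", "09", "31", "32"):
    # fullmatch requires the whole candidate string to be a valid day.
    print(
        candidate,
        loose_dd.fullmatch(candidate) is not None,
        strict_dd.fullmatch(candidate) is not None,
    )
```

Only the loose pattern accepts "00"; both accept "09" and "31" and reject "32".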
I think this seems great! I can help out with documentation as well if need be.
Thanks for your help, @manoss96!
PR is open. I've added tests and fixed what we discussed. I also ensured date values (i.e. 'd', 'dd', 'm', 'mm', 'yy', 'yyyy') are not enclosed by any other digit:
__date_value_pre: dict[str, _pre.Pregex] = {
    "d": _asr.NotEnclosedBy(_cl.AnyDigit() - "0", _cl.AnyDigit()),
    "dd": _asr.NotEnclosedBy(
        _op.Either("0" + (_cl.AnyDigit() - "0"), Integer(10, 31)),
        _cl.AnyDigit()),
    "m": _asr.NotEnclosedBy(_cl.AnyDigit() - "0", _cl.AnyDigit()),
    "mm": _asr.NotEnclosedBy(
        _op.Either("0" + (_cl.AnyDigit() - "0"), Integer(10, 12)),
        _cl.AnyDigit()),
    "yy": _asr.NotEnclosedBy(_cl.AnyDigit() * 2, _cl.AnyDigit()),
    "yyyy": _asr.NotEnclosedBy(_cl.AnyDigit() * 4, _cl.AnyDigit()),
}
Greetings.
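As a rough illustration of why the "not enclosed by any other digit" guard matters, the same idea can be written in plain re with negative lookarounds. This is a sketch of the technique only; it does not claim to show what NotEnclosedBy compiles to, and bare_yy and guarded_yy are hypothetical names.

```python
import re

# Two digits with no guard: also matches pieces of longer numbers.
bare_yy = re.compile(r"\d{2}")
# Two digits that are neither preceded nor followed by another digit.
guarded_yy = re.compile(r"(?<!\d)\d{2}(?!\d)")

print(bare_yy.findall("2021"))                 # ['20', '21']
print(guarded_yy.findall("2021"))              # []
print(guarded_yy.findall("in 21, not 2021"))   # ['21']
```

Without the guard, a two-digit year pattern would pick fragments out of a four-digit year.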
@dylannalex Played around with the class and it looks great. Good job. Since both "Email" and "Date" are done, I'm closing this issue.