
I'm sorry, but there's loads of issues...

Open RobThree opened this issue 7 years ago • 9 comments

HTML Tags /^<([a-z1-6]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

Famous answer

Hex Value /^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$/

Hex HTML/CSS color value maybe, but 0xDEADBEAF is a perfectly valid hex value.

Password /^[a-zA-Z0-9+_-]{6,32}$/

Slowly we're moving the world to password phrases and everybody should be hashing their passwords. Then why the 32 char limit? And why, for Pete's sake, are we only allowing a-zA-Z0-9+_- and nothing else? *cries* (see also)

Email /^([a-z0-9+_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,24})$/

Yeah. Just. No. Another famous answer

Positive number /^\d*\.?\d+$/

We don't all live in the US/UK. (1,234.56 vs. 1.234,56)

Phonenumber /^\+?[\d\s]{3,}$/

+123 is a valid phone number? Where? Phone numbers are notoriously hard to validate (hence libphonenumber, for example).

Date in format dd/mm/yyyy /^(0?[1-9]|[12][0-9]|3[01])([ \/\-])(0?[1-9]|1[012])\2(19[0-9][0-9]|20[0-9][0-9])$/

Failed the very first 'edge case' I could come up with: 30/02/2016. Years like 1852 or 2150 fail as well (as noted elsewhere).

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

RobThree, Nov 29 '16 10:11

Thanks @RobThree some valid points there. Though the date pattern is matching on 30/02/2016 for me.

Regarding the HTML tag pattern, that's pretty useful for plain text HTML, like in an editor.

I've now removed the password pattern as that was proving particularly controversial.

PRs are very welcome if you want to make any improvements yourself.

lukehaas, Nov 29 '16 10:11

> Regarding the HTML tag pattern, that's pretty useful for plain text HTML, like in an editor.

<b>this html</b><b>would beg the differ</b>
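To see why that input is a counterexample, the snippet below runs the HTML Tags pattern from the top of this issue against it. It is only a sketch, and the variable names are made up for the illustration.

```javascript
// HTML Tags pattern exactly as listed at the top of this issue.
const htmlTag = /^<([a-z1-6]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// Two sibling <b> elements, as in the example above.
const input = '<b>this html</b><b>would beg the differ</b>';

const match = htmlTag.exec(input);

// The pattern reports ONE <b> element whose "content" runs straight
// across the boundary between the two real elements.
console.log(match[1]); // b
console.log(match[3]); // this html</b><b>would beg the differ
```

The greedy `(.*)` runs to the last closing tag in the string, so the two elements are reported as one.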

dwelle, Nov 29 '16 10:11

> Though the date pattern is matching on 30/02/2016 for me

Except that Feb. 30th doesn't exist 😉

> Regarding the HTML tag pattern, that's pretty useful for plain text HTML, like in an editor.

Except that there are a gazillion ways the regex will match incorrectly (demonstrated here) or otherwise cause trouble. Have you read the Stack Overflow answer I linked to?

> PRs are very welcome if you want to make any improvements yourself.

All the ones I pointed out are very case-specific and hard, if not impossible (HTML and email, for example), to get correct. Though I can think of improvements here and there, I'd suggest taking them all down; for most, if not all, of the regexes there are better ways of handling and validating the input (like simply parsing a date(time) value to 'validate' it, or sending an activation e-mail to verify an e-mail address).
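As a rough illustration of the parsing approach, the sketch below compares the dd/mm/yyyy pattern from the top of this issue with a round trip through JavaScript's built-in Date. The helper name `isRealDate` is hypothetical, and the sketch assumes '/' as the only separator.

```javascript
// Date pattern exactly as listed at the top of this issue.
const datePattern = /^(0?[1-9]|[12][0-9]|3[01])([ \/\-])(0?[1-9]|1[012])\2(19[0-9][0-9]|20[0-9][0-9])$/;

// Hypothetical helper: split the text into numbers, build a Date, and check
// that the Date still holds the same day, month and year. An impossible day
// such as Feb 30 rolls over into March, so the comparison fails.
function isRealDate(text) {
  const parts = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(text);
  if (!parts) return false;
  const day = Number(parts[1]);
  const month = Number(parts[2]);
  const year = Number(parts[3]);
  const date = new Date(Date.UTC(year, month - 1, day));
  return date.getUTCFullYear() === year &&
         date.getUTCMonth() === month - 1 &&
         date.getUTCDate() === day;
}

console.log(datePattern.test('30/02/2016')); // true  (regex accepts Feb 30)
console.log(isRealDate('30/02/2016'));       // false
console.log(isRealDate('29/02/2016'));       // true  (2016 is a leap year)
console.log(datePattern.test('01/01/1852')); // false (year outside 19xx/20xx)
console.log(isRealDate('01/01/1852'));       // true
```

The small regex here only splits the string into three numbers; the calendar rules are left to Date.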

Regexes do have their use, I'm not saying they don't. But, as said, for most (if not all) of the examples there are much better solutions.


Edit: Here's more I just stumbled upon.

RobThree, Nov 29 '16 10:11

Re: Emails. The only true way to validate emails is with basic pattern matching. Something along the lines of looking for @.* is the most you can possibly hope to do.
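A minimal sketch of that kind of check might look like the following. The function name and the exact rule (an '@' with at least one character on either side) are one reading of the comment above, not an established standard. Real verification would still mean sending a message to the address.

```javascript
// Hypothetical helper: accept anything that has at least one character
// before an '@' and at least one character after it. Nothing else is checked.
function looksLikeEmail(input) {
  return /.+@.+/.test(input);
}

console.log(looksLikeEmail('user@example.com')); // true
console.log(looksLikeEmail('user@'));            // false
console.log(looksLikeEmail('@example.com'));     // false
console.log(looksLikeEmail('no at sign'));       // false
```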

I completely agree with Rob on that point.

CSobol, Nov 29 '16 20:11

@CSobol email pattern has now been updated with this PR: https://github.com/lukehaas/RegexHub/pull/15

lukehaas, Nov 29 '16 20:11

The time pattern also lacks the ^ and $ anchors that the date pattern has; without them it matches "4:00" when you input "24:00".
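The effect is easy to reproduce. The 24-hour pattern below is a hypothetical stand-in, since the site's actual time pattern isn't quoted in this thread; only the presence or absence of ^ and $ matters here.

```javascript
// Hypothetical 24-hour time pattern, once without and once with anchors.
const unanchored = /(?:[01]?\d|2[0-3]):[0-5]\d/;
const anchored = /^(?:[01]?\d|2[0-3]):[0-5]\d$/;

// Without anchors the engine is free to skip the leading "2" and match
// the remainder of the input.
console.log('24:00'.match(unanchored)[0]); // 4:00

// With anchors the whole input has to be a valid time.
console.log(anchored.test('24:00')); // false
console.log(anchored.test('23:59')); // true
```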

fer22f, Nov 29 '16 22:11

I seem to run into a bug with the pattern document.body.innerHTML=flags//whoops ;)

bathos, Nov 30 '16 04:11

For email, several regexes can help filter out some bad formats.

A lot of sites still expect 'simple' emails, e.g. a maximum of 3 characters for the TLD (.com)! The question is whether you want a valid address or one that will work on almost all sites.

A few filters

Maximum length: 254, due to network protocols rather than the email specs (search the RFCs).

Minimum length: 7, like [email protected]

.{7,254}

Rough validation of min/max length blocks: .{1,248}@.{2,250}\..{2,64}

Tightening this formula so that all three lengths are enforced together is probably impossible in regex alone, since you need to know the length of each part; you have to use JavaScript, not just a regex.
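A sketch of what doing this in JavaScript could look like: split the address at the last '@' and the last '.', then check the length of each part. The function name and the LIMITS object are hypothetical. The upper bounds are the ones from the rough pattern above, and the lower bounds for the parts around '@' are relaxed to 1 character, which is an assumption of this sketch rather than anything taken from a spec.

```javascript
// Upper bounds taken from the rough pattern above; treat them as assumptions.
const LIMITS = { total: 254, local: 248, host: 250, tld: 64 };

// Hypothetical helper: checks lengths only, not which characters are used.
function hasPlausibleLengths(address) {
  if (address.length > LIMITS.total) return false;

  // Index of the last '@' equals the length of the local part.
  const at = address.lastIndexOf('@');
  if (at < 1 || at > LIMITS.local) return false;

  // Index of the last '.' in the domain equals the length of the host part.
  const domain = address.slice(at + 1);
  const dot = domain.lastIndexOf('.');
  if (dot < 1 || dot > LIMITS.host) return false;

  const tld = domain.slice(dot + 1);
  return tld.length >= 2 && tld.length <= LIMITS.tld;
}

console.log(hasPlausibleLengths('user@example.com'));                 // true
console.log(hasPlausibleLengths('user@localhost'));                   // false (no TLD)
console.log(hasPlausibleLengths('a'.repeat(250) + '@example.com'));   // false (too long)
```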

For the Latin character set only, assuming the case-insensitive flag is set (/..../i): [a-z][a-z0-9\._-]{0,246}[a-z0-9]@[a-z][a-z0-9\._-]{0,248}[a-z0-9]\.[a-z][a-z0-9\.]{0,61}[a-z] (This one still needs verifying.)

A bit more international, but invalid characters are not filtered out in general (spaces, tabs and, I think, control characters are, except DEL):

[!-\uFFEF]{1,248}@[!-\uFFEF]{2,250}\.[!-\uFFEF]{2,64}

64 is the maximum; today the longest existing TLD has 24 characters (XN--VERMGENSBERATUNG-PWB): http://data.iana.org/TLD/tlds-alpha-by-domain.txt

A few links

To test your suppositions: http://cobisi.com/support/kb/emailverify.net/verification-process/validation-levels

Free Mailgun.com validation api, not just the syntax: https://www.mailgun.com/email-validation

Explanation of unicode in regex: http://www.regular-expressions.info/unicode.html

For the lazy, here is one from a framework; I don't remember which one... But Mailgun is OK. Apparently it respects all the rules, except that it does not check the length (see above).

function is_valid_email_address(email_address) {
  // Case-insensitive pattern; the '.' between parts of the local part is escaped so it only matches a literal dot.
  var pattern = new RegExp(/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_{|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_{|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i);
  // Returns true when the whole address matches the pattern.
  return pattern.test(email_address);
}

TraderStf, Dec 02 '16 12:12

> The question is whether you want a valid address or one that will work on almost all sites.

That's an easy answer. When it comes to email addresses, you never want to stop a valid user from signing up via email address. You would much rather take a hundred junk email addresses than prevent one valid user from signing up or filling out a form.

CSobol, Dec 14 '16 16:12