noam Major flaw with $ logic

FYI, there is a major flaw in this regex simplifier's logic. $ does not represent the empty string; it represents the end of a string (or, with the /m modifier, the end of a line). So, $+ is meaningless, and $a can never match anything.

For example, foo$ matches foo but not foobar.

foo$
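
To make the point above concrete, here is a minimal sketch of how `$` behaves in a conventional (Perl-style) engine. It uses Python's `re` module as a stand-in; the variable names are only for illustration and have nothing to do with Noam.

```python
import re

# Perl-style semantics: `$` anchors the match at the end of the string.
end_anchored = re.compile(r"foo$")

matches_foo = end_anchored.search("foo") is not None        # "foo" ends right after "foo"
matches_foobar = end_anchored.search("foobar") is not None  # "bar" follows, so no match

# `$a` asks for an "a" after the end of the string, which cannot happen.
dollar_then_a = re.search(r"$a", "a") is not None

print(matches_foo, matches_foobar, dollar_then_a)
```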

Jan 10 '14 16:01 edcottrell

Hello Ed, thanks for the comment.

This is not actually a bug; it's a design choice, which I'll try to explain. Noam doesn't support Perl regexes (or any other particular regex flavor). The language for defining regular expressions in Noam is intentionally simple and minimal, closely akin to what you'd find in any automata/languages textbook. The goal of a Noam regular expression is to define a regular language, nothing more and nothing less. Specifically, the goal is not to let users match and slice up parts of text; that would be a silly thing to reimplement, as JavaScript regexes already do that job.

In this context, defining the start or the end of the string is meaningless - they are both implicitly there at the start and end of the regular expression. You can't search for a match somewhere in your string. You can only test if the whole string is in a language defined by a regular expression or if it is not.
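
As a rough illustration of the difference between searching inside a string and testing whole-string membership, here is a small sketch in Python. It uses Python's `re` module rather than Noam, and the helper names `found_somewhere` and `in_language` are hypothetical.

```python
import re

def found_somewhere(pattern: str, text: str) -> bool:
    """Text-matching view: does the pattern match anywhere inside `text`?"""
    return re.search(pattern, text) is not None

def in_language(pattern: str, text: str) -> bool:
    """Language view: is the whole of `text` a member of the language?"""
    return re.fullmatch(pattern, text) is not None

# "foo" occurs inside "foobar", but "foobar" is not in the language {"foo"}.
print(found_somewhere("foo", "foobar"))
print(in_language("foo", "foobar"))
print(in_language("foo", "foo"))
```

With whole-string membership, explicit start and end anchors add nothing, which is why they are not needed in this setting.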

With that in mind, very early on we decided to use the dollar symbol to represent the empty string (usually denoted by epsilon in textbooks) so that regular expressions containing it would be more readable and less error-prone. For example, you can't define an optional "a" with the regular expression "a?" as you normally might, because the question mark is not an operator at all; you'd use something like "a|$", which looks nicer than "a|". When you're defining a language, epsilons will be much more frequent than empty strings would be in a regex used to match text.
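
A minimal sketch of the "a|$" idea, written in Python rather than with Noam itself. The helper `epsilon_to_python` is hypothetical and is not Noam's parser: it assumes every `$` stands for the empty string and ignores escaping and the rest of Noam's grammar.

```python
import re

def epsilon_to_python(expr: str) -> str:
    """Replace each `$` (read as the empty string) with an empty group.

    Assumption for illustration only: no escaped `$` characters appear.
    """
    return expr.replace("$", "(?:)")

def in_language(expr: str, text: str) -> bool:
    """Whole-string membership test for an expression written with `$`."""
    return re.fullmatch(epsilon_to_python(expr), text) is not None

# "a|$" describes the language {"", "a"}, the same one as "a?" in Perl-style syntax.
accepted = [s for s in ["", "a", "aa", "b"] if in_language("a|$", s)]
same_as_optional = all(
    in_language("a|$", s) == (re.fullmatch(r"a?", s) is not None)
    for s in ["", "a", "aa", "b"]
)
print(accepted, same_as_optional)
```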

You can see a full explanation of the language for defining regular expressions here http://ivanzuzak.info/noam/webapps/regex_simplifier/ or in a comment around the 1450th line here https://github.com/izuzak/noam/blob/master/src/noam.re.js where the string representation of regular expressions is defined. I agree it might be helpful if we made this more explicit in the readme, but Noam started out with finite automata, their manipulation and visualization, and regular expressions were added afterwards primarily to make it easy to define languages.

Hope this clears it up. Cheers!

Jan 10 '14 17:01 ibudiselic

Hi Ivan,

Thanks for the kind reply. Your explanation makes perfect sense, given that you are using a special regex grammar. I appreciate you taking the time to reply and clarify the Noam expression language.

That said, may I encourage putting a disclaimer at the top of the page? The disclaimer would explain that the simplifier works with Noam regexes and that Noam regexes != Perl-compatible regexes. I came across the page directly via a Google search for a regex simplifier. Because I am already familiar with regexes, I didn't really read section 1, so I had no idea it was not processing Perl-style regexes until I tried it out and got unexpected results. The page doesn't currently mention Noam at all until section 5, well after the "meat" of the page, and doesn't clarify there that Noam regexes have their own syntax and grammar. I would anticipate that others will have similar surprises.

Best regards, Ed

Jan 10 '14 18:01 edcottrell

Thanks for the suggestion. I'm inclined to agree that we should make this clearer.

Ivan

Jan 15 '14 09:01 ibudiselic

Agreed, I came looking for the same thing.

Nov 14 '14 22:11 PlNG