
Ability to ignore certain productions

Open chromaticbum opened this issue 15 years ago • 30 comments

It would be nice to be able to tell the lexer/parser to ignore certain productions (i.e. whitespace and comment productions) so that it becomes unnecessary to litter all the other productions with comment/whitespace allowances. This may not be possible, though, given that lexing is built into parsing?

Thank you

chromaticbum avatar Oct 08 '10 11:10 chromaticbum

Agreed. Is there a clean way to do this at the moment?

benekastah avatar Jul 26 '11 04:07 benekastah

@benekastah: There is no clean way as of now.

This would be hard to do without changing how PEG.js works. Possible solutions include:

  1. Allow prepending a lexer before the generated parser.
  2. Embed information about ignored rules somewhere in the grammar. That would probably also mean distinguishing between the lexical and syntactical levels of the grammar — something I'd like to avoid.

I won't work on this now, but it's something to think about in the future.

dmajda avatar Aug 13 '11 08:08 dmajda

I would need this feature too.

Maybe you could introduce a "skip" token: if a rule returns that token, it would be ignored and get no node in the AST (i.e. no entry in the result array).

tlindig avatar Sep 28 '11 11:09 tlindig

I am looking for a way to do this as well.

I have a big grammar file (it parses the ASN.1 format for SNMP MIB files). I didn't write it, but I trivially transformed it from the original form to create a parser in PEG.js. (This is good. In fact, it's extremely slick that it took me less than 15 minutes to tweak it so that PEG.js would accept it.)

Unfortunately, the grammar was written assuming the parser would simply ignore whitespace and comments when it encounters them. Consequently, no real MIB files can be handled, because the parser stops at the first occurrence of whitespace.

I am not anxious to have to figure out the grammar so that I can insert all the proper whitespace rules (there are about 126 productions...). Is there some other way to do this?

NB: In the event that I have to modify the grammar by hand, I asked for help with some questions in a ticket in the Google Groups list. http://groups.google.com/group/pegjs/browse_thread/thread/568b629f093983b7

Many thanks!

richb-hanover avatar Oct 01 '11 19:10 richb-hanover

Thanks to the folks over on Google Groups. I think I got enough information to allow me to do what I want.

But I'm really looking forward to the ability in PEG.js to mark whitespace/comments as something to be ignored completely so that I wouldn't have to take a few hours to modify an otherwise clean grammar... Thanks!

Rich

richb-hanover avatar Oct 02 '11 14:10 richb-hanover

I agree with the assertion that PEG.js needs the ability to skip tokens. I may look into it, since if you want to write a serious grammar you will go crazy putting whitespace between every token.

rioki avatar Aug 13 '13 07:08 rioki

Since the generated parsers are modular, as a workaround you can create a simplistic lexer and use its output as input to the for-real one, e.g.:

elideWS.pegjs:

s = input:(whitespaceCharacter / textCharacter)* {
    var result = "";
    for (var i = 0; i < input.length; i++) result += input[i];
    return result;
  }

whitespaceCharacter = [ \n\t] { return ""; }

textCharacter = c:[^ \n\t] { return c; }

but that causes problems when whitespace is a delimiter -- like for identifiers
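
The delimiter problem is easy to see even outside PEG.js. A plain-JS sketch of the same two-stage idea (illustrative only, not the actual generated parser) shows why stripping whitespace up front loses information:

```javascript
// Stage 1: elide all whitespace before parsing, which is what the
// elideWS.pegjs grammar above does.
function elideWS(input) {
  return input.replace(/[ \n\t]/g, "");
}

// Stage 2 would then parse the stripped text. But identifiers that were
// separated only by whitespace now run together, so the boundary the
// parser needed is already gone:
console.log(elideWS("var foo")); // "varfoo"
```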

waTeim avatar Oct 26 '13 04:10 waTeim

Bumping into this issue quite often. But it's not easy to write a good lexer (you can end up duplicating a good chunk of the initial grammar to have a coherent lexer).

What I was thinking is to be able to define skip rules that can be used as alternatives whenever there's no match. This introduces the need for a non-breaking class though. Example with arithmetics.pegjs using Floats

Expression
  = Term (("+" / "-") Term)*

Term
  = Factor (("*" / "/") Factor)*

Factor
  = "(" Expression ")"
  / Float

Float "float"
  = "-"? # [0-9]+ # ("." # [0-9]+) // # means that skip rules cannot match

// skip rule marked by "!="
// skip rules cannot match the empty string
_ "whitespace"
  != [ \t\n\r]+

Still digesting this. Any feedback? Might be a very stupid idea.

andreineculau avatar Apr 19 '14 19:04 andreineculau

So the difference is that you want to distinguish when the overall engine is operating in lexer mode (whitespace is significant) and when not (whitespace is ignored).

Is there a case when you would want, as an option, to not ignore whitespace when in lexer mode? Or conversely, when not inside a regex? I think no.

Would the following be equivalent?

Float "-?[0-9]+("." [0-9]+)"

or otherwise extend peg to process typical regexes directly, where outside a quoted string (which includes regexes) whitespace is ignored.


waTeim avatar Apr 22 '14 20:04 waTeim

@waTeim Actually no.

Traditionally the parsing process is split into lexing and parsing. During lexing every character is of significance, including whitespace. But whitespace characters are then lexed into a "discard" token. The parser, when advancing to the next token, will then discard any discard tokens. The important part is that you can discard anything, not just whitespace. This behavior is exactly what @andreineculau is describing.

The basic idea for implementing this is to additionally check against all discard rules when transitioning from one state to the next.
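
The traditional lex-then-parse discard step described here can be sketched in plain JavaScript. The token rules and the `discard` flag below are illustrative, not PEG.js internals:

```javascript
// Each rule has an anchored regex and a discard flag; whitespace is lexed
// like everything else, then dropped before the parser ever sees it.
const rules = [
  { type: "ws",     re: /^[ \t\n\r]+/, discard: true  },
  { type: "number", re: /^[0-9]+/,     discard: false },
  { type: "op",     re: /^[+\-*/]/,    discard: false },
];

function lex(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const rest = input.slice(pos);
    const rule = rules.find(r => r.re.test(rest));
    if (!rule) throw new Error("no rule matches at position " + pos);
    const text = rest.match(rule.re)[0];
    pos += text.length;
    // The discard check happens here, when transitioning to the next token.
    if (!rule.discard) tokens.push({ type: rule.type, text });
  }
  return tokens;
}

console.log(lex("1 + 2")); // three tokens: number "1", op "+", number "2"
```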

rioki avatar Apr 23 '14 18:04 rioki

On Apr 23, 2014, at 2:54 PM, Sean Farrell [email protected] wrote:

@waTeim Actually no.

So we agree. The traditional approach is sufficient. There’s no need to have the strictly-parser part recognize the existence of discarded tokens and there’s no reason to make the lexer part behave conditionally (in a context sensitive way) w.r.t. recognizing tokens.

Therefore there’s no need to have glue elements (e.g. ‘#’) in the language because it suffices that

  1. tokens can be created solely from regex and are not context sensitive.
  2. tokens can be marked to be discarded without exception.


waTeim avatar Apr 23 '14 19:04 waTeim

Ok, then I misunderstood you. There may be cases for lexer states, but that is a totally different requirement and IMHO outside the scope of PEG.js.

rioki avatar Apr 24 '14 08:04 rioki

@waTeim @rioki Forget a bit about my suggestion.

Hands on, take this rule. If you would like to simplify the rule's grammar by taking away the *WS, then how would you instruct PEGjs to not allow *WS between field_name and :?

andreineculau avatar Apr 24 '14 17:04 andreineculau

@andreineculau Because your grammar is whitespace sensitive, this is not applicable. The discard tokens would be part of the grammar, the lexing part to be exact. I don't know what the big issue is here; this was already sufficiently solved in the 70s. Each and every language has its own skippable tokens and where they are applicable. The whitespace and comments are as much a part of the language definition and thus part of the grammar. It just turns out that with most languages the skippable tokens may be between each and every other token, and using a discard rule makes it WAY simpler than writing expr = WS lit WS op WS expr WS ";" for every rule. Just imagine a grammar like the one for C with whitespace handling!

I understand that retconning discard rules into pegjs is not easy, but that does not mean that it is not a laudable goal.

rioki avatar Apr 26 '14 09:04 rioki

Oh man, free response section! I have a lot to say, so sorry for the length.

  1. For the TL;DR people: if I could add any peg elements I wanted, I would have written it like this

header_field = field_name ":" field_value

whitespace(IGNORE) = [\t ]+

The addition I’d make is an options section that may be included in any production.

The http-bis language would not be limited by this re-write (see appendix a).

  2. My problem with the proposed #

It feels like you are exchanging requiring the user to fill the parser definition with a bunch of discard non-terminals (usually whitespace/delimiters) for requiring the user to fill the parser definition with a bunch of "here characters are not discarded" meta-characters, unnecessarily. Admittedly there would be fewer occurrences of this. It's the rare case when people actually consume delimiters and do something with them, and as I comment in appendix a, HTTP-bis is not one of those occurrences, just badly documented.

  3. User-defined parser states

But I can see how it would be easier on the parser definer to simply cut and paste the language spec from the definition, so if you must have something like this, then it could be done with lexical states as alluded to earlier by Sean. I think I'd do it in the following way.

production1(state==1) = stuff

production2(state==2) = stuff

production3 = stuff {state = 1}

production4 = stuff {state = 2}

In other words, just like lex/yacc, make it possible for productions to only be available if the system is in a particular state, and allow the user to set that state value.

  4. More options

Or you could make it easier on the user and more apparent to the reader with another option

production(DONTIGNORE) = stuff

Which would allow the parser to override the default action of discarding tokens marked as discard, but only for that one production. This is really the same as 3, just an easier read. It is less flexible than the # proposal, because either a production is all ignore or no ignore, but I don't think that extra flexibility is needed.

  5. Adding a parameter to getNextToken() allows context sensitivity

I think all this comes down to is (I'm making some assumptions here): currently, the parser part calls getNextToken(input), and what needs to happen instead is to add a parameter to it: getNextToken(input, options).

Appendix a) That HTTP-bis spec

Ok I’ve read some but have not read all of this

Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing draft-ietf-httpbis-p1-messaging-26

I don’t like the way they have defined their grammar. I don’t suggest changing the input it accepts, but I would not have defined it as they did. In particular, I don’t like that they have defined OWS and RWS and BWS, which all equate to exactly the same character string but in different contexts. They have defined

OWS ::== (SP | HTAB)*
RWS ::== (SP | HTAB)+
BWS ::== OWS

which is just repetition of tabs and spaces, for no good reason. They have made the language harder to parse (it requires the lexical analyzer to track its context) and they didn't need to do that.

They have defined OWS as "optional whitespace", BWS as "bad whitespace" (otherwise optional whitespace, but in the "bad" context, where it isn't necessary), and RWS as required whitespace, where it's necessary to delimit tokens. Nowhere is this whitespace used, except perhaps there might be a parser warning if it matches BWS ("detected unnecessary trailing whitespace" or some such), which is all delimiters do anyway.

In their spec, the only place RWS is used is here

Via = 1#( received-protocol RWS received-by [ RWS comment ] )

 received-protocol = [ protocol-name "/" ] protocol-version
                     ; see Section 6.7
 received-by       = ( uri-host [ ":" port ] ) / pseudonym
 pseudonym         = token

but 'protocol-version' is numbers and maybe letters, while 'received-by' is numbers and letters. In other words, the lexical analyzer is not going to correctly recognize these 2 parts unless they are separated by whitespace, and it's going to be a syntax error, with or without RWS being explicitly identified, if there is not at least 1 whitespace character. So just remove RWS from the productions altogether and treat whitespace everywhere as a delimiter; it doesn't change the language, just how it's documented.


waTeim avatar Apr 26 '14 23:04 waTeim

@waTeim I think you are going overboard with this. I have written quite a few parsers, and I think lexer states were never really useful as such. In most cases where I saw them, the lexer consumed block comments and it was "simpler" to put the lexer into "block comment mode" and write simpler patterns than the über pattern to consume the comment (and count lines).

I have never seen any proper use of lexer states stemming from the parser. The fundamental problem here is that with one look ahead, when the parser sees the token to switch states, the lexer has already erroneously lexed the next token. What you propose is almost impossible to implement without back-tracking and that is never a good feature in a parser.

When writing a grammar you basically define which productions are considered parsed and what can be skipped. In @andreineculau's example there are two options: either you handle whitespace in the parser, or you make the trailing ":" part of the token ([a-zA-Z0-9!#$%&'+-.^_|~]+ ":").
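
The second option can be checked with a plain JS regex (this is just an illustration of the idea, not PEG.js syntax; the hyphen is escaped here so it is a literal character rather than a range):

```javascript
// Fold the trailing ":" into the field-name token itself, so no
// whitespace can appear between the name and the colon.
const fieldNameColon = /^[a-zA-Z0-9!#$%&'+\-.^_|~]+:/;

console.log(fieldNameColon.test("Content-Type:"));  // true
console.log(fieldNameColon.test("Content-Type :")); // false: space before ":"
```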

rioki avatar Apr 28 '14 09:04 rioki

I might suggest turning the problem into specifying a whitelist—which portions do I want to capture and transform—instead of a blacklist. Although whitespace is one problem with the current capture system, the nesting of rules is another. As I wrote in Issue #66, the LPeg system of specifying what you want to capture directly, via transforms or string captures, seems more useful to me than specifying a handful of productions to skip and still dealing with the nesting of every other production.

See my comment in Issue #66 for a simple example of LPeg versus PEG.js with respect to captures. Although the names are a bit cryptic, see the Captures section of the LPeg documentation for the various ways that you can capture or transform a given production (or portion thereof).

Phrogz avatar Oct 11 '14 03:10 Phrogz

Hello, I've created a snippet to ignore some general cases: null, undefined and strings with only space symbols. It can be required in the head of the grammar file, like:

{
  var strip = require('./strip-ast');
}

The two ways to improve it:

  • Customizable filter for terms — to ignore the specific terms that a certain grammar requires.
  • Skip nested empty arrays — this can be done in a second stage after strip; it'll remove «pyramids» of nested empty arrays.

If anyone is interested, we can upgrade it to a package.
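
The snippet itself isn't inlined in the comment, so as a rough guess at its shape (the function name and exact rules are assumptions, not the actual strip-ast code), it might look like this, with the empty-array collapse folded into the same pass:

```javascript
// Drop null, undefined and whitespace-only strings from a parse result,
// and collapse any arrays that end up empty after stripping.
function strip(node) {
  if (Array.isArray(node)) {
    const kids = node.map(strip).filter(k => k !== undefined);
    return kids.length === 0 ? undefined : kids;
  }
  if (node === null || node === undefined) return undefined;
  if (typeof node === "string" && node.trim() === "") return undefined;
  return node;
}

console.log(strip(["a", "  ", [null, []], "b"])); // ["a", "b"]
```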

StreetStrider avatar Jun 09 '15 14:06 StreetStrider

@richb-hanover Where did your ASN.1 definition parser efforts land?

atesgoral avatar Aug 15 '17 14:08 atesgoral

@atesgoral - I bailed out. I didn't need a "real parser" - I only needed to isolate certain named elements in the target file.

So I did what any wimpy guy would do - used regular expressions. (And then I had two problems :-)

But it did the trick, so I was able to move on to the next challenge. Good luck in your project!

richb-hanover avatar Aug 15 '17 14:08 richb-hanover

Having had a look at chevrotain and its skip option, something like this is hugely desirable.

Too often we find ourselves writing something like this:

Pattern = head:PatternPart tail:( WS "," WS PatternPart )*
{
  return {
    type: 'pattern',
    elements: buildList( head, tail, 3 )
  };
}

Would be cool if we could write this instead:

WS "whitespace" = [ \t\n\r] { return '@@skipped' }

IgnoredComma = "," { return '@@skipped' }

Pattern = head:PatternPart tail:( WS IgnoredComma WS PatternPart )*
{
  return {
    type: 'pattern',
    elements: [head].concat(tail)
  };
}

Izhaki avatar Aug 28 '18 21:08 Izhaki

@richb-hanover, and anybody else who got here in search of a similar need, I ended up writing my own parsers, too: https://www.npmjs.com/package/asn1exp and https://www.npmjs.com/package/asn1-tree

atesgoral avatar Aug 29 '18 00:08 atesgoral

A skip would be relatively easy to implement using an ES6 Symbol, or maybe more durably by passing the parser a predicate at parse time (I prefer the latter option)

StoneCypher avatar Feb 03 '20 03:02 StoneCypher

Just stumbled upon this too. Not knowing anything about the innards of PEG.js, lemme throw a bone out there...

When we write a rule, at the end of it we can add a return block. In that block, we can call things like text() and location(). These are internal functions.

Somewhere in the code the returned value of that block goes into the output stream.

So what would need to change in PEG.js if I want to skip a value returned by a rule whenever that value is the return of calling a skip local function?

e.g. comment = "//" space ([^\n])* newline { return skip() }

As mentioned above, skip() could return a Symbol, which is then checked by the code somewhere and removed. Something like what Izhaki said, but internal to the library.
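
A Symbol-based sketch of that idea (the names SKIP and filterSkipped are illustrative; none of this is PEG.js API) has one nice property over a string sentinel like '@@skipped': no input text can ever collide with it.

```javascript
// A unique sentinel: unlike the string '@@skipped', no parsed string
// can ever equal it, so filtering has no false positives.
const SKIP = Symbol("skip");

// What a library-side cleanup pass over rule results could look like.
function filterSkipped(nodes) {
  return nodes
    .filter(n => n !== SKIP)
    .map(n => (Array.isArray(n) ? filterSkipped(n) : n));
}

console.log(filterSkipped(["//", " comment", SKIP, ["a", SKIP]]));
// logs ["//", " comment", ["a"]]
```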

darlanalves avatar Dec 02 '20 23:12 darlanalves

I don't understand your question. Are you looking for a way to fail a rule under some circumstances? Use &{...} or !{...}. Otherwise, just don't use the returned value of the comment rule:

seq = comment r:another_rule { return r; };
choice = (comment / another_rule) { <you need to decide what to return instead of "comment" result> };

Mingun avatar Dec 03 '20 03:12 Mingun

If it helps anyone, I ignore whitespace by having my top-level rule filter the array of results.

Example:


program
  = prog:expression+ { return prog.filter(a => a) }

expression
  = float
  / number
  / whitespace

float
  = digits:(number "." number) { return parseFloat(digits.join("")) }

number
  = digits:digit+ { return parseInt(digits.join(""), 10) }

digit
  = [0-9]

whitespace
  = [ \t\r\n] { return undefined }

This will happily parse input while keeping whitespace out of the result array. This will also work for things like comments, just have the rule return undefined and the top level rule will filter it out

stoneRdev avatar Apr 14 '21 23:04 stoneRdev

That only works for top-level productions. You have to manually filter every parent that could contain a filterable child.

StoneCypher avatar Apr 15 '21 00:04 StoneCypher

@StoneCypher True, it does require some top level work, but it works for me, and I think as long as the grammar isn't too complex, one should be able to get away with having a top level filter.

Other than that, all I can think of is to have a top level function that filters whitespace from input and pass every match through it. Slower for sure, and it requires a lot more calls, but easy if you (like me) pass everything into a token generator. You can call the filter function from where you generate tokens, and then you only have to worry about generating your tokens; the whitespace is more or less automatically filtered.

stoneRdev avatar Apr 15 '21 19:04 stoneRdev

One of the things I liked about the current HEAD of pegjs is its (undocumented) support for picking fields without having to create labels and do return statements. It looks like this:

foo = @bar _ @baz
bar = $"bar"i
baz = $"baz"i
_ = " "*
parse('barbaz') // returns [ 'bar', 'baz' ]

I feel like this gives nice, clean, explicit syntax for this use case plus a bunch of others.

hildjj avatar Apr 15 '21 19:04 hildjj

@hildjj This is exactly what I needed in combination with parsing lists. Peggy is wonderful, thank you for your effort!

I guess this example would also make a good candidate for the documentation page, as it illustrates the usage of @ (which I didn't quite understand from reading the docs).

markus-fe avatar Jun 13 '21 14:06 markus-fe