node-ebnf

Unable to parse valid W3C EBNF

Open shellscape opened this issue 3 years ago • 8 comments

The grammar located here https://github.com/transpect/css-tools/blob/master/ebnf-scheme/CSS3.ebnf is valid W3C EBNF, as verified on railroad https://bottlecaps.de/rr/ui. This package throws an error that it could not parse the grammar at /node_modules/ebnf/dist/Grammars/W3CEBNF.js:288:19.

So it looks like there are some compatibility issues. Perhaps the grammar for W3C is out of date, given the age of the package?

shellscape avatar Jul 13 '22 13:07 shellscape

Additionally, this package cannot parse the EBNF grammar that railroad shows on its site:

import { Grammars } from 'ebnf';

const w3grammar = `Grammar ::= Production*
Production ::= NCName '::=' ( Choice | Link )
NCName ::= [http://www.w3.org/TR/xml-names/#NT-NCName]
Choice ::= SequenceOrDifference ( '|' SequenceOrDifference )*
SequenceOrDifference ::= (Item ( '-' Item | Item* ))?
Item ::= Primary ( '?' | '*' | '+' )*
Primary ::= NCName | StringLiteral | CharCode | CharClass | '(' Choice ')'
StringLiteral ::= '"' [^"]* '"' | "'" [^']* "'"
/* ws: explicit */
CharCode ::= '#x' [0-9a-fA-F]+
CharClass ::= '[' '^'? ( Char | CharCode | CharRange | CharCodeRange )+ ']'
Char ::= [http://www.w3.org/TR/xml#NT-Char]
CharRange ::= Char '-' ( Char - ']' )
CharCodeRange ::= CharCode '-' CharCode
Link ::= '[' URL ']'
URL ::= [^#x5D:/?#]+ '://' [^#x5D#]+ ('#' NCName)?
Whitespace ::= S | Comment
S ::= #x9 | #xA | #xD | #x20
Comment ::= '/*' ( [^*] | '*'+ [^*/] )* '*'* '*/'`;

const rules = Grammars.W3C.getRules(w3grammar);

This also fails with throw new Error('Could not parse ' + source); at the same line and position.

shellscape avatar Jul 13 '22 13:07 shellscape

Hello, can you try ending the document/grammar string with a line-ending character?

menduz avatar Jul 15 '22 12:07 menduz

Your Char production looks hosed:

Char ::= [http://www.w3.org/TR/xml#NT-Char]

(A URL doesn't belong in a bracket expression.)

kjhughes avatar Jul 27 '22 19:07 kjhughes

@kjhughes that's straight from W3C

shellscape avatar Jul 27 '22 20:07 shellscape

The RHS is clearly meant to be metadata / documentation, not an EBNF regex. The URL references this EBNF:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

kjhughes avatar Jul 27 '22 20:07 kjhughes

@menduz : Just tried adding a newline at the end and that seemed to do the trick!

Might be worthwhile to not fail on no final newline character?
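Until the parser tolerates a missing final newline, a caller-side workaround could look like the sketch below. `ensureTrailingNewline` is a hypothetical helper, not part of the `ebnf` package, and it only addresses the missing-newline case reported in this thread; other grammars may still fail for unrelated reasons.

```typescript
// Hypothetical helper (not part of the ebnf package): make sure the grammar
// source ends with a line-ending character before handing it to the parser.
function ensureTrailingNewline(source: string): string {
  // Without the `m` flag, `$` only matches at the very end of the string.
  return /[\r\n]$/.test(source) ? source : source + '\n';
}

// Assumed usage, based on the getRules call shown earlier in this thread:
//   import { Grammars } from 'ebnf';
//   const rules = Grammars.W3C.getRules(ensureTrailingNewline(w3grammar));
```

The helper leaves a string that already ends in `\n` or `\r` unchanged, so it is safe to apply unconditionally.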

jgeewax avatar Oct 05 '22 06:10 jgeewax

I've tried adding a newline and am still not having any success. I've also been trying to parse https://github.com/messagetemplates/grammar/blob/master/message-template.ebnf, without success.

jimmcslim avatar Oct 28 '22 05:10 jimmcslim

Yes, adding a newline at the end of the string is a great tip! Additionally, even though the parser only gives you a yes/no as to whether the grammar parsed successfully, you can quickly narrow down the problem in the playground

https://menduz.github.io/ebnf-highlighter/

by starting with just one line at a leaf of your parse tree and building your EBNF file back up from there.

e.g. does this parse?

_LETTER-OR-DIGIT ::= [A-Za-z0-9]

No. How about this?

_LETTERORDIGIT ::= [A-Za-z0-9]

No. How about now?

LETTERORDIGIT ::= [A-Za-z0-9]

Yes. So does W3C EBNF not support an NCName entity starting with an underscore? Well, let's look at the node-ebnf source code. This is the top of W3CEBNF.ts:

// https://www.w3.org/TR/REC-xml/#NT-Name
// http://www.bottlecaps.de/rr/ui

// Grammar	::=	Production*
// Production	::=	NCName '::=' Choice
// NCName	::=	[http://www.w3.org/TR/xml-names/#NT-NCName]
// Choice	::=	SequenceOrDifference ( '|' SequenceOrDifference )*
// SequenceOrDifference	::=	(Item ( '-' Item | Item* ))?
// Item	::=	Primary ( '?' | '*' | '+' )?
// Primary	::=	NCName | StringLiteral | CharCode | CharClass | '(' Choice ')'
// StringLiteral	::=	'"' [^"]* '"' | "'" [^']* "'"
// CharCode	::=	'#x' [0-9a-fA-F]+
// CharClass	::=	'[' '^'? ( RULE_Char | CharCode | CharRange | CharCodeRange )+ ']'
// RULE_Char	::=	[http://www.w3.org/TR/xml#NT-RULE_Char]
// CharRange	::=	RULE_Char '-' ( RULE_Char - ']' )
// CharCodeRange	::=	CharCode '-' CharCode
// RULE_WHITESPACE	::=	RULE_S | Comment
// RULE_S	::=	#x9 | #xA | #xD | #x20
// Comment	::=	'/*' ( [^*] | '*'+ [^*/] )* '*'* '*/'

That tells us to look it up here: http://www.w3.org/TR/xml-names/#NT-NCName

Click through to Name: https://www.w3.org/TR/REC-xml/#NT-Name

Click through to NameStartChar: https://www.w3.org/TR/REC-xml/#NT-NameStartChar

Oh dear, it does look to me like you're supposed to be able to start an NCName entity with an underscore. So it does seem a shame that node-ebnf won't parse this. But hopefully what I've been able to demonstrate about how I would isolate a fault and investigate the cause is helpful?
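To double-check that reading of the spec, here is a minimal sketch that transcribes the XML 1.0 NameStartChar and NameChar productions, minus the colon as NCName requires, into a regular expression. The names `NC_NAME_START`, `NC_NAME_CHAR`, and `isNCName` are made up for this illustration and are not taken from node-ebnf.

```typescript
// NameStartChar from XML 1.0 (fifth edition), with ':' left out for NCName.
const NC_NAME_START =
  'A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}\\u{370}-\\u{37D}' +
  '\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}' +
  '\\u{3001}-\\u{D7FF}\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}';

// NameChar adds '-', '.', digits, #xB7, and two combining/joining ranges.
const NC_NAME_CHAR =
  NC_NAME_START + '\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}';

// The 'u' flag enables the \u{...} code point escapes used above.
const NC_NAME = new RegExp(`^[${NC_NAME_START}][${NC_NAME_CHAR}]*$`, 'u');

// Hypothetical helper: true if `name` matches the NCName production.
function isNCName(name: string): boolean {
  return NC_NAME.test(name);
}

// The three rule names tried in the playground above:
for (const name of ['_LETTER-OR-DIGIT', '_LETTERORDIGIT', 'LETTERORDIGIT']) {
  console.log(name, isNCName(name));
}
```

By this check all three rule names are valid NCNames, which supports the conclusion that the leading underscore is being rejected by the parser's own grammar and not by the spec.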

Antony74 avatar Jun 26 '23 14:06 Antony74