`RFC3986` Path rule issue?
It seems that path should also be reordered, right?
path-abempty always matches any string as it accepts 0 iterations.
This is a snippet I am referring to:
"path = path-abempty / path-absolute / path-noscheme / path-rootless / path-empty",
'path-abempty = *( "/" segment )',
'path-absolute = "/" [ segment-nz *( "/" segment ) ]',
'path-noscheme = segment-nz-nc *( "/" segment )',
'path-rootless = segment-nz *( "/" segment )',
"path-empty = 0pchar",
path-empty also looks similar to path-abempty. Do you think it would be better to reorder the rules in the following order:
path-absolute / path-noscheme / path-rootless / path-abempty and remove path-empty?
Did you encounter a problem because of this? I think the current version of abnf has enough backtracking to handle this case. Probably we need a test or two to ensure that this isn't a problem.
Did you encounter a problem because of this? I think the current version of abnf has enough backtracking to handle this case.
Hey, thank you for a swift reply!
Actually, nope, don't have any problems so far, but I have not tested for this specific scenario.
It just looked peculiar when I read the ABNF notation, so I may be utterly wrong here.
Sorry if I confused you!
You're not wrong to make this observation; there are several rules in the various RFCs akin to this one. I should add a test for this rule to ensure that the parser generated for RFC3886 does what one expects.
I saw that you reordered dec-octet and that is what I had in mind when I was googling for problems with dec-octet. This is actually how I found your repo in the first place 😅
I am talking about this part of the code
# alternatives reordered for correct matching.
'dec-octet = "25" %x30-35 / "2" %x30-34 DIGIT / "1" 2DIGIT / %x31-39 DIGIT / DIGIT',
(On a side note, really impressive project!)
Another problem within RFC3986 I am feeling is happening is path rule, and I was wondering if you had any issues with this rule or noticed something similar?
At least it seems that path-abempty and path-empty are similar when matching for an empty string. Given both of them match for it, path-abempty should always win. (not a big problem though that one of the rules is not used).
At the same time, it seems that one can match path-abempty with literally any string as it may converge to matching an empty substring, and given that the query and fragment are not the children of path rule, I am not 100% sure how it is resolved.
Take this with a grain of salt, as I may be wrong, though would be happy if it turns out to be helpful.
@dekkers I actually figured it out, sorry for the hustle.
It seems that path rule itself is not used anywhere in the grammar.
Path children expressions are used in heir-part and relative-part
...
'hier-part = "//" authority path-abempty / path-absolute / path-rootless / path-empty',
...
'relative-part = "//" authority path-abempty / path-absolute / path-noscheme / path-empty',
Though in these rules path-abempty is protected by terminal // symbols that help to select between alterations.
Really confusing definitions! So URI-rule should be safe and sound.
Though the issues hold for path-rule.
The snippet below (with empty string) always produces AST with one child: path-abempty, so it does not reach path-empty.
src = ''
node, start = rfc3986.Rule('path').parse(src, 0)
print(str(node))
Node(name=path, children=[Node(name=path-abempty, children=[])])