parsimmon
parsimmon copied to clipboard
Parsimmon.index yields incorrect position for tokens right before a newline ('\n')
const parser = P.seqMap(P.index, P.digits, P.index, function (start, value, end) {
console.log(start, value, end);
return value;
});
parser.parse('1234\n');
parser.parse('1234');
outputs:
{offset: 0, line: 1, column: 1} '1234' {offset: 4, line: 2, column: 0} {offset: 0, line: 1, column: 1} '1234' {offset: 4, line: 1, column: 5}
I suppose these two outputs should be identical.
Hi @hillin,
This is actually worked as intended. Since Parsimmon has already finished parsing 1234
, the "cursor position" is on the \n
1 2 3 4 \n
^^
After successfully parsing a string, Parsimmon advances to the next character. I understand why this is confusing, but this is both internally easier and what a lot of APIs expect for source code ranges.
The second parse is actually past the end of the document for this reason as well.
Thanks @wavebeem . Let's extend the example a little bit:
const parser = P.seqMap(P.index, P.digits, P.index, P.newline, P.index, function (start, value, middle, newline, end) {
console.log(start, value, middle, newline, end);
return value;
});
parser.parse('1234\n');
The output is:
{offset: 0, line: 1, column: 1} '1234' {offset: 4, line: 2, column: 0} '\n' {offset: 5, line: 2, column: 1}
This shows that Parsimmon treats the newline character \n
as the beginning of a new line, rather than the end of the previous line, which is the counterintuitive part to me. Technically \n
is the control character that instructs the device (typewriter, or a text program) to start a new line, the new line must happen after it, so \n
has to be with the previous line. That's why it's also called the end of line (EOL).
In reality, we are using Parsimmon to create a parser for a DSL, which then will be used in the monaco editor. Parsimmon's interpretation of newline has created a little difficulty to us when working with monaco's text range API (e.g. https://microsoft.github.io/monaco-editor/api/interfaces/monaco.editor.IMarkerData.html), which does not work as expected if the end of a range is, as P.index
says, at column 0 of the next line, rather than at the last column (+1) of the current line. This makes P.index
less ideal for marking text ranges, but we don't have a better alternative. All we can do is to fix those indices manually, which has to be done in another pass after parsing because it's hard to know the length of the previous line while parsing.
I can see the argument both ways and I'm pretty sure I've used software that expects the end of the token to exclusive rather than inclusive.
var index = Parsimmon(function(input, i) {
return makeSuccess(i, makeLineColumnIndex(input, i));
});
///////////////////////////////////////////////////////////////////////////
var lastIndex = Parsimmon(function(input, i) {
return makeSuccess(i, makeLineColumnIndex(input, i - 1));
});
///////////////////////////////////////////////////////////////////////////
At a glance this seems to do what you want. You'll have to add it directly into parsimmon since makeLineColumnIndex
is not exposed. With this approach, you can use index
at the start of a token and lastIndex
after the token to get the full inclusive range, instead of exclusive.
Due to the ridiculous complexity of this function, it's not something I'm interested in exposing at this time, but I could maybe see the case for a P.lastIndex
. There are entirely too many nearly useless functions in Parsimmon, but this one would be extremely tricky to do efficiently outside of core. So I would consider the feature if you want to make a pull request with tests and documentation for it.
P.node
and P.mark
would still maintain their existing behavior for backwards compatibility though, of course.
Strictly speaking my suggestion still won't use your preferred logic for newline line/column numbering, but it will skip the issue as long as you don't actually need to include the newline in your range.
Otherwise, if you really really want the P.index
numbering to change and not just add a new mode, that's not something I would entertain at this time since it's a breaking change.
Well I think here is the misunderstanding: I'm totally OK to use an exclusive end index for a token, and I consent it's the right way to do this. My point is, the end index should not span into the next line if the token does not.
So for the example in the original post, for both input I'd expect the end index of 1234
to be at {line: 1, column: 5}
(which is exclusive), rather than {line: 2, column: 0}
, because again, the token 1234
did not have any part of it laid at line 2.
Here is another example: for the input 1234\n5678
, I'd expect it to be tokenized as:
Token | Expected Range | Current Range |
---|---|---|
1234 |
L1C1 - L1C5 | L1C1 - L2C0 |
\n |
L1C5 - (L2C0 or L1C6) | L2C0 - L2C1 |
5678 |
L2C1 - L2C5 | L2C1 - L2C5 |
- The current end indexing of
1234
is not ideal, because the token never touches line 2, yet we still say it ends at the beginning of line 2, which may be true but not really helpful. I'd prefer it ends at the end of line 1. - The
\n
is also less than ideal, because although I can't find a specification, intuitively I feel it should be the last character of line 1, rather than the first (actually, 0th as all the indices are 1-based) character of line 2. I'll leave this one debatable. -
5678
is indexed as expected, but only if it's not followed by a\n
.
the end index should not span into the next line if the token does not
This still seems to me like the ideal solution is to use inclusive ranges.
I feel it should be the last character of line 1
I see your point more and more. In UNIX, a line is terminated with a newline, not separated by it. it definitely feels like it's part of the line, like you said. This change won't happen within Parsimmon v1 since it would be a breaking change, but I'll definitely think about this going forward.
I'm going to reference this issue in https://github.com/jneen/parsimmon/issues/230 because it's important to think about for v2.
I may end up implementing a change for this in my sister library https://github.com/wavebeem/bread-n-butter/issues/34 first to see how I feel about it. But I don't plan on working on it any time soon.
This still seems to me like the ideal solution is to use inclusive ranges
To me exclusive ending index is still the legitimate way to go. We just need some kind of virtual index, which does not actually map to a character in the source. This is the standard design of almost all range constructs, the Range in javascript selection API, System.Range in .net and range() in python to name a few.
One thought is, we don't have to change the behavior of Parsimmon.index
, which semantically means the current text pointer in the scanner and should reflect how Parsimmon interprets \n
(which, as you said, is breaking to change thus not desirable in v1); but we can have another function to correct mark the range of a parsed result. E.g.
function mark<T>(parser: Parser<T>): Parser<{value: T; begin: Index; end: Index}>;
// in which the end index should be the ideal result mentioned above
Actually this could be very useful (in my use cases) because I find myself always using the Parsimmon.index
in pair to mark ranges.
The mark method already does this. But .mark doesn't take any options right now. You could add an options object to that method that implements the alternative line number behavior you suggested.
I would certainly review the PR if you want to work on it.
https://github.com/jneen/parsimmon/blob/master/API.md#parsermark
The mark method already does this
Oops, I've missed that one.
I would certainly review the PR if you want to work on it.
I'll see what I can do with it.
BTW for now we've switched to line-based parsing, as a workaround of this issue; as well as better supporting for partially tokenization (as expected by the monaco editor).