coq-serapi
Tests for tokenization off-by-one locs
Consider the following sertop
session with the latest SerAPI from the v8.10
branch (formatted for readability and with unnecessary output removed):
(Add () "Notation \"[ 'arg[' ord ]\" := (ord + 1).")
(Tokenize "Notation \"[ 'arg[' ord ]_( x )\" := ([arg[ord] + x).")
(Answer 1
(ObjList
((CoqTok
(((v(IDENT Notation))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 0)(ep 8)))))
((v(STRING"[ 'arg[' ord ]_( x )"))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 9)(ep 31)))))
((v(KEYWORD :=))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 32)(ep 34)))))
((v(KEYWORD"("))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 35)(ep 36)))))
((v(KEYWORD"["))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 36)(ep 37)))))
((v(KEYWORD"arg["))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 37)(ep 40)))))
((v(IDENT ord))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 41)(ep 44)))))
((v(KEYWORD"]"))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 44)(ep 45)))))
((v(KEYWORD +))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 46)(ep 47)))))
((v(IDENT x))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 48)(ep 49)))))
((v(KEYWORD")"))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 49)(ep 50)))))
((v(KEYWORD .))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 50)(ep 51)))))
((v EOI)(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 51)(ep 52))))))))))
Note that ep - bp equals the token length for every token except KEYWORD"arg[", which by this measure has size 3 even though the token text is 4 characters long, which is clearly incorrect.
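To make the discrepancy concrete, here is a small standalone check (plain Python, independent of SerAPI) that recomputes ep - bp for each loc and compares it against the length of the corresponding token text; the positions are copied from the output above.

```python
# Token texts and (bp, ep) offsets copied from the sertop output above.
tokens = [
    ("Notation", 0, 8),
    ("\"[ 'arg[' ord ]_( x )\"", 9, 31),  # STRING: span includes the quotes
    (":=", 32, 34),
    ("(", 35, 36),
    ("[", 36, 37),
    ("arg[", 37, 40),  # KEYWORD "arg[": ep - bp = 3, but the text has 4 chars
    ("ord", 41, 44),
    ("]", 44, 45),
    ("+", 46, 47),
    ("x", 48, 49),
    (")", 49, 50),
    (".", 50, 51),
]

for text, bp, ep in tokens:
    if ep - bp != len(text):
        print(f"mismatch: {text!r} has length {len(text)} but ep - bp = {ep - bp}")
# prints: mismatch: 'arg[' has length 4 but ep - bp = 3
```

Every token checks out except the compound keyword, which is exactly the off-by-prefix behavior described below.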
@ejgallego could this be a bug in how Coq produces the locs?
@palmskog I need to look deeper into this, but it's surely a bug in Coq. When I first tweaked the parsing API to expose locs we found many bugs; basically the loc data had not been tested for a long time. @maximedenes did a lot of work to fix all the wrong loc problems back then, so he's the expert there.
@maximedenes should I open a Coq issue about this (even though I don't know how to reproduce it without SerAPI)?
Here is an extrapolation of the above example that still reports size 3 via locs for the arg...:
Notation "[ 'arg|||' ord ]" := (ord + 1).
Notation "[ 'arg|||' ord ]_( x )" := ([arg|||ord] + x).
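As a quick illustration (plain Python; the positions below are computed from the source line, not taken from an actual sertop run), the loc for the 'arg|||' keyword in the second Notation should span its full six characters, but per the bug only the three-letter identifier prefix is counted:

```python
# Hypothetical illustration of the suspected off-by-prefix loc (not sertop output):
# the keyword 'arg|||' in the notation body should get a six-character span,
# but per the bug only its identifier-like prefix "arg" is counted.
src = "Notation \"[ 'arg|||' ord ]_( x )\" := ([arg|||ord] + x)."
kw = "arg|||"
bp = src.index(kw, src.index(":="))      # occurrence in the notation body
print("bp =", bp)
print("correct ep  =", bp + len(kw))     # span of the full keyword
print("buggy ep    =", bp + len("arg"))  # span of the identifier prefix only
```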
I don't know how to reproduce it without SerAPI
You could likely try to do it from Drop.
I think we at least figured out the overall circumstances under which this problem occurs, and we can work around it. Specifically, it happens when a notation or tactic notation defines a "token" that has a prefix following the "identifier" conventions (e.g., consisting only of letters). Only the location of this prefix is reported through the token stream.
Another example:
Tactic Notation "near=>" ident(x) := intros x.
Now, if we tokenize near=> x, the location of the keyword near=> is reported as the location of near alone, i.e., the = character terminated the "identifier" pseudo-token.
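A toy model of this suspected behavior can be sketched as follows. This is NOT Coq's actual lexer, just a minimal Python sketch under the assumption stated above: the end position is recorded after scanning the identifier-like prefix and is not updated when the match extends to the full keyword.

```python
import re

# Toy model (NOT Coq's actual lexer) of the suspected behavior: the lexer
# scans an identifier-like prefix, records its end position, and then
# extends the match to a full keyword without updating that position.
KEYWORDS = {"arg[", "near=>"}

def buggy_token_span(src: str, bp: int):
    """Return (token, bp, ep) where ep covers only the identifier prefix."""
    m = re.match(r"[A-Za-z]+", src[bp:])
    prefix_end = bp + m.end()            # end position recorded here...
    for kw in KEYWORDS:
        if src.startswith(kw, bp):       # ...but the token keeps growing
            return kw, bp, prefix_end    # loc reflects only the prefix
    return m.group(), bp, prefix_end

print(buggy_token_span("arg[ord]", 0))   # ('arg[', 0, 3): 4-char token, span 3
print(buggy_token_span("near=> x", 0))   # ('near=>', 0, 4): span covers "near" only
```

Both reported cases fall out of the same pattern: the span stops where the letter-only prefix stops.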
Umm, do you have a regular self-contained simple example using just Notation
, such that we can compare the actual output with the correct output?
As you suggested in #161, it may be easiest if I create a PR adding a test case using sertok
on a .v
file that includes the expected results and demonstrates the discrepancy.
8.12 changed a few things in this area; let's reevaluate once the 0.12 branch is ready.
@maximedenes should I open a Coq issue about this (even though I don't know how to reproduce it without SerAPI)?
If a bug is confirmed on the Coq side, feel free to report it even if the reproduction is a bit involved (e.g., it requires SerAPI).