coq-serapi
Tests for tokenization off-by-one locs
Consider the following sertop
session with the latest SerAPI from the v8.10
branch (formatted for readability and with unnecessary output removed):
(Add () "Notation \"[ 'arg[' ord ]\" := (ord + 1).")
(Tokenize "Notation \"[ 'arg[' ord ]_( x )\" := ([arg[ord] + x).")
(Answer 1
(ObjList
((CoqTok
(((v(IDENT Notation))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 0)(ep 8)))))
((v(STRING"[ 'arg[' ord ]_( x )"))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 9)(ep 31)))))
((v(KEYWORD :=))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 32)(ep 34)))))
((v(KEYWORD"("))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 35)(ep 36)))))
((v(KEYWORD"["))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 36)(ep 37)))))
((v(KEYWORD"arg["))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 37)(ep 40)))))
((v(IDENT ord))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 41)(ep 44)))))
((v(KEYWORD"]"))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 44)(ep 45)))))
((v(KEYWORD +))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 46)(ep 47)))))
((v(IDENT x))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 48)(ep 49)))))
((v(KEYWORD")"))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 49)(ep 50)))))
((v(KEYWORD .))(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 50)(ep 51)))))
((v EOI)(loc(((fname ToplevelInput)(line_nb 1)(bol_pos 0)(line_nb_last 1)(bol_pos_last 0)(bp 51)(ep 52))))))))))
Note that ep - bp equals the token length for every token except KEYWORD"arg[", which by this measure has size 3 even though the token text is 4 characters long, which is clearly incorrect.
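To make the discrepancy concrete, here is a small standalone check (plain Python, independent of SerAPI) that recomputes ep - bp for each loc and compares it against the length of the corresponding token text; the positions are copied from the output above.

```python
# Token texts and (bp, ep) offsets copied from the sertop output above.
tokens = [
    ("Notation", 0, 8),
    ("\"[ 'arg[' ord ]_( x )\"", 9, 31),  # STRING: span includes the quotes
    (":=", 32, 34),
    ("(", 35, 36),
    ("[", 36, 37),
    ("arg[", 37, 40),  # KEYWORD "arg[": ep - bp = 3, but the text has 4 chars
    ("ord", 41, 44),
    ("]", 44, 45),
    ("+", 46, 47),
    ("x", 48, 49),
    (")", 49, 50),
    (".", 50, 51),
]

for text, bp, ep in tokens:
    if ep - bp != len(text):
        print(f"mismatch: {text!r} has length {len(text)} but ep - bp = {ep - bp}")
# prints: mismatch: 'arg[' has length 4 but ep - bp = 3
```

Every token checks out except the compound keyword, which is exactly the off-by-prefix behavior described below.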
@ejgallego could this be a bug in how Coq produces the locs?
@palmskog I need to look deeper into this, but it's surely a bug in Coq. When I first tweaked the parsing API to expose locs we found many bugs; basically the loc data had not been tested for a long time. @maximedenes did a lot of work to fix all the wrong loc problems back then, so he's the expert there.
@maximedenes should I open a Coq issue about this (even though I don't know how to reproduce it without SerAPI)?
Here is an extrapolation of the above example that still reports size 3 via locs for the arg...:
Notation "[ 'arg|||' ord ]" := (ord + 1).
Notation "[ 'arg|||' ord ]_( x )" := ([arg|||ord] + x).
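As a quick illustration (plain Python; the positions below are computed from the source line, not taken from an actual sertop run), the loc for the 'arg|||' keyword in the second Notation should span its full six characters, but per the bug only the three-letter identifier prefix is counted:

```python
# Hypothetical illustration of the suspected off-by-prefix loc (not sertop output):
# the keyword 'arg|||' in the notation body should get a six-character span,
# but per the bug only its identifier-like prefix "arg" is counted.
src = "Notation \"[ 'arg|||' ord ]_( x )\" := ([arg|||ord] + x)."
kw = "arg|||"
bp = src.index(kw, src.index(":="))      # occurrence in the notation body
print("bp =", bp)
print("correct ep  =", bp + len(kw))     # span of the full keyword
print("buggy ep    =", bp + len("arg"))  # span of the identifier prefix only
```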
I don't know how to reproduce it without SerAPI
You could likely try to do it from Drop.
I think we at least figured out the overall circumstances under which this problem occurs, and we can work around it. Specifically, it happens when a notation or tactic notation defines a "token" that has a prefix following the "identifier" conventions (e.g., consisting only of letters). Only the location of this prefix is reported through the token stream.
Another example:
Tactic Notation "near=>" ident(x) := intros x.
Now, if we tokenize near=> x, the location of the keyword near=> is reported as the location of near alone, i.e., the = character terminated the "identifier" pseudo-token.
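A toy model of this suspected behavior can be sketched as follows. This is NOT Coq's actual lexer, just a minimal Python sketch under the assumption stated above: the end position is recorded after scanning the identifier-like prefix and is not updated when the match extends to the full keyword.

```python
import re

# Toy model (NOT Coq's actual lexer) of the suspected behavior: the lexer
# scans an identifier-like prefix, records its end position, and then
# extends the match to a full keyword without updating that position.
KEYWORDS = {"arg[", "near=>"}

def buggy_token_span(src: str, bp: int):
    """Return (token, bp, ep) where ep covers only the identifier prefix."""
    m = re.match(r"[A-Za-z]+", src[bp:])
    prefix_end = bp + m.end()            # end position recorded here...
    for kw in KEYWORDS:
        if src.startswith(kw, bp):       # ...but the token keeps growing
            return kw, bp, prefix_end    # loc reflects only the prefix
    return m.group(), bp, prefix_end

print(buggy_token_span("arg[ord]", 0))   # ('arg[', 0, 3): 4-char token, span 3
print(buggy_token_span("near=> x", 0))   # ('near=>', 0, 4): span covers "near" only
```

Both reported cases fall out of the same pattern: the span stops where the letter-only prefix stops.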
Umm, do you have a regular self-contained simple example using just Notation
, such that we can compare the actual output with the correct output?
As you suggested in #161, it may be easiest if I create a PR adding a test case using sertok
on a .v
file that includes the expected results and demonstrates the discrepancy.
8.12 changed a few things in this area; let's reevaluate once the 0.12 branch is ready.
@maximedenes should I open a Coq issue about this (even though I don't know how to reproduce it without SerAPI)?
If a bug is confirmed on the Coq side, feel free to report it even if the reproduction is a bit involved (e.g., it requires SerAPI).