coq2html coq2html does not like Unicode

coq2html gets confused when unicode characters are used in identifiers. Here is a small example

Inductive ωtruth :=
| ωtrue : ωtruth
| ωfalse : ωtruth
| otrue : ωtruth
| ofalse : ωtruth.
Definition otruth := ωtruth.

Reserved Notation  "'¬' t" (at level 50).
Definition ωneg (ω:ωtruth) : ωtruth :=
  match ω with
  | ωtrue => ωfalse
  | ωfalse => ωtrue
  | otrue => ofalse
  | ofalse => otrue
  end.
Notation  "'¬' x" := (ωneg x).
Definition oneg := ωneg.

Lemma ωdoubleneg :
  forall (ω:ωtruth), ¬(¬ω) = ω.
Proof.
  destruct ω; reflexivity.
Qed.

Lemma odoubleneg :
  forall (o:otruth), oneg (oneg o) = o.
Proof.
  destruct o; reflexivity.
Qed.

The o-prefixed identifiers are handled properly but not the ω-prefixed ones.

Mar 05 '18 17:03 jeromesimeon

I'm afraid no attention to Unicode was paid so far in the implementation of coq2html! Indeed, the regexp for identifiers https://github.com/xavierleroy/coq2html/blob/d68b757b5bcc51a17e57c8301e862640db521e7f/coq2html.mll#L274 is ASCII-only. I need to look at how Coq itself handles Unicode in identifiers.

Mar 05 '18 18:03 xavierleroy

I don't recall how ocamllex does with unicode. I had a very good experience with Alain Frish's sedlex https://github.com/alainfrisch/sedlex

Mar 05 '18 18:03 jeromesimeon

ocamllex doesn't know anything about Unicode but it might be able to do something with the UTF-8 encoding of Unicode. I, too, was thinking of ulex and its successor sedlex.

Mar 05 '18 18:03 xavierleroy