crash with Fatal error: exception Zed_string.Invalid
I use utop in combination with QCheck - for randomized property-based testing.
However, sometimes when I inspect or illustrate QCheck's ability to produce random strings I encounter a crash.
To reproduce:
$ utop
───────────────────────────────┬─────────────────────────────────────────────────────────────┬───────────────────────────────
│ Welcome to utop version 2.4.3 (using OCaml version 4.09.0)! │
└─────────────────────────────────────────────────────────────┘
Findlib has been successfully loaded. Additional directives:
#require "package";; to load a package
#list;; to list the available packages
#camlp4o;; to load camlp4 (standard syntax)
#camlp4r;; to load camlp4 (revised syntax)
#predicates "p,q,...";; to set these predicates
Topfind.reset();; to force that packages will be reloaded
#thread;; to enable threads
Type #utop_help for help about using utop.
─( 13:18:23 )─< command 0 >───────────────────────────────────────────────────────────────────────────────────{ counter: 0 }─
utop # #require "qcheck";;
─( 13:18:23 )─< command 1 >───────────────────────────────────────────────────────────────────────────────────{ counter: 0 }─
utop # open QCheck;;
─( 13:18:30 )─< command 2 >───────────────────────────────────────────────────────────────────────────────────{ counter: 0 }─
utop # Random.init 12; Gen.generate1 ~rand:(Random.get_state ()) Gen.small_string;;
Fatal error: exception Zed_string.Invalid("at position 157: individual combining marks encountered", "- : string =\n\"\195\134x`Ihq\196\178$&\\031\\015\\026k*\\031\194\185\\t\194\179x~jm\195\136i\195\185\195\175\195\166\195\137/\195\156\\025(\194\168\\\\\194\146\195\160^\195\181\194\131[\195\155\195\128\195\161\194\186(5\\021\194\175\202\167\195\180\195\148\195\158\195\1551Y_/E\194\164\194\163\194\191`\195\160h\194\166\195\174\195\153#\194\144\195\188\195\164rG\\026\194\160\194\173\195\152(\194\176\194\158\214\149\\029\194\168G\194\150\\030\194\141\"\n")
Note how I explicitly set the random seed to 12 to attempt something reproducable.
On my machine (MacBook Pro, 64-bit, 3,1 GHz Intel Core i5, w/MacPorts), this consistently crashes utop.
Thus if this particular seed does not crash your machine, try another.
I just started from 1 myself, before I hit the number 12.
I've then managed to reduce the crashing string into the following minimal one:
utop # "\158\214\149";;
Fatal error: exception Zed_string.Invalid("at position 16: individual combining marks encountered", "- : string = \"\194\158\214\149\"\n")
Since I'm not familiar with utop's internals, I cannot tell if this may be a bug in Zed.
To confirm that this remains an issue, I can reproduce it on utop v.2.6.0:
$ utop
───────────────────────────────┬─────────────────────────────────────────────────────────────┬───────────────────────────────
│ Welcome to utop version 2.6.0 (using OCaml version 4.09.0)! │
└─────────────────────────────────────────────────────────────┘
Findlib has been successfully loaded. Additional directives:
#require "package";; to load a package
#list;; to list the available packages
#camlp4o;; to load camlp4 (standard syntax)
#camlp4r;; to load camlp4 (revised syntax)
#predicates "p,q,...";; to set these predicates
Topfind.reset();; to force that packages will be reloaded
#thread;; to enable threads
Type #utop_help for help about using utop.
─( 14:23:08 )─< command 0 >───────────────────────────────────────────────────────────────────────────────────{ counter: 0 }─
utop # "\158\214\149";;
Fatal error: exception Zed_string.Invalid("at position 16: individual combining marks encountered", "- : string = \"\194\158\214\149\"\n")
So I tried with a backtrace:
$ env OCAMLRUNPARAM=b utop
───────────────────────────────┬─────────────────────────────────────────────────────────────┬───────────────────────────────
│ Welcome to utop version 2.6.0 (using OCaml version 4.09.0)! │
└─────────────────────────────────────────────────────────────┘
(* ... *)
utop # "\158\214\149";;
Fatal error: exception Zed_string.Invalid("at position 16: individual combining marks encountered", "- : string = \"\194\158\214\149\"\n")
Raised at file "src/zed_string.ml", line 23, characters 29-88
Called from file "src/zed_string.ml", line 211, characters 14-30
Called from file "src/zed_string.ml", line 295, characters 23-43
Called from file "src/lTerm_text_impl.ml", line 24, characters 65-96
Called from file "src/lib/uTop_main.ml", line 301, characters 17-42
Called from file "src/lib/uTop_main.ml", line 810, characters 30-61
Re-raised at file "parsing/location.ml", line 898, characters 22-25
Called from file "src/lib/uTop.ml", line 114, characters 2-11
Called from file "src/lib/uTop_main.ml", line 816, characters 21-61
Called from file "src/lib/uTop_main.ml", line 1555, characters 8-17
Called from file "src/lib/uTop_main.ml", line 1570, characters 4-25
It appears this conversion https://github.com/ocaml-community/utop/blob/master/src/lib/uTop_main.ml#L301
helps assemble escape codes into "prettier utf8 glyphs" (if I understand correctly). The lTerm call
https://github.com/ocaml-community/lambda-term/blob/master/src/lTerm_text_impl.ml#L30
itself calls Zed_string.unsafe_of_utf8.
Since a conversion failure is possible (we are calling an unsafe function), I guess a reasonable fix would be to keep the "unstyled string" in this case. I don't know if a "partial styling" is possible.
I understand that utop is maintained by the community.
However I would appreciate if someone more familiar with the internals could briefly (a) confirm that I'm on the right track (and ideally: the bug), or (b) suggest an alternative approach for a fix.
Hi,
Yes, I agree that it's a bug and that printing a string should never result in an exception.
The core issue seems to be that ocaml strings are just a sequence of bytes, and many operations that utop does only work strings that are valid utf8 encodings. That's why there are of_string and of_utf8 functions.
When rendering an output phrase, we only have a string, and we don't know if it's valid utf8. So it seems that there's a sort of preprocessing phase:
https://github.com/ocaml-community/utop/blob/7bc5117d3449fc19f5c706a6decfdb2a30984507/src/lib/uTop_main.ml#L300-L301
My understanding is that fix_string is expected to return a valid utf8 string, but in some cases (found by your fuzzer), it does not work. So, it looks like patching fix_string could be a solution too.
I just noticed that in the case of long strings, it just does not attempt to do any styling:
https://github.com/ocaml-community/utop/blob/7bc5117d3449fc19f5c706a6decfdb2a30984507/src/lib/uTop_main.ml#L297-L298
This can be a reasonable fallback in case of an exception if there's no easy way to alter fix_string's behaviour.
OK, I've now taken a look at this.
I've managed to print the (invalid uft-8) string which ends up as question marks in the terminal.
However I have not been able to do so while maintaining utop's coloring of the output ("orange for strings").
This is not exactly optimal, as printing"\158\214\149" with any of the three characters removed brings back color...
My best attempt is to wrap the relevant part of render_out_phrase in an exception handler as follows:
let render_out_phrase term string =
if String.length string >= 100 * 1024 then
LTerm.fprint term string
else
try
begin
let string = fix_string string in
let styled = LTerm_text.of_utf8 string in
let stylise loc token_style =
for i = loc.idx1 to loc.idx2 - 1 do
let ch, style = styled.(i) in
styled.(i) <- (ch, LTerm_style.merge token_style style)
done
in
UTop_styles.stylise stylise (UTop_lexer.lex_string string);
LTerm.fprints term styled
end
with (Zed_string.Invalid _) ->
(Printf.printf "got you!\n%!";
LTerm.fprint term (fix_string string))
I think I'm bitten by this design decision of LambdaTerm:
"All these functions accept only valid UTF-8 strings (or unicode styled text)" https://ocaml-community.github.io/lambda-term/3.1.0/lambda-term/LTerm/index.html#printing
This means a LTerm.fprints-printer accepting a LTerm_text.t will end up having to convert the string first - which rethrows the exception 🤔
Further input is welcome!
I also just hit this issue, on a longer input (~350 bytes) but with the same error:
exception Zed_string.Invalid("at position 331: individual combining marks encountered", "val contents : string =\n \"y\.....