cakeml icon indicating copy to clipboard operation
cakeml copied to clipboard

Implement lexer on mlstrings

Open xrchz opened this issue 7 years ago • 3 comments

The CakeML lexer operates on a char list. Therefore the first thing the CakeML compiler does after reading in its input string is explode the string into a list. This is wasteful: I think the lexer can easily be made to work on mlstrings.

This issue is to make the lexer operate on mlstrings and update the compiler accordingly to avoid the explode.

Extension thoughts:

  • The sexp-parsing option would still need to explode the input because the sexp parser operates on char lists in a possibly more unavoidable way. (This issue does not require avoiding the explode for sexps.)
  • For large input, we might want to be even more sophisticated and lex/parse/typecheck incrementally, to catch errors before reading in all the source code. (Obviously this issue doesn't require doing that.)

xrchz avatar Oct 26 '17 05:10 xrchz

Maybe #666 will help?

xrchz avatar Jul 15 '19 17:07 xrchz

Currently, my plan with #666 is to do the following w.r.t. the HOL string type in the lexer:

https://github.com/CakeML/cakeml/blob/834e7c4e926d453bea905223f8affad2e460d1ba/compiler/bootstrap/translation/lexerProgScript.sml#L14-L19

Thus this does not avoid the explode on input to the lexer. I agree that the explode is untidy.

Ideally, we would phrase the lexer as a function that is handed to a fold-function over a finite character stream. The stateful implementation of the fold would then read characters lazily and the implementation would avoid producing the giant string containing all of the input. I guess that's what #645 is aiming for.

myreen avatar Jul 16 '19 11:07 myreen

You could define a type 'a finiteUnfolder, that was a subset of the type

    'a -> (char # 'a) option

such that LFINITE (LUNFOLD f s) held for all s.

I.e., the wellformedness predicate would be

   wfUF (f : 'a -> (char # 'a) option) <=> !a. LFINITE (LUNFOLD f a)

It'd be easy to show that a "string reader" which used the index as the state was such a thing. The finiteness guarantee in the type would mean that the lexer could be parameterised by an arbitrary unfolder. The question is still how you'd guarantee that an input file being read through TextIO would necessarily be finite. (The completely ad hoc approach of cutting off past a limit of 100 MB would probably be fine: i.e., you could prove something like wfUF (limiter n f) where n is an arbitrary number and limiter is effectively just LTAKE.)

mn200 avatar Jul 24 '19 01:07 mn200