dhallj
dhallj copied to clipboard
Parser is not stack-safe for deep records
All operations currently work just fine on arbitrarily long lists:
scala> import org.dhallj.core.Expr
import org.dhallj.core.Expr
scala> import org.dhallj.parser.DhallParser
import org.dhallj.parser.DhallParser
scala> import org.dhallj.core.converters.JsonConverter
import org.dhallj.core.converters.JsonConverter
scala> val longList = DhallParser.parse((0 to 100000).mkString("[", ",", "]"))
longList: org.dhallj.core.Expr.Parsed = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...
scala> longList.normalize
res0: org.dhallj.core.Expr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
scala> longList.typeCheck
res1: org.dhallj.core.Expr = List Natural
scala> JsonConverter.toCompactString(longList)
res2: String = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,...
…and most things work just fine on arbitrarily deep record literals (or record types, or union types):
scala> val deepRecord = (0 to 10000).foldLeft(longList: Expr) {
| case (expr, _) => Expr.makeRecordLiteral("foo", expr)
| }
deepRecord: org.dhallj.core.Expr = {foo = {foo = {foo = {foo = {foo = {foo = ...
scala> deepRecord.normalize.alphaNormalize.hash
res3: String = f41d556f987dd60c59e9b49a367b0bf907dba111c904c88dfda27e2a599a07bc
scala> JsonConverter.toCompactString(deepRecord)
res4: String = {"foo":{"foo":{"foo":{"foo":{"foo":{"foo":{"foo":{"foo":{"foo": ...
Note that dhall produces the same hash for this expression:
$ dhall hash < deep-record.dhall
sha256:f41d556f987dd60c59e9b49a367b0bf907dba111c904c88dfda27e2a599a07bc
Unfortunately the parser can't handle this expression:
scala> DhallParser.parse(deepRecord.toString)
java.lang.StackOverflowError
at org.dhallj.parser.support.JavaCCParser.jj_ntk_f(JavaCCParser.java:4508)
at org.dhallj.parser.support.JavaCCParser.BASE_EXPRESSION(JavaCCParser.java:2425)
at org.dhallj.parser.support.JavaCCParser.RECORD_LITERAL_ENTRY(JavaCCParser.java:390)
at org.dhallj.parser.support.JavaCCParser.RECORD_LITERAL_OR_TYPE(JavaCCParser.java:666)
at org.dhallj.parser.support.JavaCCParser.PRIMITIVE_EXPRESSION(JavaCCParser.java:874)
at org.dhallj.parser.support.JavaCCParser.SELECTOR_EXPRESSION(JavaCCParser.java:1011)
at org.dhallj.parser.support.JavaCCParser.COMPLETION_EXPRESSION(JavaCCParser.java:1080)
...
I think this should just be a matter of doing some more left-factoring, but I'm fairly new to JavaCC and I don't really know how much work this will be, so I've decided not to let this issue block the 0.1.0 release.
@travisbrown: The Haskell implementation had the exact same issue. See this thread starting here: https://github.com/dhall-lang/dhall-haskell/issues/108#issuecomment-425146712.
That issue has an even smaller reproduction which is just ((((((((Bool))))))))
Basically, your intuition is right that the grammar needs to be left-factored. Specifically, the problematic grammar rule is the one for function types:
https://github.com/dhall-lang/dhall-lang/blob/371a43a9aa4bcc0adc91cfd6c9c281a837dace45/standard/dhall.abnf#L758
To see the issue, let's imagine that we're parsing the following simple expression:
Natural
Until the parse hits the end of the file, the parser isn't sure whether or not it is in the middle of parsing a function's input type. For example, the parser doesn't know that the Natural won't be followed by -> Bool or something similar.
Now, imagine that you parse the following expression:
(Natural)
... and now suppose that we temporarily halt the parser after it's done parsing up to here:
(Natural…
^
At this point in the expression the parser has to entertain 4 possible branches:
-
The parser is parsing the input of the input of a function type
e.g.
(Natural -> Bool) -> Bool -
The parser is parsing a function type where the entire function type is parenthesized
e.g.
(Natural -> Bool) -
The parser is parsing a function type where the function's input type is parenthesized
e.g.
(Natural) -> Bool -
The parser is parsing a non-function type that is parenthesized
e.g.
(Natural)
More generally, you essentially double the number of possible parses every time you nest things. This leads to a slowdown that is exponential in the nesting level.
The solution is to left-factor the top-level parser so that the parser branches after parsing an expression that might be a function's input type instead of ebfore. I don't know exactly what the Java analog of this, but if you're familiar with Haskell it would mean changing a parser like this:
(parseOperatorExpression *> arrow *> parseExpression) <|> parseOperatorExpression
... to instead be left-factored like this:
parseOperatorExpression *> optional (arrow *> parseExpression)
I've oversimplified things, but that is the basic idea.
Note that the recent precedence change of with (https://github.com/dhall-lang/dhall-lang/pull/954) also has that problem