dhallj icon indicating copy to clipboard operation
dhallj copied to clipboard

Parser is not stack-safe for deep records

Open travisbrown opened this issue 5 years ago • 2 comments

All operations currently work just fine on arbitrarily long lists:

scala> import org.dhallj.core.Expr
import org.dhallj.core.Expr

scala> import org.dhallj.parser.DhallParser
import org.dhallj.parser.DhallParser

scala> import org.dhallj.core.converters.JsonConverter
import org.dhallj.core.converters.JsonConverter

scala> val longList = DhallParser.parse((0 to 100000).mkString("[", ",", "]"))
longList: org.dhallj.core.Expr.Parsed = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...

scala> longList.normalize
res0: org.dhallj.core.Expr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...

scala> longList.typeCheck
res1: org.dhallj.core.Expr = List Natural

scala> JsonConverter.toCompactString(longList)
res2: String = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,...

…and most things work just fine on arbitrarily deep record literals (or record types, or union types):

scala> val deepRecord = (0 to 10000).foldLeft(longList: Expr) { 
     |   case (expr, _) => Expr.makeRecordLiteral("foo", expr)
     | }
deepRecord: org.dhallj.core.Expr = {foo = {foo = {foo = {foo = {foo = {foo = ...

scala> deepRecord.normalize.alphaNormalize.hash
res3: String = f41d556f987dd60c59e9b49a367b0bf907dba111c904c88dfda27e2a599a07bc

scala> JsonConverter.toCompactString(deepRecord)
res4: String = {"foo":{"foo":{"foo":{"foo":{"foo":{"foo":{"foo":{"foo":{"foo": ...

Note that dhall produces the same hash for this expression:

$ dhall hash < deep-record.dhall
sha256:f41d556f987dd60c59e9b49a367b0bf907dba111c904c88dfda27e2a599a07bc

Unfortunately the parser can't handle this expression:

scala> DhallParser.parse(deepRecord.toString)
java.lang.StackOverflowError
  at org.dhallj.parser.support.JavaCCParser.jj_ntk_f(JavaCCParser.java:4508)
  at org.dhallj.parser.support.JavaCCParser.BASE_EXPRESSION(JavaCCParser.java:2425)
  at org.dhallj.parser.support.JavaCCParser.RECORD_LITERAL_ENTRY(JavaCCParser.java:390)
  at org.dhallj.parser.support.JavaCCParser.RECORD_LITERAL_OR_TYPE(JavaCCParser.java:666)
  at org.dhallj.parser.support.JavaCCParser.PRIMITIVE_EXPRESSION(JavaCCParser.java:874)
  at org.dhallj.parser.support.JavaCCParser.SELECTOR_EXPRESSION(JavaCCParser.java:1011)
  at org.dhallj.parser.support.JavaCCParser.COMPLETION_EXPRESSION(JavaCCParser.java:1080)
  ...

I think this should just be a matter of doing some more left-factoring, but I'm fairly new to JavaCC and I don't really know how much work this will be, so I've decided not to let this issue block the 0.1.0 release.

travisbrown avatar Apr 12 '20 10:04 travisbrown

@travisbrown: The Haskell implementation had the exact same issue. See this thread starting here: https://github.com/dhall-lang/dhall-haskell/issues/108#issuecomment-425146712.

That issue has an even smaller reproduction which is just ((((((((Bool))))))))

Basically, your intuition is right that the grammar needs to be left-factored. Specifically, the problematic grammar rule is the one for function types:

https://github.com/dhall-lang/dhall-lang/blob/371a43a9aa4bcc0adc91cfd6c9c281a837dace45/standard/dhall.abnf#L758

To see the issue, let's imagine that we're parsing the following simple expression:

Natural

Until the parse hits the end of the file, the parser isn't sure whether or not it is in the middle of parsing a function's input type. For example, the parser doesn't know that the Natural won't be followed by -> Bool or something similar.

Now, imagine that you parse the following expression:

(Natural)

... and now suppose that we temporarily halt the parser after it's done parsing up to here:

(Natural…
        ^

At this point in the expression the parser has to entertain 4 possible branches:

  • The parser is parsing the input of the input of a function type

    e.g. (Natural -> Bool) -> Bool

  • The parser is parsing a function type where the entire function type is parenthesized

    e.g. (Natural -> Bool)

  • The parser is parsing a function type where the function's input type is parenthesized

    e.g. (Natural) -> Bool

  • The parser is parsing a non-function type that is parenthesized

    e.g. (Natural)

More generally, you essentially double the number of possible parses every time you nest things. This leads to a slowdown that is exponential in the nesting level.

The solution is to left-factor the top-level parser so that the parser branches after parsing an expression that might be a function's input type instead of ebfore. I don't know exactly what the Java analog of this, but if you're familiar with Haskell it would mean changing a parser like this:

(parseOperatorExpression *> arrow *> parseExpression) <|> parseOperatorExpression

... to instead be left-factored like this:

parseOperatorExpression *> optional (arrow *> parseExpression)

I've oversimplified things, but that is the basic idea.

Gabriella439 avatar Apr 14 '20 02:04 Gabriella439

Note that the recent precedence change of with (https://github.com/dhall-lang/dhall-lang/pull/954) also has that problem

Nadrieril avatar Apr 14 '20 14:04 Nadrieril