roc Parser indentation issues

While diving into #3445, I discovered and started to understand just how strange indentation currently is in the roc parser.

Briefly, there are two intertwined concepts thru-out the code:

min_indent
start_column

These are not handled super consistently, sometimes change roles, and generally lead to unexpected behavior of roc indentation.

start_column generally means "what column was the State at when started parsing?"

min_indent generally starts at 0 at the top level, and as we parse each child of a given node, we usually either:

(a) initialize min_indent=0 at the root
(b) pass the current min_indent unchanged
(c) pass min_indent + 1, indicating that things should be indented further
(d) pass start_column

There are only a few places where we check min_indent, and as far as I know there's nowhere that we check start_column directly.

In those places that we do check min_indent, mostly in blankspace.rs when looking at indents, and we do this by comparing it to state.col() - i.e. the current column.

Thinking abstractly about chains of parent -> child -> grandchild (etc) nodes we see while parsing - e.g. root -> def -> expr -> list -> expr -> list, in the case of something like:

foo = [ [ ] ]
         ^ parser state here

And, along that chain, we will have passed min_indent / start_column down at each stage using the rules above, ending up with this chain of computation (for example)

(set min_indent=0) -> (pass min_indent+1) -> (capture start_column from state) -> (pass current min_indent) -> (pass start_column)

The way this can go wonky, is where we have on or more +1s that happened along the path, and now (at the end of the chain, deep in blankspace.rs), we're checking min_indent. At that point, abstractly, min_indent will look something like start_column+1+1+1 - or 0+1+1+1, depending on whether min_indent was last initialized from 0 or from a previous state.col().

Alarm bells should already be going off, but if they're not, here's the key: while +1+1+1 may be what happened to min_indent while we were parsing, the user may have sprinkled in several newlines and multiple indents of more than 1 space each.

Abstractly speaking, the problem is that start_column is in units of columns, and the +1 is really in units of "indent levels". The two of these are generally not compatible.

Here's an example of the current way this can go wrong:

my_list = [
 0, # currently accepted in the parser! min_indent=1 here
    [
 1, # min_indent=1 here too!!!!
    ],
]

(note, those are indented by 1 space)

Because of this units mixup, things that "look" like de-dents to a user can still be accepted as indents.

To fix this, I propose that all indentation needs to be changed to have consistent units. Specifically, it all needs to be in columns, or in indent levels. "columns" would make roc feel closer to haskell/etc, and "indent levels" would make it feel closer to python/etc.

Columns

If we go with "columns", what this would look like is removing all the +1s, and replacing them with capturing the current state.col(), and requiring child nodes to all have an indent strictly greater than the start_column - not >= like the comparison currently is. This plan renames all min_indents to start_column, and removes one of those in the case that both are currently passed.

With column-based indentation, the example would need to look like this:

my_list = [
               0, # indented past the start of the list (the `[`)!
               [
                  1,
               ],
          ]

Indent levels

If we go with "indent levels", this would mean requiring that lines be indented to a multiple of (n) spaces, setting n=4 for example. Ideally we'd also verify that we never "skip" a level, i.e. the comparison actually becomes = instead of >=.

With "indent levels", the example would look like this:

my_list = [
    0, # strictly 4 spaces of indent more than the parent scope
    [
        1,
    ],
]

Do those choices make sense?

Ultimately I think "indent levels" will be familiar to more developers coming from the c/c++/rust/javascript/python/ruby side, but "columns" will be more familiar to folks with a haskell/(etc?) background.

If we go with "columns", I'd also suggest introducing an indent rule that list/collection items must all be indented the same amount - but that can be implemented separately.

Thoughts?

Jul 28 '22 02:07 joshuawarner32

I've also noticed these inconsistencies, can you make a topic on zulip about columns vs indent levels @joshuawarner32 ? This seems like an issue where we want to hear many different points of view.

Jul 29 '22 10:07 Anton-4

Some more updates:

I had posted to slack a while ago (~months) and IIRC got a loose direction of "indent levels", with the justification that it'll be more familiar to folks that come from more mainstream languages (python, but also even non-significant-whitespace languages like javascript and rust are often indented this way). Shotly after that I lost steam for a while (life! but it's all good!) - and I'm just coming back to this now.

Recently I fixed #4139 in a way that seems highly related to this discussion - in particular because I realized that what I was really describing above (columns versus indent levels) is really two ends of what is actually a spectrum - and the solution I landed on for #4139 is somewhere in the middle - and IMO basically captures the best of both worlds.

This new rule, which I'm just going to call "rule 3" since I think any self-describing name would just be confusing, is a bit of a blend of the two. The idea is to use a precise number of columns of indent - but unlike the "columns" suggestion above, we take this column number from the beginning of the line that the parent term starts on, rather than from the column of the start token of the given grammar production (that'd be the [ for lists, in the examples above - or the \ for lambdas).

The primary effect here is that users coming from either the haskell or the python world can (with a few tiny exceptions) use the indentation style they're used to and things will "just work". We can choose a consistent indent policy for Roc and have the formatter normalize everything to that (I'm rather ambivalent on what that policy is, personally).

Python-like indentation (fixed four-space indent):

"a string"
|> Str.toUtf8
|> List.map \byte ->
    byte + 1
|> List.reverse

Haskell-like indentation (indent past the beginning token of the parent):

"a string"
|> Str.toUtf8
|> List.map \byte ->
             byte + 1
|> List.reverse

One downside of this approach versus the haskell-style indentation is that, if there are multiple things on the previous line that could be continued, haskell indentation can indicate unambiguously which one is intended. With this strategy (or, indeed, the pythonic strategy), we can only choose the innermost grammar production as the thing to attach a continuation to.

As a concrete (if contrived) example:

"a string"
|> Str.toUtf8
|> List.map \byte -> when byte is 0 -> 0
                                  b -> b
              + 1
|> List.reverse

Under "rule 3", the + 1 attaches to the innermost expression it could - which gives b + 1 - rather than what's suggested by the indentation (from a haskell perspective) - which would be the whole when ... expression, giving (when ...) + 1. Keep in mind this is also true in the status quo, since there are only a few locations that the roc parser actually checks the min_indent.

This implies that, at a minimum, the formatter should indent the + 1 to make it clear where it attaches - so using haskell style indentation, at least in cases that might otherwise be ambiguous.

What I'd like to propose is that we make "rule 3" the way indentation works in Roc, without exception. Specifically there are a few place that try to apply haskell-style indentation which I'd like to relax to also be "rule 3".

These are:

when, which has two requirements: (1) branches/patterns are indented beyond the when, and (2) they're indented the same (across all branches)
expect, which has the same rule about its inner expression
"demands" like hash : a -> U64 | a has Hash, which has some rather complicated indent logic (I need to do more of a deep dive here)

Some examples of what would be allowed under "rule 3":

fib = \x -> when x is
    0 -> 1
    1 -> 1
    n -> fib(n-1) + fib(n-2)

(branches no longer need to be indented past the when)

Hash has
  hash : a 
    -> U64

(the -> no longer needs to be indented past the : - only indented relative to the line that the : is on)

And more generally - any production in the grammar would:

Check the indent level (in columns) of the first character of the line that production starts on
Require that every token (after the first token) has a column strictly greater than that indent level.

Oct 11 '22 03:10 joshuawarner32