otp icon indicating copy to clipboard operation
otp copied to clipboard

Add a new operator ^ for pinning of pattern variables

Open richcarl opened this issue 4 years ago • 57 comments

This allows you to annotate already-bound pattern variables as ^X, like in Elixir. This is less error prone when code is being refactored and moved around so that variables previously new in a pattern may become bound, or vice versa, and makes it easier for the reader to see the intent of the code. An new compiler flag 'warn_unpinned_vars' (disabled by default) provides warnings for all uses of already bound variables in patterns that have not been marked with ^.

Note that this branch is a preview, and is based on another branch (#2944) to avoid future complications. Only the last commit pertains to this PR.

richcarl avatar Dec 24 '20 19:12 richcarl

Hi @richcarl, I have a couple questions based on places where we need to handle ^ specially in Elixir:

  1. What happens when you are binding the variable for the first time but it is repeated?

     {X, X} = {ok, ok}
    

    In Elixir we don't require ^ because ^ is meant to bind to the previous value before the assignment and there is none here. This is also to avoid left-to-right semantics in pattern matching. For example, pinning only the right one would be weird.

  2. There are two places in patterns that support guard expression instead of patterns: the size expression in binaries and map keys. Are you warning and requiring variables to be bound for them? So you have to write:

     %{^X => Value} = Map,
    

    In binaries, you will only warn if you are matching on a previous value:

     Size = ...,
     <<Foo:^Size>>,
    

    But not on this:

     <<X:X>>
    

Regardless of the direction to go, you should probably have tests around those scenarios.


There is another warning we have added to Elixir which I found extremely helpful which is to warn if an underscored variable is matched or repeated in a pattern. For example, if you are matching on Erlang AST, you may write:

some_fun({match, _Anno, {var, _Anno, Var}, _}) ->

And then this doesn't work because the underscored _Anno variables need to match. This will provide similar guarantees to the pin variable but only for underscored variables - but that's something I would say can be done by default because 99% of the cases where this happens, it is accidental.

josevalim avatar Dec 25 '20 07:12 josevalim

  1. What happens when you are binding the variable for the first time but it is repeated?

     {X, X} = {ok, ok}
    

    In Elixir we don't require ^ because ^ is meant to bind to the previous value before the assignment and there is none here. This is also to avoid left-to-right semantics in pattern matching. For example, pinning only the right one would be weird.

The current semantics does not change, so the meaning is still that these are both instances of the same new variable, with the constraint that it has to have the same value in both positions. (The compiler implements this using temporary names for both fields, adds a guard to check that they are equal, and then if they match, the value is assigned to X.)

Pinning does not apply here, because that says "evaluate this thing in the environment surrounding the pattern", and there is no previous X there. There is also no specific left/right evaluation order in patterns: conceptually, both variable instances are bound in parallel given the same current environment, and then they are checked for equality.

richcarl avatar Dec 25 '20 12:12 richcarl

  1. There are two places in patterns that support guard expression instead of patterns: the size expression in binaries and map keys. Are you warning and requiring variables to be bound for them? So you have to write:
     %{^X => Value} = Map,
    

These subexpressions have the semantics of guard expressions. Their variables are already evaluated in the surrounding environment and can never be previously unbound, so pinning is never required (or even allowed). But you are right that I should add some more test cases.

richcarl avatar Dec 25 '20 12:12 richcarl

There is another warning we have added to Elixir which I found extremely helpful which is to warn if an underscored variable is matched or repeated in a pattern. For example, if you are matching on Erlang AST, you may write:

some_fun({match, _Anno, {var, _Anno, Var}, _}) ->

And then this doesn't work because the underscored _Anno variables need to match. This will provide similar guarantees to the pin variable but only for underscored variables - but that's something I would say can be done by default because 99% of the cases where this happens, it is accidental.

I think this is a good idea (the AST example has bitten me once or twice), but that's for another PR.

richcarl avatar Dec 25 '20 12:12 richcarl

A new EEP 55 was added to consider inclusion of this change: https://github.com/erlang/eep/blob/master/eeps/eep-0055.md

alavrik avatar Jan 11 '21 14:01 alavrik

Doesn't the purpose of this get a bit broken when the operator is optional? Would it not match that fact better if unbound variables can be annotated to ask the compiler to tell you if it becomes an unintentional match instead?

attah avatar Jan 12 '21 15:01 attah

Doesn't the purpose of this get a bit broken when the operator is optional?

No, that's the way it needs to be introduced, to allow people with existing codebases to start using the new feature and preparing their code, but not breaking their builds with new warnings unless they enable the flag. In a subsequent major release, the warnings can be made on by default and off by option instead.

Would it not match that fact better if unbound variables can be annotated to ask the compiler to tell you if it becomes an unintentional match instead?

No, unbound variables is by far the most common case in patterns, being how you introduce new variable bindings in Erlang. You don't want to annotate each and every one of those.

richcarl avatar Jan 14 '21 12:01 richcarl

As others have said: for Elixir this operator is essential, since they
rebind variables without it.

For Erlang, if using a pinning operator had been required from the start;
I think that would have been a bit better than the current "match
if already bound". It is hard to be sure by looking at the code
if the variable is already bound - you have to make a machine search.

Introducing a pinning operator now is trickier...

Having a compiler option to choose if pinning is allowed/required makes it
hard to know what to expect from the code. The compiler option is set in
some Makefile far away from the source code.

I think I would prefer that instead there should be a compiler pragma
(I wish it would not be allowed from an include file but that is probably
impossible to enforce) so it is visible in the current module what to
expect about operator pinning. Without the pragma the pinning operator is
not allowed, with it pinning is mandatory; not a warning - an error if
a pinning operator is missing.

You get the idea: it should be possible from the source code how to read
it, at least on the module level.

How to take the next step i.e when code not using pinning is the exception,
to remove the compiler pragma, I have not thought about yet...

RaimoNiskanen avatar Jan 14 '21 12:01 RaimoNiskanen

No, that's the way it needs to be introduced, to allow people with existing codebases to start using the new feature and preparing their code, but not breaking their builds with new warnings unless they enable the flag. In a subsequent major release, the warnings can be made on by default and off by option instead.

I would absolutely hate that stated end state. I don't necessarily have a cogent argument there, if only that the base pattern matching semantics behind Erlang are one of the things I found the most useful and impressive about the language and getting away from there is something I can't find myself agreeing with. It's the kind of thing other languages do and then I'm forced to use them and think "it's much nicer in Erlang."

ferd avatar Jan 14 '21 13:01 ferd

stated end state

And I think this is the one of the most important aspect to discuss. If this is to be added but there is no agreement on the end state, then I would say it is best not to add it in the first place, rather than having two ways of writing the same code based on a compiler flag (or a pragma).

josevalim avatar Jan 14 '21 13:01 josevalim

I don't quite see what it is that you're against, though, Fred. On average, you'll only find a couple of instances of already-bound variables in any module (and the relative rarety is what makes it so much harder to spot when it actually is an important detail of the algorithm). You'd only need to mark these up as being intentional, and everything else stays exactly the same.

    /Richard

Den tors 14 jan. 2021 kl 14:23 skrev Fred Hebert [email protected]:

No, that's the way it needs to be introduced, to allow people with existing codebases to start using the new feature and preparing their code, but not breaking their builds with new warnings unless they enable the flag. In a subsequent major release, the warnings can be made on by default and off by option instead.

I would absolutely hate that stated end state. I don't necessarily have a cogent argument there, if only that the base pattern matching semantics behind Erlang are one of the things I found the most useful and impressive about the language and getting away from there is something I can't find myself agreeing with. It's the kind of thing other languages do and then I'm forced to use them and think "it's much nicer in Erlang."

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/erlang/otp/pull/2951#issuecomment-760194018, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABPAAOGLKN2GWWMRTHEIDTSZ3V5RANCNFSM4VIPVJ2A .

richcarl avatar Jan 14 '21 13:01 richcarl

I don't quite see what it is that you're against, though, Fred. On average, you'll only find a couple of instances of already-bound variables in any module (and the relative rarety is what makes it so much harder to spot when it actually is an important detail of the algorithm).

Not true in tests. And tests are often included in the same module as the code when using eunit so you can't just disable this for tests.

essen avatar Jan 14 '21 13:01 essen

I see how a test suite with a large number of lines with matches containing one or more already bound variables could be painful to update by hand, but on the whole I think even this kind of code would benefit from being explicit about when you're referring to a known value. It's a pity if a test stops working because you renamed a variable further up and the match-as-assert now simply passes quietly.

Also for a person reading the test suite, it is hard to see what is actually being tested, until they have parsed the code enough to figure out which variables are already bound.

    /Richard

Den tors 14 jan. 2021 kl 14:56 skrev Loïc Hoguin [email protected]:

I don't quite see what it is that you're against, though, Fred. On average, you'll only find a couple of instances of already-bound variables in any module (and the relative rarety is what makes it so much harder to spot when it actually is an important detail of the algorithm).

Not true in tests. And tests are often included in the same module as the code when using eunit so you can't just disable this for tests.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/erlang/otp/pull/2951#issuecomment-760212305, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABPAAIRZN3WMBCE52V2MGTSZ3Z2FANCNFSM4VIPVJ2A .

richcarl avatar Jan 14 '21 14:01 richcarl

I don't quite see what it is that you're against, though, Fred. On average, you'll only find a couple of instances of already-bound variables in any module (and the relative rarety is what makes it so much harder to spot when it actually is an important detail of the algorithm). You'd only need to mark these up as being intentional, and everything else stays exactly the same.

I generally like the power of Prolog-style unification and feel that moving towards it would be more desirable than moving away from it. I'd rather have to explicitly mark cases where you want to re-bind and be told "look, this one isn't a match", than cases where you don't want to rebind, and therefore to me this EEP moves in a direction opposite to what I'd prefer.

This is a stance that I take in a vacuum about my own preferences, where I don't have to care for other people's opinions and how confused they might feel by the unfamiliar semantics. Their views are valid and I can understand them, but I won't spend the energy to defend them here. If everything had to be in line with what felt the most welcoming and easy, we'd all be stuck using Go right now and mostly always building from a foundation of familiarity that wouldn't have given rise to Erlang itself; on the other hand without that perspective, there's probably gonna be a more limited community and adoption wouldn't pick up at all, and the sustainability would be in jeopardy.

It's a difficult balance to strike, and I don't know where the proper tradeoff is to be done. So amidst that uncertainty, I'm just voicing my own preferences.

ferd avatar Jan 14 '21 14:01 ferd

Den tors 14 jan. 2021 kl 15:28 skrev Fred Hebert <[email protected]

:

I generally like the power of Prolog-style unification and feel that moving towards it would be more desirable than moving away from it. I'd rather have to explicitly mark cases where you want to re-bind and be told "look, this one isn't a match", than cases where you don't want to rebind, and therefore to me this EEP moves in a direction opposite to what I'd prefer.

Sounds like you're focusing on the effects in fun heads and list comprehensions, where ^-annotation would add possibilities which don't exist today. But this is far from the main point of the suggestion, and could even be dropped (but it would be an assymmetry if ^ was not allowed in those patterns as well). But for the common case of patterns in functions, =, and case/if/receive/try, there is no shadowing involved, and yet that is where ^-annotations are most important. Without annotations, you cannot read and understand a pattern until you have parsed enough of the context to know which variables are already bound - is it an assertion or just a binding? And the status of this can change without warning if someone does a simple renaming or adds what they thought was a new variable further up. When you're looking at a dozen pull requests every day, this sort of boobytrap is not fun to have around.

Automatic matching on already bound variables can make short pieces of code quite elegant, but I have come to see this sort of thing as a cuteness that a language for large systems with many developers cannot afford - and I think it's also one of the little unnecessary obstacles that can turn off beginners who are trying to make sense of the examples they stumble over: "oh, you should simply have noted that X was bound up there, better luck next time".

    /Richard

richcarl avatar Jan 14 '21 15:01 richcarl

Automatic matching on already bound variables can make short pieces of code quite elegant, but I have come to see this sort of thing as a cuteness that a language for large systems with many developers cannot afford - and I think it's also one of the little unnecessary obstacles that can turn off beginners who are trying to make sense of the examples they stumble over: "oh, you should simply have noted that X was bound up there, better luck next time".

Absolutely agree here. This happens all the time in substantial code-bases. I even know of developers that have started favouring guards (ugh) for matching on previously bound values to avoid confusion of whether a value in a match is bound or not or if it should be or not. This way it match can be made explicit which is really valuable.

kjnilsson avatar Jan 14 '21 16:01 kjnilsson

Sounds like you're focusing on the effects in fun heads and list comprehensions, where ^-annotation would add possibilities which don't exist today. But this is far from the main point of the suggestion, and could even be dropped (but it would be an assymmetry if ^ was not allowed in those patterns as well). But for the common case of patterns in functions, =, and case/if/receive/try, there is no shadowing involved, and yet that is where ^-annotations are most important. Without annotations, you cannot read and understand a pattern until you have parsed enough of the context to know which variables are already bound - is it an assertion or just a binding?

I don't really find too many of these "but what if you are reading code out of context" situations realistic. The first thing I always do in a code base is trying to figure out context because it drives everything. Performance optimizations are premature unless contextualized, fixes may break things at a distance unless they're contextualized, and naming of variables can be confusing unless they're contextualized in the domain they steep.

I never gave a crap about the arguments about "I'm refactoring and swapping lines and the , ; . stuff makes it hard to do" precisely because a change in return value never happens in isolation and always has to consider the caller to work. I won't give a crap about the argument of whether a thing is a rebinding match or not because whether it is a rebinding match or not will still need to be placed in its current context to know if a thing is useful or not, and if it's systematic I'm just asked to always provide markers for that context and intent that would be implicit otherwise.

Nothing we ever do here will remove the need for me to understand context. It will ask me to always declare my intent which arguably can be beneficial at times, but in my 10 years of Erlang I can remember 1 bug where this was very significant in a rare edge case situation (one that was too deeply nested and tricky to trigger for it to have been properly tested).

It's very possible other people feel this is very significant and important, but I personally don't.

And the status of this can change without warning if someone does a simple renaming or adds what they thought was a new variable further up. When you're looking at a dozen pull requests every day, this sort of boobytrap is not fun to have around. Automatic matching on already bound variables can make short pieces of code quite elegant, but I have come to see this sort of thing as a cuteness that a language for large systems with many developers cannot afford - and I think it's also one of the little unnecessary obstacles that can turn off beginners who are trying to make sense of the examples they stumble over: "oh, you should simply have noted that X was bound up there, better luck next time".

I haven't found that to be a significant case personally. Again, I'm not denying that experience in other people, but I'm speaking from my own position of comfort from this process. I likely have internalized the pitfall enough that I never mind it anymore.

Either way, the better systematic fix here if we really want to address the unexpected surprises would be to make these conditional expressions with matches have their own scope. I have found that almost everyone I introduced to the language just assumed that the scope of a new conditional was a new scope the way they would be in C-like languages, and that the problem with the pattern match was one of being surprised of that scope.

Arguably, introducing per-expression scopes would remove the very rare "I'm exporting the variable from all clauses so I can use it outside of the expression" case, bring in the shadowing warnings of funs and list comprehensions to case/if/receive/try, and make the use of a ^ pinning marker extremely obvious in intent with warnings otherwise. It would remove any need to specify stuff unless there is an actual risk of confusion (because the scoping rules tell you so), and require the least amount of work to get much clearer semantics. I've got this gut feeling that this current EEP hopes to get to that final result without actually fixing the scoping issues that actually underpin the perceived problem, and they'll be making me do work for the compiler that language semantics could instead address better than by having me toggle an operator every time.

ferd avatar Jan 14 '21 16:01 ferd

I no kidding have had a lot more cases where the problem was that one of the thing I matched in the clause of a case expression is something I wanted to use externally and had to rename variables inside the case to not clash with usage after the case than when it happens beforehand, and per-expression scope there would fix that on top of the perceived problem behind this EEP. It would be a much more general solution.

Fixing the scoping creates a need for the operator. It feels backwards to create the operator to address the needs you'd have if you had the actual scoping in place, and doing one without the other feels misguided.

ferd avatar Jan 14 '21 16:01 ferd

I suspect people who find ^ most useful are also those who have to deal with larger function clauses. I do not see this being useful for my own code bases, but this is most likely very appealing for other code bases.

essen avatar Jan 14 '21 16:01 essen

@ferd wrote:

the problem was that one of the thing I matched in the clause of a case expression is something I wanted to use externally and had to rename variables inside the case to not clash with usage after the case than when it happens beforehand,

I also encounter this every now and then, which raises a question: If you have Y partially matched in a case (not from all clauses), after the case end using ^Y would fail compilation, but a Y could be seen as either a) a deliberately new variable with no compile time problem, or, b) as both using a partially matched variable and missing the pinning operator.

I think a) could work since if you one day would match Y in all clauses you would get an error for missing the pinning operator. Ugly, though, since to utilize this you would have to conciously make sure Y is not matched in all clauses, so b) is probably the safe choice...

RaimoNiskanen avatar Jan 14 '21 16:01 RaimoNiskanen

Either way, the better systematic fix here if we really want to address the unexpected surprises would be to make these conditional expressions with matches have their own scope. I have found that almost everyone I introduced to the language just assumed that the scope of a new conditional was a new scope the way they would be in C-like languages, and that the problem with the pattern match was one of being surprised of that scope.

Arguably, introducing per-expression scopes would remove the very rare "I'm exporting the variable from all clauses so I can use it outside of the expression" case, bring in the shadowing warnings of funs and list comprehensions to case/if/receive/try, and make the use of a ^ pinning marker extremely obvious in intent with warnings otherwise. It would remove any need to specify stuff unless there is an actual risk of confusion (because the scoping rules tell you so), and require the least amount of work to get much clearer semantics. I've got this gut feeling that this current EEP hopes to get to that final result without actually fixing the scoping issues that actually underpin the perceived problem, and they'll be making me do work for the compiler that language semantics could instead address better than by having me toggle an operator every time.

You're getting ahead of me, but that's definitely something I and the team at WhatsApp have been discussing. But it would be a really big change to the language, and to have any hope of succeeding it would need to be done in stages, with a smooth migration path. The pinning stage is a good first step, making the meaning of each pattern exactly clear regardless of whether shadowing is used or not. I'll be following this up with other suggestions, and local scopes is one of the things we've been trying out - very little code actually uses variables that are exported from subexpressions. But if all of that was done in a big bang change, it might as well be a fork of the language because of all the adoption problems.

richcarl avatar Jan 15 '21 13:01 richcarl

@ferd wrote:

I generally like the power of Prolog-style unification and feel that moving towards it would be more desirable than moving away from it. I'd rather have to explicitly mark cases where you want to re-bind and be told "look, this one isn't a match", than cases where you don't want to rebind, and therefore to me this EEP moves in a direction opposite to what I'd prefer.

As I have understood Prolog (never used it myself), unification is quite different from Erlang's first bind then match approach, with more symmetry between the variable occurences.

I think it would be reasonable to have a syntax for matching, as opposed to binding, since they are fundamentally different operations (in contrast to in Prolog), and matching is the rare one.

RaimoNiskanen avatar Jan 15 '21 14:01 RaimoNiskanen

For reference, a discussion on this very topic is in progress on the erlang-questions mailing-list.

galdor avatar Jan 15 '21 15:01 galdor

stated end state

And I think this is the one of the most important aspect to discuss. If this is to be added but there is no agreement on the end state, then I would say it is best not to add it in the first place, rather than having two ways of writing the same code based on a compiler flag (or a pragma).

With regard to end state, I would be happy to see this be added but remain an optional annotation. It requires no changes to existing codebases as I understand it - which means that individual developers or teams can be free to adopt it should it be a net increase of clarity in their particular case.

HoloRin avatar Jan 15 '21 15:01 HoloRin

Great to see some feedback from @josevalim. Pinning from my experience is very useful to have in Elixir.

This feature certainly would have saved team RabbitMQ hours in investigations of some subtle bugs. And sure, there were larger functions involved. It's reasonable to expect that some functions will need some nesting of cases, and there are functions that accept anonymous functions as arguments. So shadowing is a very real thing in Erlang, even in relatively small modules. And so are the pains and time waste associated with it.

Oh, and the "unified Prolog syntax" argument made me literally LOL. Erlang has a certain reputation in our industry for its syntax. Whether justified or not, let us be mindful of it. Unless we want the industrial Erlang user community to eventually reach the size of that of Prolog!

michaelklishin avatar Jan 15 '21 16:01 michaelklishin

I don't like this idea at all, the very existence of such a proposal makes me upset. Instead, I'd like Elixir to fail on x = x + 1 because there is no variable, it is a binding, and it is not an assingment, it is a pattern matching, and this is a clear, concise default.

Adding this as an optional operator makes things even worse in my opinion, because it adds ambiguity and as a consequence, unnecessary complexity. Please keep it simple.

fisher avatar Jan 15 '21 16:01 fisher

@fisher then you'd be happy that this proposal does not make this work, moreover it would make this fail at compile-time, rather than runtime as it does today. The compiler would warn you that if you indeed want to match (preserving current Erlang semantics) you should have used ^X = X + 1.

Observing the conversation around this proposal here and on the mailing list, I have a general remark. It seems that large number of people commenting have not read through the proposal and are opposing or commenting about things completely unrelated. This is not helpful.

michalmuskala avatar Jan 15 '21 18:01 michalmuskala

@fisher how does an explicit, opt-in way to express the exact intent of the developer add meaningfully to the complexity? What really does add to the practical complexity of using Erlang is the way things work today. Several members of my team can attest to that.

michaelklishin avatar Jan 15 '21 20:01 michaelklishin

I agree with ROK's view "Erlang functions are as a rule small enough that you shouldn't ever shadow a variable". I understand this proposal was created due to difficulty during Erlang development with people that have an Elixir background as a way of attempting to avoid bugs when developing large Erlang functions (born from confusion about variable use when moving from Elixir to Erlang).

I believe this direction is dangerous because a developer should naturally be able to reduce the complexity of functions for both clearer communication and easier maintenance without a new syntax that encourages complexity in functions (encouraging complexity in Erlang source code would be a language design failure).

okeuday avatar Jan 15 '21 22:01 okeuday

Several members of my team can attest to that.

Oh how I love this type of reasoning referring to the tastes of a large amount of flies. No more comments.

fisher avatar Jan 15 '21 23:01 fisher