Improve variable name conflict detection in GOOL
This conversation came out of work on the fix to #2179.
In GOOL, there are a number of places where GOOL checks to see if a variable name has been used in order to create a new variable. Examples include:
- List slicing -
tempis used to temporarily store the slice in, andbegIdxandendIdxmay be used to store the beginning or ending index, if they need to be determined at run-time. -
thunkAssign-iis generated for afor-loop. - #3858 - we check to see if it is safe to capitalize the variable name before we do so.
In each of these cases, genVarName or one of its derivatives is used to make sure that the generated variable name has not been used before. For example, if we create GOOL code that declares a variable temp and then does a list slice, the Java code would look something like this:
int temp;
ArrayList<Double> temp0 = new ArrayList<Double>(0);for (int j = 1; j < 4; j += 2) {
temp0.add(myOtherList.get(j));
}
mySlicedList2 = temp0;
This works well, but the problem is it only looks backwards; it doesn't look over the whole program. So if we did the list slice first and then declared the variable, the Java code would look like this:
ArrayList<Double> temp = new ArrayList<Double>(0);for (int j = 1; j < 4; j += 2) {
temp.add(myOtherList.get(j));
}
mySlicedList2 = temp;
int temp;
This would throw an error, since we're declaring temp multiple times.
Ideally we would do two passes so that we know ahead of time which variable names are 'reserved by the user' - i.e. not to be generated.
If we want to do something like this, what approach should we use?
- I know that CodeInfo.hs already does some preprocessing, e.g. I believe it creates the function call map that the Java renderer uses to add
throwsdeclarations to methods. We could look there to see if it's the right place to add this. - @balacij started discussing
stateful rendering in #3864. I'm not exactly sure what that means to be honest, but @samm82 said it might give us the functionality we need.
We had discussed in our meeting that a more concrete example of this would be helpful, so I modified NameGenTest.hs on a separate branch. The GOOL code can be seen here: https://github.com/JacquesCarette/Drasil/blob/5b8e1cd5d1d23aa3346d588d33d99d552092bd88/code/drasil-code/test/NameGenExample.hs#L18-L25
The generated Java code can be seen here: https://github.com/JacquesCarette/Drasil/blob/5b8e1cd5d1d23aa3346d588d33d99d552092bd88/code/stable/gooltest/java/NameGenExample/NameGenExample/NameGenExample.java#L9-L21
As can be seen, the generated names with the for loop avoid conflicts with the first declared variables by appending a 0 to the end of the name. Right after that, however, more user-defined names create conflicts - one with a generated name, and one with another user-defined name.
This was generated by GOOL without any issues, using functionality that has been in place for a while now (I modified the listSlice implementation, but the way it's used here wasn't affected). It shows that while our generated names attempt to be unique, they aren't perfect; and that when conflicts happen GOOL doesn't do anything about it.
From our meeting, we came up with two solutions:
- We can modify GOOL to throw an error if two variables are ever given the same name. We could modify the implementation of each declaration function for each renderer, or better yet we might be able to modify the useVarName implementation, which would catch everything in one place.
- This would ensure that generated code is 'safe' in this respect, but it would not be user-friendly. We don't really want to get an error if a user-defined variable name conflicts with an auto-generated one, and instead it would be better to generate a different name. Still, it's better that GOOL throws an error than the generated code throwing an error when they try to run it, which is what currently happens.
- A more robust solution would be to add some sort of 'initial pass' where we get a complete set of every user-defined variable name. This would allow us to only throw an error if the user made a mistake, i.e. they declared two variables with the same name. This would take more work, but is probably a better long-term solution.
- I need to look more into how we'd do this, but my initial thoughts were outlined in my first comment above.
Amusingly, this is somewhere we could use Haskell's laziness to our advantage: we could lazily generate names and only force them to be actually generated once the full block of code has been processed, so that it would then have the full context in place in which to do correct unique generation. I've definitely seen some code "out there" that does this (I think I recall code for the generation of labels for a one-pass generation of assembly-level code).
Great work @B-rando1. You took what we discussed in the meeting and showed results very quickly! I like the intermediate step of your first option - showing an error if two variables have the same name. From our discussion yesterday, this doesn't seem to be that difficult a change, and it would make GOOL safer. I'm not sure if it is even that bad from the user's point of view, assuming that the error message is descriptive. The user might want control over what their variable is called, instead of relying on the system making a unique name. If we do automatically make unique names in the future from the user input, there should be a log message that lets the user know, or they might be confused by the generated code.
@JacquesCarette that's a cool idea! I'm not quite sure what we need to do to get it there. My understanding is that Haskell is lazy by default, but some operations will force evaluation at a certain point. Is there something we need to change to allow Haskell to defer name generation?
In theory, nothing to do. In practice, you need to make sure you're not doing certain pattern matches "too early" as that will force evaluation. Plus someone's work (Noah?) on making things strict will definitely interact with this. https://wiki.haskell.org/Maintaining_laziness is a useful resource. I think http://comonad.com/reader/2014/fast-circular-substitution/ is one example of what I mean.
I tried locating the issue by using ['a' | _ <- [0..]] in a few places regarding genVarName. This allowed me to find the following:
- genVarName is not lazy.
-
splitVarName is not lazy.
- I saw it uses
reverse, which the Haskell wiki page warned against. Maybe that's the reason.
- I saw it uses
-
Map.lookupis lazy. -
bumpVarName is not lazy.
- Nothing jumped out at me too hard here.
-
Data.Stateis lazy as far as I can tell, though it was hard to check so I might be wrong.
I'll need to do some more digging tomorrow to see what the root causes are, and if there's a good fix for them.