unison icon indicating copy to clipboard operation
unison copied to clipboard

Fix `unique` type update churn

Open aryairani opened this issue 3 years ago • 11 comments

A unique type definition not deleted from a scratch file will always be treated as an update (as opposed to an existing/known definition) after an add. See #2195.

aryairani avatar Jul 11 '21 21:07 aryairani

@pchiusano Could you take a look at what it would take to achieve this?

aryairani avatar Jul 11 '21 21:07 aryairani

Option 1

The simplest fix for this: the pretty-printer just adds the GUID to the declaration it generates when a definition is dumped to a file.

unique[someguid] type Direction = North | South | East | West

This is accurate (the declaration has this info available) but ugly. I don't know that we want to expose GUIDs like this as part of the programmer interface.

Option 2

If we don't want to do that, doing a lookup of "the structure matches" would require that we have some index of decl structure to decl. Then in theory we could lookup the decl structure, and if there's a unique match, we replace that definition with its canonical version.

What if there are multiple unique types that share a structure though? (This is likely, for instance, there are many newtypes around Text) Then I suppose you can use some heuristic, checking for names that happen to be the same? There's also the question of where we store this index. Seems like it ought to be the codebase.

Overall, this one seems like a can of worms.

Option 3

Another option that is growing on me: update doesn't update unique types: unique types cannot be updated via punning on the names coinciding. Instead, the user has to create a fresh type, with a different fully qualified name, and use the replace command to explicitly replace the old type with the new type.

This option seems pretty clean. Rather than doing something silly, or taking a guess at the user's intent, just force the user to be explicit. It might be kind of an annoying DX, but perhaps it could be nicely incorporated into a revamped update workflow. https://github.com/unisonweb/unison/issues/2248

pchiusano avatar Aug 23 '21 23:08 pchiusano

Wrt. Option 2 — we can call this 2b — I meant: When we do an update of unique type Direction, we look up the existing decl Direction (no new index is needed, just the existing index by name), and from there compare the structure of the existing data decl named Direction in the namespace with the structure of the updated data decl called Direction in the scratch file. (All updates are by definition already predicated on a names coinciding, so there's nothing particularly new here, other than how we compare the old and new decls for equality.)

Does this make more sense? Are there still worms?

For Option 1, I agree it's unpalatable to expose the GUIDs, even if it is the simplest solution. For Option 3, I think the DX will be way too painful.

aryairani avatar Aug 24 '21 00:08 aryairani

When we do an update of unique type Direction, we look up the existing decl Direction (no new index is needed, just the existing index by name), and from there compare the structure of the existing data decl named Direction in the namespace with the structure of the updated data decl called Direction in the scratch file. (All updates are by definition already predicated on a names coinciding, so there's nothing particularly new here, other than how we compare the old and new decls for equality.) Does this make more sense? Are there still worms?

Ah, cool. That makes sense. It's good that this doesn't need any new index. I'm sold that this is the nicest DX so let's try to make it work.

I think it might be kind of tricky updating the file parsing / hashing code, but it seems doable?

Just thinking about it..

Currently, we compute hashes for all the types in the file in one go, taking into account cycles. That will need tweaking.

What if instead:

  • For each decl in the file, look up the decl with the corresponding FQN. (This requires having access to the codebase within the parsing / hashing code, or some preprocessing pass to figure out what you'll need to load up front. Possible subtlety - is that FQN already shadowed? Should it look up in old names?)
  • If a unique decl has a FQN the corresponds to a pre-existing unique decl, we should only use that decl's GUID if the structure is the same. How do we determine that, in the presence of cycles and constructors being in different orders?
  • Maybe do one hashing pass that is "structure only", converting all types (including the pre-existing same-named unique types) to structural. Now you have a structural hash for all decls involved, and you can figure out if two structures are the same. If they are, and the FQNs match, then you can inherit the GUID from the pre-existing definition.

pchiusano avatar Aug 24 '21 01:08 pchiusano

I may be misunderstanding or delirious, but I think it can be even simpler than what you described.

WDYT:

In HandleInput.hs -> UpdateI, parsing/hashing has already occurred (i.e. nothing new is needed there), we're passed a TypecheckedUnisonFile. We currently have a pure function toSlurpResult which we call here, and which is the bulk of brains of update. This can be the piece that becomes monadic and loads Decls from the codebase, and I don't think it needs any pre-processing.

  • If a unique decl has a FQN the corresponds to a pre-existing unique decl, we should only use that decl's GUID if the structure is the same. How do we determine that, in the presence of cycles and constructors being in different orders?

Won't the cycles and constructors be in the same order? Particularly if it's type-checked source that we just pretty-printed out of the codebase? Aren't the cycles numbering and constructor orders just dependent on the structure of the Decl? or are they dependent on the GUID too?

aryairani avatar Aug 24 '21 03:08 aryairani

In HandleInput.hs -> UpdateI, parsing/hashing has already occurred (i.e. nothing new is needed there), we're passed a TypecheckedUnisonFile. We currently have a pure function toSlurpResult which we call here, and which is the bulk of brains of update. This can be the piece that becomes monadic and loads Decls from the codebase, and I don't think it needs any pre-processing.

Okay yeah that sounds good! I didn't think of making the change in that spot, was thinking it had to be done earlier, when the hashes in the file are being computed. But you are saying you can leave that alone and just detect it later, before building the update.

I was still thinking of using hashing for determining if two structures are the same, rather than trying to like, align the constructors, which seems easy to get wrong. The hashing will correctly deal with the situation where, like, a unique type is part of a cycle and one of the cycle elements has a different structure (so therefore, all cycle elements have a different structure). And maybe other corner cases I haven't thought of.

pchiusano avatar Aug 24 '21 03:08 pchiusano

Aren't the cycles numbering and constructor orders just dependent on the structure of the Decl? or are they dependent on the GUID too?

If I have:

unique type Foo = Blah | Blah2 Nat

unique ability Woot where
  x : Foo -> Foo -> Nat
  y : Nat

Then Woot's order of constructors depends on the GUID for Foo (it will canonicalize Woot's constructor order based on the hashes of the constructor types - and those hashes incorporate the GUID for Foo).

The constructor orders have been canonicalized by the time the file is typechecked.

So in general, the constructor orders are not going to match between the old and new version of the type. You need to rehash just the structures or using the same GUIDs to get things in the same canonical order.

pchiusano avatar Aug 24 '21 03:08 pchiusano

Hmm, I see; bummer.

Going back to the drawing board, I guess the problem is that ucm sort of needs to know in advance of type-checking why the various definitions are in the scratch file. Is the definition there because the user wants to use and check and add it? Or is it there because the user wanted to have a copy visible for reference in the scratch file? Or is it there because the user hasn't gotten around to deleting it yet?

With a GUI, I could imagine keeping this info around as hidden client-side state (maybe a variation on Option 1 would work), but all we have at the moment for client-side state is the contents of the scratch file, and the whatever is in HandleInput.LoopState.

I was going to suggest an Option 3b where update doesn't update unique types unless the user specifies them explicitly, as in: .tmp> update Direction. But if we have other definitions in the scratch file that depend on Direction, what GUID will they be using? Now we have a mess. I forgot to consider this for Option 2b too; we do need to know our GUIDs before type-checking dependents of unique types.

🤔

aryairani avatar Aug 24 '21 04:08 aryairani

Definitely typing nonsense at this point — will take another look again later.

aryairani avatar Aug 24 '21 04:08 aryairani

I think your idea still works, it's just that the "determine if these two structures are the same" calculation needs to use the algorithm I suggested - unhash the decls in the file, rehash them as "structure only", and also do the same for definitions in the codebase whose names coincide with definitions in the file. Now you have a "structure hash" which has the same meaning for both old and new, which you can use to skip updates for unique types whose structure hasn't changed from the same-named definition in the codebase.

pchiusano avatar Aug 24 '21 14:08 pchiusano

Tangentially related: today we saw a confusing thing here:

```unison
unique ability Foo where
  Op0 : Nat -> ()
```
```ucm
.temp> add
```
```unison
unique ability Foo where
  Op1 : Nat -> ()
```
```ucm
.temp> update
```

Op0 disappears; possibly expected after <IOU ticket reference>.

aryairani avatar Nov 17 '21 17:11 aryairani

@runarorama suggested the following fix, which requires no codebase format changes.

When typechecking a file, we get a list of all the unique types. We then check the namespace for any types with the same names.

We load these types from the codebase, and hash them as structural types. We compare this structural hash to the structural hash of the type in the file. If it's the same, we use the same GUID for the types in the file.

Boom. This achieves the property that if the type has the same structure and the same name, it gets the same guid and hash.

@aryairani

Update: this doesn't quite work because guid affects constructor order which in turn affects the final hash. But what you can do is just try copying the guid from the types in the namespace to the types in the scratch file. Then rehash the type in the scratch file and see if it matches. This has the same effect.

pchiusano avatar Oct 05 '22 17:10 pchiusano

To anyone else who keeps getting bitten by this, here is an awk script that comments out all of the unique type declarations in a scratch file.

ceedubs avatar May 11 '23 01:05 ceedubs

Restating the suggestion:

  1. When typechecking, we get a list of all the unique types in the file. Call one of these tf.
  2. For each tf, check the namespace for a type with the same names. Call this tn.
  3. Load tn from the codebase for its unique guid gtn..
  4. Build tf' = { tf | guid = gtn }.
  5. If hash tf' = hash tn, then replace tf with tf' in the UnisonFile; otherwise leave tf alone.

What if there's more than one tn for a tf? Putting aside "update considered harmful", let's: If there's any tn that matches, just use that corresponding tf' for typechecking. Maybe in the future we could ask the user, or use some other heuristic to decide what to do.

aryairani avatar Jun 23 '23 16:06 aryairani

What if there's more than one tn for a tf? Putting aside "update considered harmful", let's: If there's any tn that matches, just use that corresponding tf' for typechecking. Maybe in the future we could ask the user, or use some other heuristic to decide what to do.

Do you mean pick an arbitrary tn? That is, say the user already has in their namespace

name structural hash guid
Foo #abc 123
Foo #abc 456
Foo #def 789

and they save a .u file with a unique type Foo defined that looks like

name structural hash guid
Foo #abc

then we have two different-but-equally-valid GUIDs to reuse, so we pick one, and proceed to typecheck?

name structural hash guid
Foo #abc arbitrarily pick 123 or 456

mitchellwrosen avatar Jun 23 '23 19:06 mitchellwrosen

Yeah, I think that makes as much sense as anything. One of them will look stable and the other one will look like churn, but we'll be reducing churn, and in most cases eliminating it.

In the future we may want to do something special when updateing when there's a conflict, and that something special will be applicable here too.

aryairani avatar Jun 27 '23 02:06 aryairani