language icon indicating copy to clipboard operation
language copied to clipboard

Records: What are the semantics of identity?

Open leafpetersen opened this issue 1 year ago • 6 comments

The records proposal specifies identity here to be structural. In particular, identity is required to return true for any two records with the same shape for which the fields are pointwise identical.

We have discussed an alternative specification which would look more like the following:

The identity operator on records may return true for records such that:

  • Both have the same shape.
  • For every pair of corresponding fields, identical could validly return true on that pair. The identity operator on records may always return false.

In either of the two formulations compilers required neither to preserve pointer identity, nor to preserve pointer dis-identity. That is, a compiler may unbox and re-allocate a record freely, or it may replace two source level records with identical components with a single allocation.

However, in the second formulation, compilers are free to implement identity as pointer identity, rather than requiring a slow recursive traversal. This also has implications even for non-record based code, since all identity checks for which the compiler cannot prove that a record reaches neither argument must now include conditional logic to handle the case of records.

Should we change this specification?

cc @munificent @eernstg @lrhn @jakemac53 @natebosch @stereotype441 @rakudrama @mraleph @alexmarkov

leafpetersen avatar Aug 06 '22 00:08 leafpetersen

I'm 100% OK with that if the rest of the language team is.

munificent avatar Aug 06 '22 00:08 munificent

If we choose the second formulation it is easier to change our mind later rather than vice versa. We will be in a better position to do an experiment on the impact of structural identity with everything else implemented, for example, having at hand the code generation for structural equality.

rakudrama avatar Aug 06 '22 01:08 rakudrama

I also prefer the approach where the compiler is allowed to implement identity as pointer identity.

eernstg avatar Aug 06 '22 06:08 eernstg

I'm 100% OK with that if the rest of the language team is.

In our last language team meeting I felt like I was the last holdout who wasn't ok with this change, but y'all convinced me.

stereotype441 avatar Aug 06 '22 13:08 stereotype441

I think it's fine too.

We should be aware that it probably means that some programming patterns will consistently give identical records, and some will not, and people may start depending on the concrete implementation, to the point where we can't change it anyway because it breaks some semi-large number of tests somewhere.

Would be we be willing to add a randomness-factor in debug mode? Say, identity of structs becomes pointer identity && someRandom.nextInt(10) > 0 - or something similar, with big enough chance that code depending on identity will probably fail in testing occasionally, but other code will mainly run in the same track as production mode.

lrhn avatar Aug 06 '22 21:08 lrhn

Whatever we do here, we should probably do precisely the same for data-classes — with some suitable definition of "same shape", which for data-classes is likely the same struct declaration as runtime type, and "same type arguments" (however that's defined).

lrhn avatar Aug 09 '22 12:08 lrhn

One thing to consider: Can identical return true for records that were not declared with identical content, as an optimization.

Consider a program where the record shape (_, _, _, foo: _, bar: _) is recognized by the compiler as never using its second value. That is, the compiler can say, for sure, that for (a, b, c, foo: d, bar: e), the b value is never read or used in any way. So the compiler wants to optimize it away, storing only four values for each record of that type.

The someone asks whether identical((a, b, c, foo:d, bar: e), (a, b2, c, foo:d, bar: e)). If the compiler has actually optimized away the b/b2 values at runtime, then it can't tell the two records apart.

So, if the identical specification requires those two records to not be identical, it prevents the optimization of removing otherwise unnecessary data.

Do we want to prevent that? Or would it be fine to say true for identical, if the program cannot distinguish the two values, whether or not other programs could?

lrhn avatar Aug 18 '22 11:08 lrhn

I think if we don't want to use structural equality for identity, then we will make identical on record useless for any practical purposes. It's probably fine as there is still operator ==.

The 2nd option suggested above makes identical under-specified - identical has too much freedom to return true and false. Maybe we can just say explicitly that identical on records can return anything because records do not have an identity and implementation can decompose and re-create record instances at any time as needed. This would allow any optimizations including what @lrhn suggested above.

Even better option would be to always return false from identical on a record. This way it would be fully specified and deterministic, with the same outcome of identical not being useful on records.

alexmarkov avatar Aug 18 '22 14:08 alexmarkov

We may or may not need to specify identityHashCode on records.

The identityHashCode function returns a hash code which is best-effort-useful and compatible with identical. It's valid as long as it returns the same value for the same object each time.

For records, it's practically useless. The one requirement of an identity hash code is that for identical values it should return the same value. For records, which do not have a consistent identity over time, generating a new random value for each call is compatible with identical, because identical can return false, and the identity can change in the time between calling identityHashCode and identical.

Which means we don't have to care about the implementation of identityHashCode, but we may want to document its uselessness on records very prominently.

lrhn avatar Aug 18 '22 15:08 lrhn

Even better option would be to always return false from identical on a record. This way it would be fully specified and deterministic, with the same outcome of identical not being useful on records.

I would be against this as it required testing that an operand is a record for very little benefit. The 'unspecified' option allows JavaScript to compile identical(a, b) to a === b. Always returning false means that this will need to be something like a === b && !(a instanceof _RecordImplBase). This is an expensive choice for the utility provided - programs that use identical on non-records in a general context are more expensive (e.g. in Map.identity we probably can't prove a key is not a record).

If we can't trace the record to know exactly what operations happen on a record, then we must assume == and hashCode are called. There is very little advantage to making identical more special than ==.

dart2js has an optimization that erases unused fields, but it would be confused by the presence of == that uses the field.

rakudrama avatar Aug 18 '22 16:08 rakudrama

Good point that == should perhaps keep all fields alive, and it's going to be very hard to see that == is not called.

(For web compilation, we should totally make integers structs with no equality, so it's no longer wrong to have identical(nan, nan) be false. If we can't switch to Object.is.)

lrhn avatar Aug 18 '22 17:08 lrhn