language
language copied to clipboard
Records: What are the semantics of identity?
The records proposal specifies identity here to be structural. In particular, identity is required to return true
for any two records with the same shape for which the fields are pointwise identical.
We have discussed an alternative specification which would look more like the following:
The identity operator on records may return true for records such that:
- Both have the same shape.
- For every pair of corresponding fields, identical could validly return true on that pair. The identity operator on records may always return false.
In either of the two formulations compilers required neither to preserve pointer identity, nor to preserve pointer dis-identity. That is, a compiler may unbox and re-allocate a record freely, or it may replace two source level records with identical components with a single allocation.
However, in the second formulation, compilers are free to implement identity as pointer identity, rather than requiring a slow recursive traversal. This also has implications even for non-record based code, since all identity checks for which the compiler cannot prove that a record reaches neither argument must now include conditional logic to handle the case of records.
Should we change this specification?
cc @munificent @eernstg @lrhn @jakemac53 @natebosch @stereotype441 @rakudrama @mraleph @alexmarkov
I'm 100% OK with that if the rest of the language team is.
If we choose the second formulation it is easier to change our mind later rather than vice versa. We will be in a better position to do an experiment on the impact of structural identity with everything else implemented, for example, having at hand the code generation for structural equality.
I also prefer the approach where the compiler is allowed to implement identity as pointer identity.
I'm 100% OK with that if the rest of the language team is.
In our last language team meeting I felt like I was the last holdout who wasn't ok with this change, but y'all convinced me.
I think it's fine too.
We should be aware that it probably means that some programming patterns will consistently give identical records, and some will not, and people may start depending on the concrete implementation, to the point where we can't change it anyway because it breaks some semi-large number of tests somewhere.
Would be we be willing to add a randomness-factor in debug mode? Say, identity of structs becomes pointer identity && someRandom.nextInt(10) > 0
- or something similar, with big enough chance that code depending on identity will probably fail in testing occasionally, but other code will mainly run in the same track as production mode.
Whatever we do here, we should probably do precisely the same for data-classes — with some suitable definition of "same shape", which for data-classes is likely the same struct declaration as runtime type, and "same type arguments" (however that's defined).
One thing to consider: Can identical return true
for records that were not declared with identical content, as an optimization.
Consider a program where the record shape (_, _, _, foo: _, bar: _)
is recognized by the compiler as never using its second value. That is, the compiler can say, for sure, that for (a, b, c, foo: d, bar: e)
, the b
value is never read or used in any way.
So the compiler wants to optimize it away, storing only four values for each record of that type.
The someone asks whether identical((a, b, c, foo:d, bar: e), (a, b2, c, foo:d, bar: e))
.
If the compiler has actually optimized away the b
/b2
values at runtime, then it can't tell the two records apart.
So, if the identical
specification requires those two records to not be identical, it prevents the optimization of removing otherwise unnecessary data.
Do we want to prevent that? Or would it be fine to say true
for identical, if the program cannot distinguish the two values, whether or not other programs could?
I think if we don't want to use structural equality for identity, then we will make identical
on record useless for any practical purposes. It's probably fine as there is still operator ==
.
The 2nd option suggested above makes identical
under-specified - identical
has too much freedom to return true and false. Maybe we can just say explicitly that identical
on records can return anything because records do not have an identity and implementation can decompose and re-create record instances at any time as needed. This would allow any optimizations including what @lrhn suggested above.
Even better option would be to always return false
from identical
on a record. This way it would be fully specified and deterministic, with the same outcome of identical
not being useful on records.
We may or may not need to specify identityHashCode
on records.
The identityHashCode
function returns a hash code which is best-effort-useful and compatible with identical
. It's valid as long as it returns the same value for the same object each time.
For records, it's practically useless. The one requirement of an identity hash code is that for identical values it should return the same value. For records, which do not have a consistent identity over time, generating a new random value for each call is compatible with identical
, because identical
can return false, and the identity can change in the time between calling identityHashCode
and identical
.
Which means we don't have to care about the implementation of identityHashCode
, but we may want to document its uselessness on records very prominently.
Even better option would be to always return
false
fromidentical
on a record. This way it would be fully specified and deterministic, with the same outcome ofidentical
not being useful on records.
I would be against this as it required testing that an operand is a record for very little benefit.
The 'unspecified' option allows JavaScript to compile identical(a, b)
to a === b
.
Always returning false
means that this will need to be something like a === b && !(a instanceof _RecordImplBase)
.
This is an expensive choice for the utility provided - programs that use identical
on non-records in a general context are more expensive (e.g. in Map.identity
we probably can't prove a key is not a record).
If we can't trace the record to know exactly what operations happen on a record, then we must assume ==
and hashCode
are called. There is very little advantage to making identical
more special than ==
.
dart2js has an optimization that erases unused fields, but it would be confused by the presence of ==
that uses the field.
Good point that ==
should perhaps keep all fields alive, and it's going to be very hard to see that ==
is not called.
(For web compilation, we should totally make integers structs with no equality, so it's no longer wrong to have identical(nan, nan)
be false. If we can't switch to Object.is
.)