[proposition] mask-based implementation for missing values
Currently, saddle uses one value of each primitive type to represent NA. For floating point numbers, this is straightforward as they already include such a value. For other types (Boolean, Byte, Int, etc), it isn't straightforward and an arbitrary value must be used. Currently the minimum value is used (Byte.MinValue, Short.MinValue, etc).
I think this approach has important drawbacks:
- users are unlikely to know about that encoding and could use the min values. That leads to surprising behavior.
- operations resulting in the
MinValuewould result in a missing value. - the binary operations on collections lose in simplicity. Implementation such as
if (tag.isMissing(v1)) v1 else v1 + 2might prevent loop optimizations of the jvm to kick-in. - the
.raw(i)-like api exposes unnecessary complexity to the user.
An alternative approach would be to use a mask-based implementation for the integer-based Vec[T]s. That is, the vector stores a companion Array[Boolean] indicating missing value. This approach is used by pandas.
Some questions arise:
- would the binary operations execute the op in the missing positions (to omit the branching) then union the mask?
- what would be the return type of the method which extracts a value? (Now raw )
- is the mask sparse or dense?
At the moment saddle provides two copies from many operations (eg max, max2) one omitting the missingness check the other respecting missigness. All code in linalg is oblivious to missingness.
would the binary operations execute the op in the missing positions (to omit the branching) then union the mask?
That's an approach indeed. So as to make sure to enable vectorization. To be validated with benchmarking. Also, the vector could have a field hasNAs. If the field is false, the whole NA processing could be skipped.
what would be the return type of the method which extracts a value? (Now raw )
There are two variants if I understand correctly. The boxed one returning Scalar. That one would remain untouched. The raw one could also remain untouched as well with the note that if the corresponding position in the vector is a NA, the value returned has no meaning and could be anything.
is the mask sparse or dense?
I'd start with dense for simplicity? also along the hasNAs idea above, the bool array could not be allocated if the constructor ensures there is no NA values. That said, since Vec is a mutable structure, that might not be good idea.
At the moment saddle provides two copies from many operations (eg max, max2) one omitting the missingness check the other respecting missigness. All code in linalg is oblivious to missingness.
These aspects of saddle are surprising imo. I assumed these 2 implementations were semantically equivalent.
I see. I think I concur with most of these.
My only remaining concern is that I believe there should be a way to extract a Double from a Vec[Double] without going through Scalar, except if we are sure that the VM eliminates the allocations the Scalar on branch less code path (e.g. vec.apply(0).get). This is to enable tight loops on a Vec.