ArcticDB
V2 encoding does not hash descriptor
Describe the bug
When a dataframe has column slices that contain identical data, the hash of the column data is identical across those slices. Normally the keys would be disambiguated by the nanosecond timestamp, but the macOS build currently has coarser timestamps, which exposes a problem specific to the V2 encoding: the column names aren't being hashed.
Note that this is not a problem with identical data between row slices as keys include a start and end index, and these will always be different for different row slices.
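To illustrate the difference between the two cases, here is a minimal sketch in Python. The `SegmentKey` layout, `data_only_hash` and `make_key` are hypothetical simplifications for this issue, not ArcticDB's actual key type or hashing code. It assumes both writes land on the same timestamp, as can happen with a coarse clock.

```python
import hashlib
from typing import NamedTuple


class SegmentKey(NamedTuple):
    # Hypothetical, simplified key layout; not ArcticDB's real key type.
    content_hash: str
    creation_ts: int
    start_index: int
    end_index: int


def data_only_hash(values):
    # Hashes the column values only and ignores column names (models the bug).
    payload = ",".join(str(v) for v in values).encode()
    return hashlib.sha256(payload).hexdigest()


def make_key(values, creation_ts, start_index, end_index):
    return SegmentKey(data_only_hash(values), creation_ts, start_index, end_index)


# Both writes get the same timestamp, as with a coarse clock.
coarse_ts = 1_700_000_000_000_000

# Two column slices over the same rows, holding identical data.
col_slice_a = make_key([1, 1, 1], coarse_ts, 0, 3)
col_slice_b = make_key([1, 1, 1], coarse_ts, 0, 3)
column_slices_collide = col_slice_a == col_slice_b

# Two row slices with identical data but different index ranges.
row_slice_a = make_key([1, 1, 1], coarse_ts, 0, 3)
row_slice_b = make_key([1, 1, 1], coarse_ts, 3, 6)
row_slices_collide = row_slice_a == row_slice_b

print(column_slices_collide, row_slices_collide)
```

In this model the column slices produce the same key, while the row slices are kept apart by their index range alone.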
We should write a test asserting that different column slices with identical data always produce different hashes. This should currently succeed in the V1 encoding (which is the only one in use at the moment) and fail in the V2 encoding. Then we should ensure that the hash generated for the field collection of the stream descriptor is included in the overall hash.
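The intended behaviour could be sketched roughly as follows. This is an illustration in Python, not the ArcticDB implementation: `hash_fields`, `hash_segment`, the `include_descriptor` flag, and the `(name, type)` field representation are all assumptions made for the example.

```python
import hashlib


def hash_fields(fields):
    # fields: list of (name, type) string pairs, a hypothetical stand-in
    # for the field collection of the stream descriptor.
    h = hashlib.sha256()
    for name, dtype in fields:
        for part in (name, dtype):
            raw = part.encode()
            # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
            h.update(len(raw).to_bytes(4, "little"))
            h.update(raw)
    return h.digest()


def hash_segment(fields, data, include_descriptor):
    # include_descriptor=False models the reported V2 behaviour;
    # include_descriptor=True models the proposed fix.
    h = hashlib.sha256()
    if include_descriptor:
        h.update(hash_fields(fields))
    h.update(data)
    return h.hexdigest()


# Two column slices with different column names but identical data.
data = bytes(8 * 4)
slice_a = [("col_0", "INT64"), ("col_1", "INT64")]
slice_b = [("col_2", "INT64"), ("col_3", "INT64")]

buggy_collides = hash_segment(slice_a, data, False) == hash_segment(slice_b, data, False)
fixed_collides = hash_segment(slice_a, data, True) == hash_segment(slice_b, data, True)

print(buggy_collides, fixed_collides)
```

A real test would exercise both encodings through the library itself; the two comparisons here only show the property such a test needs to check.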
Note that this only occurs on LMDB, as the other storages don't care about duplicate keys, although writing duplicates could have other bad effects later on, for example when we come to delete them.
As a separate change, it would probably be a good idea to use a nanosecond-resolution clock on macOS.
Steps/Code to Reproduce
The test test_long_stream_descriptor_mismatch currently fails for encoding V2 on macOS.
Expected Results
Different column slices should produce different hashes, even when their data is identical.
OS, Python Version and ArcticDB Version
All versions
Backend storage used
LMDB
Additional Context
No response
Relates to https://github.com/man-group/ArcticDB/issues/692