V2 encoding does not hash descriptor

Open willdealtry opened this issue 2 years ago • 1 comments
Describe the bug

When a dataframe has columns that contain the same data across a column slice, the hash of the column data is identical for each slice. Normally the keys would be disambiguated by the nanosecond timestamp, but the macOS build currently has coarser timestamps. This exposes a problem specific to the V2 encoding: the column names aren't being hashed.

Note that this is not a problem for identical data in different row slices, because keys include a start and end index, and these will always differ between row slices.
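To make the distinction concrete, here is a minimal sketch of the collision. The SegmentKey layout and the data_hash helper are hypothetical and only loosely modelled on the description above; they are not ArcticDB's real key type or hashing code. The timestamp is held constant to simulate a coarse clock.

```python
import hashlib
from typing import NamedTuple


class SegmentKey(NamedTuple):
    # Hypothetical key layout for illustration; not ArcticDB's actual key type.
    content_hash: str
    creation_ts: int
    start_index: int
    end_index: int


def data_hash(values):
    # Hashes only the values, mimicking a hash that ignores column names.
    return hashlib.sha256(repr(list(values)).encode()).hexdigest()


# A coarse clock hands out the same timestamp to every key written in one call.
coarse_ts = 1_691_514_000_000_000
values = [1, 1, 1, 1]

# Two column slices cover the same rows, so their index ranges are equal.
col_slice_a = SegmentKey(data_hash(values), coarse_ts, 0, 4)
col_slice_b = SegmentKey(data_hash(values), coarse_ts, 0, 4)

# Two row slices cover different rows, so their index ranges differ.
row_slice_a = SegmentKey(data_hash(values), coarse_ts, 0, 4)
row_slice_b = SegmentKey(data_hash(values), coarse_ts, 4, 8)

column_slices_collide = col_slice_a == col_slice_b
row_slices_collide = row_slice_a == row_slice_b
```

With identical data and an identical timestamp, the two column-slice keys compare equal, while the row-slice keys are still separated by their index ranges.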

We should write a test asserting that different column slices with identical data always produce different hashes. This should currently succeed in the V1 encoding (which is the only one in use at the moment) and fail in the V2 encoding. Then we should ensure that the hash generated for the field collection of the stream descriptor is included in the overall hash.
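A rough sketch of the shape such a test could take is below. It uses toy hash functions built on hashlib rather than ArcticDB's encoder, so hash_data_only, hash_with_descriptor and the column names are all assumptions made for illustration; the real test would go through the V1 and V2 encoding paths.

```python
import hashlib


def hash_data_only(columns):
    # Toy stand-in for the buggy behaviour: field names are ignored.
    h = hashlib.sha256()
    for _name, values in columns:
        h.update(repr(values).encode())
    return h.hexdigest()


def hash_with_descriptor(columns):
    # Toy stand-in for the desired behaviour: field names feed the hash too.
    h = hashlib.sha256()
    for name, values in columns:
        h.update(name.encode())
        h.update(b"\x00")  # separator so a name cannot run into the data
        h.update(repr(values).encode())
    return h.hexdigest()


# Two column slices holding identical data under different column names.
slice_a = [("col_0", [1.0, 1.0]), ("col_1", [1.0, 1.0])]
slice_b = [("col_2", [1.0, 1.0]), ("col_3", [1.0, 1.0])]


def test_column_slices_with_identical_data_hash_differently():
    assert hash_with_descriptor(slice_a) != hash_with_descriptor(slice_b)


test_column_slices_with_identical_data_hash_differently()
```

Swapping hash_data_only into the assertion makes it fail, which is the behaviour the issue describes for the V2 encoding.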

Note that this only occurs on LMDB, as the other storages don't care about duplicate keys. Writing duplicates could still have other bad effects later on, for example when we come to delete them.

As a separate change, it would probably be a good idea to use a nanosecond-resolution clock on macOS.

Steps/Code to Reproduce

The test test_long_stream_descriptor_mismatch currently fails for the V2 encoding on macOS.

Expected Results

Hashes should be different

OS, Python Version and ArcticDB Version

All versions

Backend storage used

LMDB

Additional Context

No response

willdealtry avatar Aug 08 '23 17:08 willdealtry

Relates to https://github.com/man-group/ArcticDB/issues/692

mehertz avatar Aug 09 '23 12:08 mehertz