snowflake-ingest-java
SNOW-682477 Unicode and collations support
This PR fixes several issues in string handling, particularly for non-ASCII characters and collations. Merging must be postponed until the server-side counterpart is deployed.
This PR must be released as SDK version "1.0.2-beta.7" because, starting with this version, the server will interpret EP values differently.
This PR fixes the following issues:
SNOW-686944 Min/max string values were being truncated to 32 characters rather than to 32 bytes, which is what SF expects. For Unicode strings the values were therefore longer than 32 bytes, so the server truncated them a second time. Because the server always truncates down, this led to invalid max values.
SNOW-682477 Max length has to be reported in bytes, not in characters.
SNOW-663621 String comparison and truncation have to be compatible with XP; otherwise metadata checker incidents are raised due to mismatches.
SNOW-693446 Fixed transformation of collated strings. Collated bytes need to be calculated from the full string and only then truncated. Min/max non-collated strings should be set to null if the string is longer than 32 bytes. This patch also downgrades the ICU library to behave consistently with XP and GS.
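The difference between character-based and byte-based limits, and between UTF-16 and UTF-8 ordering, can be sketched in plain Java. This is a minimal illustration, not the SDK's code: the class and method names are hypothetical, and reading "compatible with XP" as unsigned byte-wise comparison of the UTF-8 encoding is an assumption. The sketch also assumes well-formed strings (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative and not taken from the SDK.
class Utf8StringStats {
    static final int MAX_BYTES = 32;

    // Length of the string in UTF-8 bytes (not in Java chars).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Longest prefix of s whose UTF-8 encoding fits in maxBytes,
    // never cutting through the middle of a code point.
    static String truncateToBytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break;
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    // Assumption: compare as unsigned UTF-8 bytes. String.compareTo compares
    // UTF-16 code units instead, which orders some strings differently.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 40; i++) {
            sb.append('\u00E9'); // 'é' takes 2 bytes in UTF-8
        }
        String s = sb.toString();

        String byChars = s.substring(0, MAX_BYTES);     // 32 chars
        String byBytes = truncateToBytes(s, MAX_BYTES); // 16 chars
        System.out.println(utf8Length(byChars)); // 64: still over the byte limit
        System.out.println(utf8Length(byBytes)); // 32

        String bmp = "\uFFFD";         // U+FFFD, UTF-8: EF BF BD
        String emoji = "\uD83D\uDE00"; // U+1F600, UTF-8: F0 9F 98 80
        System.out.println(bmp.compareTo(emoji) > 0);    // true: UTF-16 order
        System.out.println(compareUtf8(bmp, emoji) < 0); // true: UTF-8 byte order
    }
}
```

The last two lines show why a min/max computed with `String.compareTo` can disagree with one computed over UTF-8 bytes once supplementary characters are involved.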
Testing
Tests have been added for all kinds of string comparison and truncation. Randomized testing has been added, which generates random Unicode strings, ingests and migrates them, and ensures the data and metadata are consistent. Additional integration tests for collation support have been added.
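As a rough sketch of the random-string side of such a test, the following generates well-formed random Unicode strings from a fixed seed and checks two local invariants. The class and method names are hypothetical, and the ingest and migrate steps are not shown because they depend on the SDK and server; only the generator and a UTF-8 round-trip check are illustrated.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical test helper; not part of the SDK.
class RandomUnicodeCheck {

    // Builds a string of up to maxCodePoints random code points,
    // skipping the surrogate range so the result is always well-formed.
    static String randomUnicodeString(Random rnd, int maxCodePoints) {
        int n = rnd.nextInt(maxCodePoints + 1);
        StringBuilder sb = new StringBuilder();
        int added = 0;
        while (added < n) {
            int cp = rnd.nextInt(Character.MAX_CODE_POINT + 1);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // lone surrogates cannot be encoded in UTF-8
            }
            sb.appendCodePoint(cp);
            added++;
        }
        return sb.toString();
    }

    // Encoding to UTF-8 and decoding again must reproduce the string.
    static boolean roundTrips(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    // Each code point takes between 1 and 4 bytes in UTF-8.
    static boolean byteLengthInBounds(String s) {
        int cps = s.codePointCount(0, s.length());
        int bytes = s.getBytes(StandardCharsets.UTF_8).length;
        return bytes >= cps && bytes <= 4 * cps;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42L); // fixed seed keeps failures reproducible
        for (int i = 0; i < 10_000; i++) {
            String s = randomUnicodeString(rnd, 64);
            if (!roundTrips(s) || !byteLengthInBounds(s)) {
                throw new AssertionError("Invariant violated for iteration " + i);
            }
        }
        System.out.println("all random strings passed");
    }
}
```

A fixed seed is used so that any failing input can be regenerated; a real test would feed these strings through ingestion and compare the stored data and metadata.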