Fixed multiple tests in JsonIngestionFromAvroQueriesTest
Fixes #17376
Motivation
Multiple tests in JsonIngestionFromAvroQueriesTest exhibit non-deterministic failures due to a combination of issues related to HashMap iteration order and incorrect dictionary lookup behavior:
-
Dictionary lookup silently returns incorrect values from row 0 on unsuccessful key match instead of failing. SegmentDictionaryCreator uses FastUtil's Double2IntOpenHashMap, Float2IntOpenHashMap, Int2IntOpenHashMap, Long2IntOpenHashMap, and Object2IntOpenHashMap for dictionary lookups. According to FastUtil documentation, these maps return a default value of 0 when a key is not found via get() or getInt(). This causes data corruption: when a lookup fails, the code silently uses dictionary ID 0, causing row 0's value to incorrectly appear in other rows.
-
Non-deterministic HashMap iteration in Avro deserialization. Apache Avro's
GenericDatumReaderuses HashMap internally (viaGenericData.newMap()), which provides no iteration order guarantees per the HashMap specification. Serializing and deserializing the Avro records containing Maps may result in different key orders across ingestion passes (on storage and indexing). -
JSON key-value pair ordering is not semantically significant. Per the JSON specification, "an object is an unordered set of name/value pairs". String comparison of JSON objects with different key orderings incorrectly treats
{"a":"1","b":"2"}and{"b":"2","a":"1"}as distinct values, when they are semantically equivalent.
The ordering for each of these order-related problems can change due to different environments producing the contents in different orders despite the logical contents being the same. This harmless re-ordering can surface as test failures and allow incorrect dictionary values to be silently used.
Modifications
SegmentDictionaryCreator.java
The FastUtil dictionary maps have been explicitly configured to return -1 on missing keys to prevent silent corruption. Dictionary lookups have been wrapped in checkIdx which returns the dictionary ID when found, attempts JSON-aware normalization for string values, and throws a descriptive IllegalStateException if no match exists. JSON-aware comparison utilities using Jackson have also been added which perform structural comparison via JsonNode.equals() to ignore object key order and to preserve array ordering semantics.
JsonIngestionFromAvroQueriesTest
Added assertJsonEquals() to compare JSON values structurally rather than lexically. testSimpleSelectOnJsonColumn() and testComplexSelectOnJsoncolumn() were refactored to apply column structural comparison only to JSON columns and to use column-level assertions (if needed).
In essence, these changes keep the spirit of the original code while eliminating silent corruptions of data and failures caused by allowed (but previously unexpected) reordering.
Verifying this change
- [x] Make sure that the change passes all checks by running
mvn clean install -Pbin-distlocally.
Fixed tests
-
JsonIngestionFromAvroQueriesTest.testJsonPathSelectOnJsonColumn -
JsonIngestionFromAvroQueriesTest.testComplexSelectOnJsonColumn -
JsonIngestionFromAvroQueriesTest.testSimpleSelectOnJsonColumn