add support for 'front coded' string dictionaries for smaller string columns
Closes #3922
Description
This PR adds a new way of storing STRING typed columns, using an incremental encoding strategy called 'front coding'. Essentially, sorted values are split into buckets; the first value in each bucket is written completely, and each remaining value is stored as a pair: the length of the prefix it shares with the first value, and the fragment that remains after that prefix. Using the quickstart wikipedia example, this results in nicely reduced segment sizes with relatively little performance penalty in most typical Druid queries.
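To make the bucketing concrete, here is a minimal sketch of the encoding step in Java. The class and helper names are made up for illustration; this is not the actual `FrontCodedIndexed` code, just the scheme described above (first value whole, later values as a prefix length plus the remaining fragment):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FrontCodingSketch
{
  // Number of leading bytes shared by two byte arrays.
  static int commonPrefixLength(byte[] a, byte[] b)
  {
    final int limit = Math.min(a.length, b.length);
    for (int i = 0; i < limit; i++) {
      if (a[i] != b[i]) {
        return i;
      }
    }
    return limit;
  }

  // Split sorted values into buckets; the first value of each bucket is kept
  // whole, and later values become {prefixLength, remainingFragment} pairs.
  static List<List<Object[]>> encode(List<String> sortedValues, int bucketSize)
  {
    final List<List<Object[]>> buckets = new ArrayList<>();
    for (int start = 0; start < sortedValues.size(); start += bucketSize) {
      final List<Object[]> bucket = new ArrayList<>();
      final byte[] first = sortedValues.get(start).getBytes(StandardCharsets.UTF_8);
      bucket.add(new Object[]{0, first});
      final int end = Math.min(start + bucketSize, sortedValues.size());
      for (int i = start + 1; i < end; i++) {
        final byte[] value = sortedValues.get(i).getBytes(StandardCharsets.UTF_8);
        final int prefix = commonPrefixLength(first, value);
        final byte[] fragment = new byte[value.length - prefix];
        System.arraycopy(value, prefix, fragment, 0, fragment.length);
        bucket.add(new Object[]{prefix, fragment});
      }
      buckets.add(bucket);
    }
    return buckets;
  }
}
```

For a sorted bucket like `["apple", "applesauce", "banana"]`, "applesauce" is stored as a prefix length of 5 plus the fragment "sauce", while "banana" shares nothing with "apple" and is stored as a prefix length of 0 plus its full bytes.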
Segment Size and Performance
Wikipedia "quickstart" segments:
GenericIndexed<ByteBuffer> "singleThreaded" vs FrontCodedIndexed with bucket sizes 4 and 16:
Benchmark (indexType) (numElements) (numOperations) (width) Mode Cnt Score Error Units
FrontCodedIndexedBenchmark.get generic 10000 10000 16 avgt 5 12.842 ± 0.309 ns/op
FrontCodedIndexedBenchmark.get generic 100000 10000 16 avgt 5 14.173 ± 0.259 ns/op
FrontCodedIndexedBenchmark.get front-coded-4 10000 10000 16 avgt 5 39.418 ± 0.989 ns/op
FrontCodedIndexedBenchmark.get front-coded-4 100000 10000 16 avgt 5 43.614 ± 1.081 ns/op
FrontCodedIndexedBenchmark.get front-coded-16 10000 10000 16 avgt 5 68.667 ± 0.674 ns/op
FrontCodedIndexedBenchmark.get front-coded-16 100000 10000 16 avgt 5 72.695 ± 0.170 ns/op
FrontCodedIndexedBenchmark.indexOf generic 10000 10000 16 avgt 5 161.167 ± 1.945 ns/op
FrontCodedIndexedBenchmark.indexOf generic 100000 10000 16 avgt 5 229.376 ± 4.963 ns/op
FrontCodedIndexedBenchmark.indexOf front-coded-4 10000 10000 16 avgt 5 243.259 ± 0.290 ns/op
FrontCodedIndexedBenchmark.indexOf front-coded-4 100000 10000 16 avgt 5 377.807 ± 5.362 ns/op
FrontCodedIndexedBenchmark.indexOf front-coded-16 10000 10000 16 avgt 5 252.050 ± 0.786 ns/op
FrontCodedIndexedBenchmark.indexOf front-coded-16 100000 10000 16 avgt 5 381.047 ± 4.761 ns/op
FrontCodedIndexedBenchmark.iterator generic 10000 10000 16 avgt 5 39.341 ± 0.623 ns/op
FrontCodedIndexedBenchmark.iterator generic 100000 10000 16 avgt 5 384.242 ± 0.899 ns/op
FrontCodedIndexedBenchmark.iterator front-coded-4 10000 10000 16 avgt 5 95.082 ± 18.971 ns/op
FrontCodedIndexedBenchmark.iterator front-coded-4 100000 10000 16 avgt 5 690.636 ± 27.592 ns/op
FrontCodedIndexedBenchmark.iterator front-coded-16 10000 10000 16 avgt 5 77.685 ± 29.205 ns/op
FrontCodedIndexedBenchmark.iterator front-coded-16 100000 10000 16 avgt 5 606.071 ± 25.713 ns/op
FrontCodedIndexedBenchmark.get:encoded size generic 10000 10000 16 avgt 5 240010.000 bytes
FrontCodedIndexedBenchmark.get:encoded size generic 100000 10000 16 avgt 5 2400010.000 bytes
FrontCodedIndexedBenchmark.get:encoded size front-coded-4 10000 10000 16 avgt 5 169992.000 bytes
FrontCodedIndexedBenchmark.get:encoded size front-coded-4 100000 10000 16 avgt 5 1636900.000 bytes
FrontCodedIndexedBenchmark.get:encoded size front-coded-16 10000 10000 16 avgt 5 164187.000 bytes
FrontCodedIndexedBenchmark.get:encoded size front-coded-16 100000 10000 16 avgt 5 1564828.000 bytes
How this translates into queries has so far been measured with SqlBenchmark, though its generated dataset does pretty poorly with this encoding because it is composed of numbers translated directly into strings, leaving few common prefixes to take advantage of (383MB with generic indexed compared to 381MB with front-coded indexed). Since the size advantage is essentially absent here, this is likely near the worst case for this encoding strategy.
SELECT SUM(sumLongSequential), SUM(sumFloatNormal) FROM foo WHERE dimSequential NOT LIKE '%3'
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 4 5000000 mmap none false avgt 5 94.069 ± 1.454 ms/op
SqlBenchmark.querySql 4 5000000 mmap none force avgt 5 45.142 ± 1.205 ms/op
SqlBenchmark.querySql 4 5000000 mmap front-coded-4 false avgt 5 90.089 ± 1.271 ms/op
SqlBenchmark.querySql 4 5000000 mmap front-coded-4 force avgt 5 44.842 ± 1.130 ms/op
SqlBenchmark.querySql 4 5000000 mmap front-coded-16 false avgt 5 90.121 ± 1.020 ms/op
SqlBenchmark.querySql 4 5000000 mmap front-coded-16 force avgt 5 44.965 ± 1.256 ms/op
-------
SELECT SUM(sumLongSequential), SUM(sumFloatNormal) FROM foo WHERE dimSequential = '311'
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 5 5000000 mmap none false avgt 5 23.650 ± 0.605 ms/op
SqlBenchmark.querySql 5 5000000 mmap none force avgt 5 22.420 ± 0.496 ms/op
SqlBenchmark.querySql 5 5000000 mmap front-coded-4 false avgt 5 23.756 ± 0.766 ms/op
SqlBenchmark.querySql 5 5000000 mmap front-coded-4 force avgt 5 22.045 ± 0.551 ms/op
SqlBenchmark.querySql 5 5000000 mmap front-coded-16 false avgt 5 23.671 ± 0.603 ms/op
SqlBenchmark.querySql 5 5000000 mmap front-coded-16 force avgt 5 22.044 ± 0.638 ms/op
-------
SELECT SUM(sumLongSequential), SUM(sumFloatNormal) FROM foo WHERE dimSequential NOT LIKE '%3' AND maxLongUniform > 10
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 6 5000000 mmap none false avgt 5 48.876 ± 1.715 ms/op
SqlBenchmark.querySql 6 5000000 mmap none force avgt 5 23.413 ± 0.713 ms/op
SqlBenchmark.querySql 6 5000000 mmap front-coded-4 false avgt 5 48.864 ± 1.748 ms/op
SqlBenchmark.querySql 6 5000000 mmap front-coded-4 force avgt 5 23.434 ± 0.901 ms/op
SqlBenchmark.querySql 6 5000000 mmap front-coded-16 false avgt 5 48.821 ± 1.737 ms/op
SqlBenchmark.querySql 6 5000000 mmap front-coded-16 force avgt 5 23.495 ± 0.684 ms/op
-------
SELECT
SUM(sumLongSequential) FILTER(WHERE dimSequential = '311'),
SUM(sumFloatNormal)
FROM foo WHERE dimSequential NOT LIKE '%3'
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 7 5000000 mmap none false avgt 5 96.444 ± 3.474 ms/op
SqlBenchmark.querySql 7 5000000 mmap none force avgt 5 51.797 ± 1.244 ms/op
SqlBenchmark.querySql 7 5000000 mmap front-coded-4 false avgt 5 96.150 ± 3.059 ms/op
SqlBenchmark.querySql 7 5000000 mmap front-coded-4 force avgt 5 51.678 ± 0.869 ms/op
SqlBenchmark.querySql 7 5000000 mmap front-coded-16 false avgt 5 107.410 ± 2.924 ms/op
SqlBenchmark.querySql 7 5000000 mmap front-coded-16 force avgt 5 51.833 ± 1.156 ms/op
-------
(timeseries bunch of filtered aggs)
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 8 5000000 mmap none false avgt 5 893.378 ± 3.220 ms/op
SqlBenchmark.querySql 8 5000000 mmap none force avgt 5 243.069 ± 3.936 ms/op
SqlBenchmark.querySql 8 5000000 mmap front-coded-4 false avgt 5 867.778 ± 13.031 ms/op
SqlBenchmark.querySql 8 5000000 mmap front-coded-4 force avgt 5 340.799 ± 4.697 ms/op
SqlBenchmark.querySql 8 5000000 mmap front-coded-16 false avgt 5 868.694 ± 18.088 ms/op
SqlBenchmark.querySql 8 5000000 mmap front-coded-16 force avgt 5 352.819 ± 2.469 ms/op
-------
SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo GROUP BY 1, 2
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 10 5000000 mmap none false avgt 5 430.592 ± 34.032 ms/op
SqlBenchmark.querySql 10 5000000 mmap none force avgt 5 235.971 ± 3.405 ms/op
SqlBenchmark.querySql 10 5000000 mmap front-coded-4 false avgt 5 426.965 ± 9.557 ms/op
SqlBenchmark.querySql 10 5000000 mmap front-coded-4 force avgt 5 247.754 ± 2.635 ms/op
SqlBenchmark.querySql 10 5000000 mmap front-coded-16 false avgt 5 434.664 ± 10.247 ms/op
SqlBenchmark.querySql 10 5000000 mmap front-coded-16 force avgt 5 250.683 ± 3.340 ms/op
-------
SELECT dimSequential, dimZipf, SUM(sumLongSequential), COUNT(*) FROM foo GROUP BY 1, 2
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 11 5000000 mmap none false avgt 5 433.815 ± 7.673 ms/op
SqlBenchmark.querySql 11 5000000 mmap none force avgt 5 255.151 ± 1.853 ms/op
SqlBenchmark.querySql 11 5000000 mmap front-coded-4 false avgt 5 442.550 ± 9.269 ms/op
SqlBenchmark.querySql 11 5000000 mmap front-coded-4 force avgt 5 263.782 ± 3.357 ms/op
SqlBenchmark.querySql 11 5000000 mmap front-coded-16 false avgt 5 441.626 ± 8.954 ms/op
SqlBenchmark.querySql 11 5000000 mmap front-coded-16 force avgt 5 268.237 ± 2.834 ms/op
-------
SELECT dimZipf FROM foo GROUP BY 1
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 12 5000000 mmap none false avgt 5 42.031 ± 1.408 ms/op
SqlBenchmark.querySql 12 5000000 mmap none force avgt 5 24.031 ± 0.881 ms/op
SqlBenchmark.querySql 12 5000000 mmap front-coded-4 false avgt 5 42.393 ± 0.648 ms/op
SqlBenchmark.querySql 12 5000000 mmap front-coded-4 force avgt 5 24.055 ± 1.008 ms/op
SqlBenchmark.querySql 12 5000000 mmap front-coded-16 false avgt 5 42.255 ± 1.134 ms/op
SqlBenchmark.querySql 12 5000000 mmap front-coded-16 force avgt 5 22.201 ± 1.120 ms/op
-------
(big union)
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 19 5000000 mmap none false avgt 5 365.313 ± 5.432 ms/op
SqlBenchmark.querySql 19 5000000 mmap none force avgt 5 315.686 ± 6.720 ms/op
SqlBenchmark.querySql 19 5000000 mmap front-coded-4 false avgt 5 477.117 ± 67.235 ms/op
SqlBenchmark.querySql 19 5000000 mmap front-coded-4 force avgt 5 404.028 ± 8.931 ms/op
SqlBenchmark.querySql 19 5000000 mmap front-coded-16 false avgt 5 489.550 ± 11.902 ms/op
SqlBenchmark.querySql 19 5000000 mmap front-coded-16 force avgt 5 443.758 ± 9.031 ms/op
-------
SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo WHERE dimUniform NOT LIKE '%3' GROUP BY 1, 2
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 21 5000000 mmap none false avgt 5 444.995 ± 5.940 ms/op
SqlBenchmark.querySql 21 5000000 mmap none force avgt 5 273.133 ± 1.405 ms/op
SqlBenchmark.querySql 21 5000000 mmap front-coded-4 false avgt 5 450.684 ± 13.259 ms/op
SqlBenchmark.querySql 21 5000000 mmap front-coded-4 force avgt 5 279.336 ± 2.098 ms/op
SqlBenchmark.querySql 21 5000000 mmap front-coded-16 false avgt 5 456.444 ± 9.268 ms/op
SqlBenchmark.querySql 21 5000000 mmap front-coded-16 force avgt 5 283.369 ± 2.217 ms/op
-------
SELECT dimZipf, SUM(sumLongSequential) FROM foo WHERE dimSequential = '311' GROUP BY 1 ORDER BY 1
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 22 5000000 mmap none false avgt 5 15.510 ± 0.522 ms/op
SqlBenchmark.querySql 22 5000000 mmap none force avgt 5 15.380 ± 0.475 ms/op
SqlBenchmark.querySql 22 5000000 mmap front-coded-4 false avgt 5 15.565 ± 0.652 ms/op
SqlBenchmark.querySql 22 5000000 mmap front-coded-4 force avgt 5 15.372 ± 0.504 ms/op
SqlBenchmark.querySql 22 5000000 mmap front-coded-16 false avgt 5 15.544 ± 0.667 ms/op
SqlBenchmark.querySql 22 5000000 mmap front-coded-16 force avgt 5 15.496 ± 0.441 ms/op
-------
SELECT * FROM foo
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 23 5000000 mmap none force avgt 5 11097.105 ± 40.113 ms/op
SqlBenchmark.querySql 23 5000000 mmap front-coded-4 force avgt 5 11968.038 ± 125.990 ms/op
SqlBenchmark.querySql 23 5000000 mmap front-coded-16 force avgt 5 13985.780 ± 74.495 ms/op
-------
SELECT * FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100')
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 24 5000000 mmap none force avgt 5 255.461 ± 2.556 ms/op
SqlBenchmark.querySql 24 5000000 mmap front-coded-4 force avgt 5 258.501 ± 2.187 ms/op
SqlBenchmark.querySql 24 5000000 mmap front-coded-16 force avgt 5 298.738 ± 2.915 ms/op
-------
SELECT * FROM foo WHERE dimSequential > '10' AND dimSequential < '8500'
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 25 5000000 mmap none force avgt 5 9171.879 ± 25.495 ms/op
SqlBenchmark.querySql 25 5000000 mmap front-coded-4 force avgt 5 8819.167 ± 50.877 ms/op
SqlBenchmark.querySql 25 5000000 mmap front-coded-16 force avgt 5 10618.384 ± 124.553 ms/op
-------
SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100') GROUP BY 1, 2
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 26 5000000 mmap none false avgt 5 22.775 ± 0.811 ms/op
SqlBenchmark.querySql 26 5000000 mmap none force avgt 5 18.762 ± 0.654 ms/op
SqlBenchmark.querySql 26 5000000 mmap front-coded-4 false avgt 5 22.925 ± 0.772 ms/op
SqlBenchmark.querySql 26 5000000 mmap front-coded-4 force avgt 5 18.920 ± 0.610 ms/op
SqlBenchmark.querySql 26 5000000 mmap front-coded-16 false avgt 5 22.989 ± 0.813 ms/op
SqlBenchmark.querySql 26 5000000 mmap front-coded-16 force avgt 5 18.989 ± 0.571 ms/op
-------
SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo WHERE dimSequential > '10' AND dimSequential < '8500' GROUP BY 1, 2
Benchmark (query) (rowsPerSegment) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 27 5000000 mmap none false avgt 5 377.893 ± 13.541 ms/op
SqlBenchmark.querySql 27 5000000 mmap none force avgt 5 228.251 ± 2.652 ms/op
SqlBenchmark.querySql 27 5000000 mmap front-coded-4 false avgt 5 387.915 ± 11.770 ms/op
SqlBenchmark.querySql 27 5000000 mmap front-coded-4 force avgt 5 234.894 ± 2.848 ms/op
SqlBenchmark.querySql 27 5000000 mmap front-coded-16 false avgt 5 400.435 ± 10.410 ms/op
SqlBenchmark.querySql 27 5000000 mmap front-coded-16 force avgt 5 237.497 ± 2.143 ms/op
I have yet to perform larger-scale testing on actual clusters with real workloads, but the benchmark results look very promising at this point and show very little overhead at query time, with a decent chance that the reduced segment sizes will more than make up for it.
Design
The encoding itself is done within a new Indexed implementation, FrontCodedIndexed, which contains all of the methods for reading and writing the buckets of values. I adapted 'variable byte' encoding for integer values from JavaFastPFOR to write both string value lengths and prefix lengths.
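As a rough illustration of variable byte ('vbyte') encoding, here is a minimal sketch of one common variant, in which the high bit of the final byte marks the end of the value. The exact bit convention used by this PR's VByte class is an assumption here; the sketch just shows the general technique of packing 7 bits per byte:

```java
import java.nio.ByteBuffer;

public class VByteSketch
{
  // Write a non-negative int using 7 bits per byte; the high bit is set
  // on the final byte to terminate the value.
  static void writeVByte(ByteBuffer buffer, int value)
  {
    while (value >= 0x80) {
      buffer.put((byte) (value & 0x7F));
      value >>>= 7;
    }
    buffer.put((byte) (value | 0x80));
  }

  // Read a value written by writeVByte, accumulating 7 bits at a time
  // until a byte with the high bit set is seen.
  static int readVByte(ByteBuffer buffer)
  {
    int value = 0;
    int shift = 0;
    while (true) {
      final byte b = buffer.get();
      if ((b & 0x80) != 0) {
        return value | ((b & 0x7F) << shift);
      }
      value |= (b & 0x7F) << shift;
      shift += 7;
    }
  }
}
```

The payoff for front coding is that small lengths (which prefix lengths usually are) cost a single byte instead of four.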
string layout:
length | value |
---|---|
vbyte int | byte[] |
bucket layout:
first string | prefix length | string fragment | ... | prefix length | string fragment |
---|---|---|---|---|---|
byte[] | vbyte int | byte[] | ... | vbyte int | byte[] |
front coded indexed layout:
version | bucket size | has null? | number of values | size of "offsets" + "buckets" | "offsets" | "buckets" |
---|---|---|---|---|---|---|
byte | byte | byte | vbyte int | vbyte int | int[] | bucket[] |
Note that the "offsets" store the starting locations of all buckets beyond the first (the first bucket is known to start at the end of the "offsets" section). The offsets are stored as plain integers rather than vbyte-encoded to allow fast access to bucket positions, but they are probably a good candidate for delta-encoded byte packing to further decrease their size.
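Because bucket sizes are fixed, locating the bucket for a dictionary index needs only a division and at most one offset read. A sketch with assumed names (the real code also has to account for the header and the null entry):

```java
public class BucketLookup
{
  // Returns the byte position, relative to the start of the buckets section,
  // at which the bucket containing dictionary index `index` begins.
  // offsets[k] holds the start of bucket k + 1; bucket 0 starts at position 0.
  static int bucketStart(int[] offsets, int index, int bucketSize)
  {
    final int bucket = index / bucketSize;
    return bucket == 0 ? 0 : offsets[bucket - 1];
  }
}
```

From the bucket start, a reader decodes the first value whole, then applies `index % bucketSize` prefix/fragment pairs to reconstruct the target value.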
Using it
This functionality can be utilized via a new IndexSpec property, stringDictionaryEncoding, which can be set to {"type":"front-coded", "bucketSize": 4} or {"type":"front-coded", "bucketSize": 16} (or similar) to instruct indexing tasks to write segments with compressed dictionaries using bucket size 4 or 16 respectively; {"type":"utf8"} is the default.
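For example, an indexSpec fragment enabling the new encoding might look like the following (the surrounding field placement is inferred from the description above, so treat it as illustrative rather than authoritative):

```json
{
  "indexSpec": {
    "stringDictionaryEncoding": {
      "type": "front-coded",
      "bucketSize": 4
    }
  }
}
```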
This mode is not yet the default because segments written this way are unreadable by older versions of Druid, so care must be taken before migrating to this encoding. Additionally, it needs a lot more testing and measurement to ensure it is genuinely better in most cases before becoming the default, but it looks pretty promising.
Testing
Besides the direct tests on FrontCodedIndexed and VByte, I also wired a front-coded segment into both BaseFilterTest and QueryTestHelper, which provides a rather wide set of test coverage across a variety of scenarios. This process found a number of bugs in my initial commits, so I feel reasonably confident that things are correct at this point.
Future work
Before I started coding on this, in addition to the paper linked in #3922 (https://arxiv.org/pdf/1101.5506.pdf), I also read through https://arxiv.org/pdf/1911.08372.pdf, a newer paper by one of the same authors, and stumbled upon https://link.springer.com/content/pdf/10.1007/s00778-020-00620-x.pdf. All of these detail additional (much fancier) improvements that can be made to this strategy by further coding the string values (https://en.wikipedia.org/wiki/Re-Pair seems to be the primary focus in these papers). It would probably be worth investigating what additional size improvements can be gained, and at what cost.
Additionally, it would be ideal to vary which encoding is used per column instead of setting it at the segment level (this seems true of other types of compression as well). This would allow collecting statistics at indexing time to determine how likely this encoding is to be useful, such as minimum value cardinality thresholds or similar (akin to the 'auto' encoding available for long columns).
Key changed/added classes in this PR
- VByte
- FrontCodedIndexed
- FrontCodedIndexedWriter
- StringDimensionMergerV9
- DictionaryEncodedColumnPartSerde
- DictionaryEncodedStringColumnIndexSupplier (internal classes split out into several new files)
- StringFrontCodedDictionaryEncodedColumn
- StringFrontCodedDictionaryEncodedColumnSupplier
- StringFrontCodedColumnIndexSupplier
- NestedDataColumnMerger
- NestedDataColumnSupplier
- CompressedNestedDataComplexColumn
- NestedFieldLiteralColumnIndexSupplier
- NestedFieldLiteralDictionaryEncodedColumn
This PR has:
- [x] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [x] been tested in a test Druid cluster.
This pull request introduces 2 alerts when merging aadf6cda0e3b4405a44f493a9d59a0ccf886dde1 into b1640a72ee1090bd3c90240bf47a5bfd5b690f5a - view on LGTM.com
new alerts:
- 2 for Subtle call to inherited method
Some additional less scientific measurements, using a 10GB file of the nyc taxi dataset with all of the columns stored as strings:
grouping performance seems competitive:
select * does show a performance decrease as the earlier benchmarks suggested:
I still haven't had the chance to spend any time optimizing the code, but the size savings definitely make this feel worth considering for clusters where the typical workload does not include queries that hit a large number of columns, such as "wide" scans ("select *", etc.) or large group bys.
There aren't any explicit conflicts, but this PR needs to be updated after #12315 to implement the new methods; I'll try to see if I can consolidate some of the code a bit better when I fix it up.
This pull request introduces 1 alert when merging 28cf810ee57a32180ee471ab888a5b439de912ba into 331e6d707b003391d44ee583dbc328b2b6359425 - view on LGTM.com
new alerts:
- 1 for Inconsistent equals and hashCode
2.5m rows of data generated with https://github.com/lemire/StarSchemaBenchmark:

also (possibly coincidence) the smaller encoding seems to have a small positive effect on ingest task time:
ran taxi set again with mixed types, it gets pretty good improvements from the date/time strings:
also sometimes the improvement is not a lot, pretty small overlap in this one:
the slowdown on select * is also no longer quite as dramatic as in my first screenshots:
For reference, these were the first set of benchmarks when I opened the PR (since I have edited the PR description with the updated benchmarks):
Benchmark (indexType) (numElements) (numOperations) (width) Mode Cnt Score Error Units
FrontCodedIndexedBenchmark.get generic 10000 10000 16 avgt 5 78.819 ± 2.541 ns/op
FrontCodedIndexedBenchmark.get:encoded size generic 10000 10000 16 avgt 5 240010.000 bytes
FrontCodedIndexedBenchmark.get front-coded-4 10000 10000 16 avgt 5 220.887 ± 19.267 ns/op
FrontCodedIndexedBenchmark.get:encoded size front-coded-4 10000 10000 16 avgt 5 169894.000 bytes
FrontCodedIndexedBenchmark.get front-coded-16 10000 10000 16 avgt 5 629.621 ± 59.667 ns/op
FrontCodedIndexedBenchmark.get:encoded size front-coded-16 10000 10000 16 avgt 5 164428.000 bytes
FrontCodedIndexedBenchmark.indexOf generic 10000 10000 16 avgt 5 1228.012 ± 100.691 ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size generic 10000 10000 16 avgt 5 240010.000 bytes
FrontCodedIndexedBenchmark.indexOf front-coded-4 10000 10000 16 avgt 5 2101.530 ± 209.465 ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-4 10000 10000 16 avgt 5 169908.000 bytes
FrontCodedIndexedBenchmark.indexOf front-coded-16 10000 10000 16 avgt 5 2364.248 ± 156.352 ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-16 10000 10000 16 avgt 5 164321.000 bytes
FrontCodedIndexedBenchmark.iterator generic 10000 10000 16 avgt 5 80.247 ± 1.639 ns/op
FrontCodedIndexedBenchmark.iterator:encoded size generic 10000 10000 16 avgt 5 240010.000 bytes
FrontCodedIndexedBenchmark.iterator front-coded-4 10000 10000 16 avgt 5 101.921 ± 6.225 ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-4 10000 10000 16 avgt 5 169889.000 bytes
FrontCodedIndexedBenchmark.iterator front-coded-16 10000 10000 16 avgt 5 105.235 ± 2.333 ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-16 10000 10000 16 avgt 5 164221.000 bytes
FrontCodedIndexedBenchmark.get generic 100000 10000 16 avgt 5 95.947 ± 3.842 ns/op
FrontCodedIndexedBenchmark.get:encoded size generic 100000 10000 16 avgt 5 2400010.000 bytes
FrontCodedIndexedBenchmark.get front-coded-4 100000 10000 16 avgt 5 239.630 ± 40.148 ns/op
FrontCodedIndexedBenchmark.get:encoded size front-coded-4 100000 10000 16 avgt 5 1636718.000 bytes
FrontCodedIndexedBenchmark.get front-coded-16 100000 10000 16 avgt 5 648.001 ± 28.965 ns/op
FrontCodedIndexedBenchmark.get:encoded size front-coded-16 100000 10000 16 avgt 5 1564861.000 bytes
FrontCodedIndexedBenchmark.indexOf generic 100000 10000 16 avgt 5 1612.575 ± 149.556 ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size generic 100000 10000 16 avgt 5 2400010.000 bytes
FrontCodedIndexedBenchmark.indexOf front-coded-4 100000 10000 16 avgt 5 2934.297 ± 176.182 ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-4 100000 10000 16 avgt 5 1636652.000 bytes
FrontCodedIndexedBenchmark.indexOf front-coded-16 100000 10000 16 avgt 5 3235.271 ± 438.586 ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-16 100000 10000 16 avgt 5 1563940.000 bytes
FrontCodedIndexedBenchmark.iterator generic 100000 10000 16 avgt 5 787.253 ± 83.715 ns/op
FrontCodedIndexedBenchmark.iterator:encoded size generic 100000 10000 16 avgt 5 2400010.000 bytes
FrontCodedIndexedBenchmark.iterator front-coded-4 100000 10000 16 avgt 5 1048.532 ± 29.200 ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-4 100000 10000 16 avgt 5 1636546.000 bytes
FrontCodedIndexedBenchmark.iterator front-coded-16 100000 10000 16 avgt 5 1054.068 ± 66.667 ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-16 100000 10000 16 avgt 5 1565019.000 bytes
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 4 5000000 none false avgt 5 152.079 ± 7.234 ms/op
SqlBenchmark.querySql 4 5000000 none force avgt 5 76.931 ± 3.635 ms/op
SqlBenchmark.querySql 4 5000000 fc4 false avgt 5 151.430 ± 11.542 ms/op
SqlBenchmark.querySql 4 5000000 fc4 force avgt 5 77.045 ± 3.972 ms/op
SqlBenchmark.querySql 4 5000000 fc16 false avgt 5 160.475 ± 16.692 ms/op
SqlBenchmark.querySql 4 5000000 fc16 force avgt 5 75.423 ± 4.572 ms/op
SqlBenchmark.querySql 5 5000000 none false avgt 5 49.356 ± 2.468 ms/op
SqlBenchmark.querySql 5 5000000 none force avgt 5 49.141 ± 2.853 ms/op
SqlBenchmark.querySql 5 5000000 fc4 false avgt 5 49.169 ± 2.025 ms/op
SqlBenchmark.querySql 5 5000000 fc4 force avgt 5 49.232 ± 2.801 ms/op
SqlBenchmark.querySql 5 5000000 fc16 false avgt 5 49.438 ± 2.585 ms/op
SqlBenchmark.querySql 5 5000000 fc16 force avgt 5 49.451 ± 3.924 ms/op
SqlBenchmark.querySql 6 5000000 none false avgt 5 199.922 ± 7.085 ms/op
SqlBenchmark.querySql 6 5000000 none force avgt 5 105.862 ± 7.491 ms/op
SqlBenchmark.querySql 6 5000000 fc4 false avgt 5 209.565 ± 21.137 ms/op
SqlBenchmark.querySql 6 5000000 fc4 force avgt 5 106.155 ± 7.080 ms/op
SqlBenchmark.querySql 6 5000000 fc16 false avgt 5 199.692 ± 11.608 ms/op
SqlBenchmark.querySql 6 5000000 fc16 force avgt 5 107.823 ± 9.708 ms/op
SqlBenchmark.querySql 7 5000000 none false avgt 5 154.474 ± 7.260 ms/op
SqlBenchmark.querySql 7 5000000 none force avgt 5 84.477 ± 4.757 ms/op
SqlBenchmark.querySql 7 5000000 fc4 false avgt 5 150.924 ± 8.694 ms/op
SqlBenchmark.querySql 7 5000000 fc4 force avgt 5 82.500 ± 4.569 ms/op
SqlBenchmark.querySql 7 5000000 fc16 false avgt 5 151.421 ± 7.497 ms/op
SqlBenchmark.querySql 7 5000000 fc16 force avgt 5 86.868 ± 3.030 ms/op
SqlBenchmark.querySql 10 5000000 none false avgt 5 780.820 ± 68.020 ms/op
SqlBenchmark.querySql 10 5000000 none force avgt 5 432.962 ± 62.454 ms/op
SqlBenchmark.querySql 10 5000000 fc4 false avgt 5 798.933 ± 71.454 ms/op
SqlBenchmark.querySql 10 5000000 fc4 force avgt 5 458.484 ± 37.257 ms/op
SqlBenchmark.querySql 10 5000000 fc16 false avgt 5 872.225 ± 49.937 ms/op
SqlBenchmark.querySql 10 5000000 fc16 force avgt 5 522.340 ± 35.679 ms/op
SqlBenchmark.querySql 12 5000000 none false avgt 5 69.159 ± 4.653 ms/op
SqlBenchmark.querySql 12 5000000 none force avgt 5 33.735 ± 2.004 ms/op
SqlBenchmark.querySql 12 5000000 fc4 false avgt 5 69.006 ± 4.162 ms/op
SqlBenchmark.querySql 12 5000000 fc4 force avgt 5 33.413 ± 1.792 ms/op
SqlBenchmark.querySql 12 5000000 fc16 false avgt 5 68.738 ± 3.524 ms/op
SqlBenchmark.querySql 12 5000000 fc16 force avgt 5 34.627 ± 2.654 ms/op
SqlBenchmark.querySql 19 5000000 none false avgt 5 700.311 ± 100.489 ms/op
SqlBenchmark.querySql 19 5000000 none force avgt 5 601.641 ± 39.403 ms/op
SqlBenchmark.querySql 19 5000000 fc4 false avgt 5 1056.911 ± 91.692 ms/op
SqlBenchmark.querySql 19 5000000 fc4 force avgt 5 963.832 ± 55.223 ms/op
SqlBenchmark.querySql 19 5000000 fc16 false avgt 5 2159.028 ± 145.168 ms/op
SqlBenchmark.querySql 19 5000000 fc16 force avgt 5 2107.042 ± 123.906 ms/op
SqlBenchmark.querySql 21 5000000 none false avgt 5 773.091 ± 46.505 ms/op
SqlBenchmark.querySql 21 5000000 none force avgt 5 459.661 ± 36.800 ms/op
SqlBenchmark.querySql 21 5000000 fc4 false avgt 5 837.426 ± 41.612 ms/op
SqlBenchmark.querySql 21 5000000 fc4 force avgt 5 504.546 ± 18.115 ms/op
SqlBenchmark.querySql 21 5000000 fc16 false avgt 5 919.155 ± 106.868 ms/op
SqlBenchmark.querySql 21 5000000 fc16 force avgt 5 602.116 ± 7.865 ms/op
SqlBenchmark.querySql 22 5000000 none false avgt 5 44.268 ± 2.707 ms/op
SqlBenchmark.querySql 22 5000000 none force avgt 5 43.919 ± 2.962 ms/op
SqlBenchmark.querySql 22 5000000 fc4 false avgt 5 44.458 ± 3.590 ms/op
SqlBenchmark.querySql 22 5000000 fc4 force avgt 5 43.738 ± 2.317 ms/op
SqlBenchmark.querySql 22 5000000 fc16 false avgt 5 44.304 ± 2.876 ms/op
SqlBenchmark.querySql 22 5000000 fc16 force avgt 5 43.906 ± 2.283 ms/op
SqlBenchmark.querySql 23 5000000 none false avgt 5 18260.991 ± 1041.530 ms/op
SqlBenchmark.querySql 23 5000000 none force avgt 5 18400.261 ± 1689.495 ms/op
SqlBenchmark.querySql 23 5000000 fc4 false avgt 5 30704.667 ± 1624.680 ms/op
SqlBenchmark.querySql 23 5000000 fc4 force avgt 5 30603.476 ± 621.976 ms/op
SqlBenchmark.querySql 23 5000000 fc16 false avgt 5 48777.535 ± 3223.712 ms/op
SqlBenchmark.querySql 23 5000000 fc16 force avgt 5 49966.818 ± 3739.684 ms/op
We need to add some query benchmarks that include more realistic filter lists. Likely, we should have one query with an IN filter with 100 entries, another with 1000 entries and a third with 10,000 entries. If we want to vary it, we should run them each with different "hit ratios", but for now, we can perhaps give them a 20% overlap with actual values that exist in the dictionaries. If we do that, I suspect that we will start exercising some of the code paths that I'm curious about.
This is currently sort of provided by the FrontCodedIndexedBenchmark, which I think is the most appropriate place to measure that directly, though I only did 10k entries and didn't spend much time adding non-existent values since I wasn't specifically targeting the IN filter. I don't think the general SQL benchmark is the most appropriate place to go that in-depth on specific filters, and in my experience, most of the IN filters with 1k+ entries I have seen are machine-generated native queries rather than SQL, since the overhead of parsing that much text as SQL and then planning it also starts to stick out.
On the IN filter benchmarks, I've definitely seen client-generated queries with large IN filters. Looker generates them sometimes, other tools generate them as well, and there are definitely applications that generate them. It is programmatically generated, yes, but it comes from external sources. I forget whether we've fixed this yet, but this is common enough that we have (or had?) a known issue around SQL parsing of large IN filters, where the parser first puts them all into ORs before converting back to IN, and our planning code for some reason likes to do O(n^2) passes over ORs. Having a benchmark that also covers that case would be nice.
When I look at the massive UNION ALL query in query 19, there is an actual difference in timings between GenericIndexed and FrontCodedIndexed for that query. I just wonder if that's because of the IN clause at the very end of that query...
Just looked again at the FrontCodedIndexBenchmark and I see the GenericIndexed in there too. I can't help but wonder if the differences there are actually from the fact that FrontCoded is dealing with UTF8 bytes instead of String. I.e. all of the glut of dealing with String is equivalent to the glut of the extra objects and they are just equaling each other out. If that's the case, then there's even more benefit to be had from the FrontCodedIndexBenchmark. The other thing that's much more difficult to measure, but that we have seen, is that garbage invokes GC and GC stops all queries from making progress. This can limit the overall level of concurrency of the distributed system and can be difficult to measure in a benchmark.
> Just looked again at the FrontCodedIndexBenchmark and I see the GenericIndexed in there too. I can't help but wonder if the differences there are actually from the fact that FrontCoded is dealing with UTF8 bytes instead of String. I.e. all of the glut of dealing with String is equivalent to the glut of the extra objects and they are just equaling each other out. If that's the case, then there's even more benefit to be had from the FrontCodedIndexBenchmark.
Yeah, I don't think FrontCodedIndexed has that much more garbage than regular string columns using GenericIndexed; a bit more, but not a ton. I would agree that there is a lot of room for improvement (probably with GenericIndexed string columns too). In FrontCodedIndexBenchmark, the FrontCodedIndexed is converting values to String and from ByteBuffer as well, so it should be apples to apples, I think, mostly. My earlier implementation of FrontCodedIndexed was Indexed&lt;String&gt;, since this PR predated #12517; you can see the older numbers here: https://github.com/apache/druid/pull/12277#issuecomment-1255538371
I spent the last couple of days optimizing this heavily and it seems to have paid off: I've managed to eliminate almost all of the performance difference between this and standard GenericIndexed-based string columns, and the remaining overhead looks acceptable given the size savings. I've updated the benchmarks in the PR description to reflect this.
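For anyone skimming the thread, here's a toy illustration of the front-coding scheme from the PR description — not the actual FrontCodedIndexed buffer layout or code, just the idea that within a bucket of sorted values the first value is stored whole and each later value as a (prefix length relative to the first value, suffix) pair:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FrontCodingSketch {
  static int commonPrefixLength(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  // Encode one bucket: first value stored whole (prefix length 0),
  // the rest as (shared prefix length with the first value, remaining suffix).
  static List<Object[]> encodeBucket(List<String> sortedValues) {
    List<Object[]> bucket = new ArrayList<>();
    byte[] first = sortedValues.get(0).getBytes(StandardCharsets.UTF_8);
    bucket.add(new Object[]{0, first});
    for (int i = 1; i < sortedValues.size(); i++) {
      byte[] bytes = sortedValues.get(i).getBytes(StandardCharsets.UTF_8);
      int prefix = commonPrefixLength(first, bytes);
      byte[] suffix = new byte[bytes.length - prefix];
      System.arraycopy(bytes, prefix, suffix, 0, suffix.length);
      bucket.add(new Object[]{prefix, suffix});
    }
    return bucket;
  }

  // Decode value i by stitching the stored prefix of the first value onto the suffix.
  static String decode(List<Object[]> bucket, int i) {
    byte[] first = (byte[]) bucket.get(0)[1];
    int prefix = (int) bucket.get(i)[0];
    byte[] suffix = (byte[]) bucket.get(i)[1];
    byte[] out = new byte[prefix + suffix.length];
    System.arraycopy(first, 0, out, 0, prefix);
    System.arraycopy(suffix, 0, out, prefix, suffix.length);
    return new String(out, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    List<Object[]> bucket = encodeBucket(List.of("druid", "druidic", "drum", "dry"));
    System.out.println(decode(bucket, 2)); // prints "drum"
  }
}
```

The real implementation works over ByteBuffers with variable-length integers rather than object lists, but the encode/decode round trip is the same shape.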
I ran a few of the SqlNestedDataBenchmark queries that do string stuff just to spot check, and things look good there too:
SELECT string1, SUM(long1) FROM foo GROUP BY 1 ORDER BY 2
SELECT JSON_VALUE(nested, '$.nesteder.string1'), SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo GROUP BY 1 ORDER BY 2
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 6 5000000 none false avgt 5 229.141 ± 4.357 ms/op
SqlNestedDataBenchmark.querySql 6 5000000 none force avgt 5 158.286 ± 1.982 ms/op
SqlNestedDataBenchmark.querySql 6 5000000 front-coded-4 false avgt 5 226.019 ± 2.990 ms/op
SqlNestedDataBenchmark.querySql 6 5000000 front-coded-4 force avgt 5 154.666 ± 0.682 ms/op
SqlNestedDataBenchmark.querySql 6 5000000 front-coded-16 false avgt 5 218.805 ± 3.507 ms/op
SqlNestedDataBenchmark.querySql 6 5000000 front-coded-16 force avgt 5 159.220 ± 9.396 ms/op
SqlNestedDataBenchmark.querySql 7 5000000 none false avgt 5 379.591 ± 6.253 ms/op
SqlNestedDataBenchmark.querySql 7 5000000 none force avgt 5 196.781 ± 3.562 ms/op
SqlNestedDataBenchmark.querySql 7 5000000 front-coded-4 false avgt 5 369.041 ± 4.383 ms/op
SqlNestedDataBenchmark.querySql 7 5000000 front-coded-4 force avgt 5 197.589 ± 3.049 ms/op
SqlNestedDataBenchmark.querySql 7 5000000 front-coded-16 false avgt 5 379.980 ± 2.840 ms/op
SqlNestedDataBenchmark.querySql 7 5000000 front-coded-16 force avgt 5 198.248 ± 4.503 ms/op
SELECT SUM(long1) FROM foo WHERE string1 = '10000' OR string1 = '1000'
SELECT SUM(JSON_VALUE(nested, '$.long1' RETURNING BIGINT)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') = '10000' OR JSON_VALUE(nested, '$.nesteder.string1') = '1000'
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 10 5000000 none false avgt 5 11.487 ± 0.236 ms/op
SqlNestedDataBenchmark.querySql 10 5000000 none force avgt 5 11.472 ± 0.201 ms/op
SqlNestedDataBenchmark.querySql 10 5000000 front-coded-4 false avgt 5 11.509 ± 0.198 ms/op
SqlNestedDataBenchmark.querySql 10 5000000 front-coded-4 force avgt 5 11.510 ± 0.297 ms/op
SqlNestedDataBenchmark.querySql 10 5000000 front-coded-16 false avgt 5 11.480 ± 0.288 ms/op
SqlNestedDataBenchmark.querySql 10 5000000 front-coded-16 force avgt 5 11.458 ± 0.270 ms/op
SqlNestedDataBenchmark.querySql 11 5000000 none false avgt 5 11.650 ± 0.274 ms/op
SqlNestedDataBenchmark.querySql 11 5000000 none force avgt 5 11.674 ± 0.254 ms/op
SqlNestedDataBenchmark.querySql 11 5000000 front-coded-4 false avgt 5 11.681 ± 0.312 ms/op
SqlNestedDataBenchmark.querySql 11 5000000 front-coded-4 force avgt 5 11.672 ± 0.340 ms/op
SqlNestedDataBenchmark.querySql 11 5000000 front-coded-16 false avgt 5 11.792 ± 0.383 ms/op
SqlNestedDataBenchmark.querySql 11 5000000 front-coded-16 force avgt 5 11.809 ± 0.422 ms/op
SELECT long1, SUM(double3) FROM foo WHERE string1 = '10000' OR string1 = '1000' GROUP BY 1 ORDER BY 2
SELECT JSON_VALUE(nested, '$.long1' RETURNING BIGINT), SUM(JSON_VALUE(nested, '$.nesteder.double3' RETURNING DOUBLE)) FROM foo WHERE JSON_VALUE(nested, '$.nesteder.string1') = '10000' OR JSON_VALUE(nested, '$.nesteder.string1') = '1000' GROUP BY 1 ORDER BY 2
Benchmark (query) (rowsPerSegment) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 16 5000000 none false avgt 5 126.009 ± 1.829 ms/op
SqlNestedDataBenchmark.querySql 16 5000000 none force avgt 5 125.930 ± 2.802 ms/op
SqlNestedDataBenchmark.querySql 16 5000000 front-coded-4 false avgt 5 125.991 ± 1.981 ms/op
SqlNestedDataBenchmark.querySql 16 5000000 front-coded-4 force avgt 5 126.098 ± 4.202 ms/op
SqlNestedDataBenchmark.querySql 16 5000000 front-coded-16 false avgt 5 125.795 ± 6.560 ms/op
SqlNestedDataBenchmark.querySql 16 5000000 front-coded-16 force avgt 5 126.172 ± 3.807 ms/op
SqlNestedDataBenchmark.querySql 17 5000000 none false avgt 5 126.375 ± 2.382 ms/op
SqlNestedDataBenchmark.querySql 17 5000000 none force avgt 5 125.585 ± 0.396 ms/op
SqlNestedDataBenchmark.querySql 17 5000000 front-coded-4 false avgt 5 125.678 ± 2.668 ms/op
SqlNestedDataBenchmark.querySql 17 5000000 front-coded-4 force avgt 5 125.355 ± 2.104 ms/op
SqlNestedDataBenchmark.querySql 17 5000000 front-coded-16 false avgt 5 127.011 ± 5.057 ms/op
SqlNestedDataBenchmark.querySql 17 5000000 front-coded-16 force avgt 5 126.835 ± 3.172 ms/op
Segment sizes for this benchmark are ~3.4GB instead of 3.6GB, but it has the same issue as SqlBenchmark with the data generator: the data is all stringified numbers, so it is limited in its ability to benefit from this encoding.
Got it. Thanks for explaining.
In processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnSupplier.java https://github.com/apache/druid/pull/12277#discussion_r1002613660:
dictionary = GenericIndexed.read(stringDictionaryBuffer, GenericIndexed.BYTE_BUFFER_STRATEGY, mapper);
frontCodedDictionary = null;
Ah, so I decided to just make it a Supplier for users of FrontCodedIndexed so that I don't have to do what nearly all of the actual users of GenericIndexed are doing: calling singleThreaded() on the top-level column GenericIndexed dictionary to get an optimized version that isn't thread-safe, for use in a single thread, so that it creates less garbage. By using a supplier I can avoid all this, since every thread just gets its own copy (which is basically what all callers are doing anyway).
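A hypothetical sketch of the supplier pattern being described — the names here are illustrative, not the actual Druid API; the point is that handing each thread its own cheap view over shared immutable data removes the need for a singleThreaded()-style escape hatch:

```java
import java.util.function.Supplier;

public class SupplierPatternSketch {
  // Stand-in for a dictionary lookup; the real type would be FrontCodedIndexed.
  interface Dictionary {
    String get(int index);
  }

  // Each call to the supplier returns a fresh, single-thread view over the
  // shared immutable backing data, so no per-lookup state is ever shared.
  static Supplier<Dictionary> dictionarySupplier(String[] sharedSortedValues) {
    return () -> index -> sharedSortedValues[index];
  }

  public static void main(String[] args) {
    Supplier<Dictionary> supplier = dictionarySupplier(new String[]{"a", "b", "c"});
    Dictionary perThreadCopy = supplier.get(); // called once per query thread
    System.out.println(perThreadCopy.get(1)); // prints "b"
  }
}
```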