velox
velox copied to clipboard
VectorFuzzer: Add support for uneven dictionary generation
VectorFuzzer can generate dictionaries with multiple levels, but today the indirection and base vectors always have the same size. To increase entropy, let's extend it to allow the generation of dictionaries where the base vector size could be smaller or larger than the indirection size.
Relevant code pointer:
https://github.com/facebookincubator/velox/blob/main/velox/vector/fuzzer/VectorFuzzer.cpp#L391
Hi @pedroerp,
I am new to velox so I am not familiar with the code base at all. However, since the issue is labeled as "good first" I took a look at the code you pointed. First I don't understand the following check
https://github.com/facebookincubator/velox/blob/2c733afd03e82b6d52f83871036409a55e0d7584/velox/vector/fuzzer/VectorFuzzer.cpp#L395-L397
Should not it be vectorSize == 0 || size > 0
to be consistent with the error message?
Anyway, I am not getting you request either: what do you mean with indirection size? I thought that size
and vectorSize
are needed to be equal, but that's not the case as I reckon from the BaseVector::wrapInDictionary()
method.
Maybe you mean that here
https://github.com/facebookincubator/velox/blob/2c733afd03e82b6d52f83871036409a55e0d7584/velox/vector/fuzzer/VectorFuzzer.cpp#L401-L403
% vectorSize
can actually be something smaller or larger?
Can you provide more details please? If you think that this request is too much for a novice, please let me know, so that I'd better shift to other issues :).
Thank you!
Should not it be vectorSize == 0 || size > 0 to be consistent with the error message?
Hi @randosrandom . The assertion is a bit hard to read, but as far as I can tell the error message is correct. This should only fail in the case where the inner vector is empty (vectorSize > 0 is false), and the indirection size is not empty (size == 0 is false).
Anyway, I am not getting you request either: what do you mean with indirection size? I thought that size and vectorSize are needed to be equal, but that's not the case as I reckon from the BaseVector::wrapInDictionary() method.
They don't need to be equal (and most of the time they aren't). Dictionaries are usually used to express cardinality increase or reduction, in which cases the indirection vector is larger or smaller than the inner vector, respectively.
With that said, this is the function that would need to be changed:
https://github.com/facebookincubator/velox/blob/main/velox/vector/fuzzer/VectorFuzzer.cpp#L248-L284
Currently, it generates a vector, and it might call fuzzDictionary() to wrap it in a dictionary of the same size (minus the small slice adjustment). We should somehow ensure that in some cases the inner vector is smaller or larger than the indirection that will be used to wrap it.
@pedroerp I used to add a dictionary-encoded vector with a greater size than its base vector in ArrayContainsTest.dictionaryEncodingElements. This requirement should be similar to it by applying the following change. Is my understanding correct?
VectorPtr VectorFuzzer::fuzzDictionary(const VectorPtr& vector) {
- return fuzzDictionary(vector, vector->size());
+ if (coinToss(opts_.nullRatio)) {
+ return fuzzDictionary(vector, (rand<vector_size_t>(rng_) % 3) * vector->size());
+ } else {
+ return fuzzDictionary(vector, vector->size() / ((rand<vector_size_t>(rng_) % 3) + 1));
+ }
Like we discussed on Slack, something like this should work. We would just need to give it a long enough fuzzer run to make sure it won't trigger any dormant bugs in the codebase.