velox icon indicating copy to clipboard operation
velox copied to clipboard

VectorFuzzer: Add support for uneven dictionary generation

Open pedroerp opened this issue 2 years ago • 2 comments

VectorFuzzer can generate dictionaries with multiple levels, but today the indirection and base vectors always have the same size. To increase entropy, let's extend it to allow the generation of dictionaries where the base vector size could be smaller or larger than the indirection size.

Relevant code pointer:

https://github.com/facebookincubator/velox/blob/main/velox/vector/fuzzer/VectorFuzzer.cpp#L391

pedroerp avatar Sep 16 '22 17:09 pedroerp

Hi @pedroerp,

I am new to velox so I am not familiar with the code base at all. However, since the issue is labeled as "good first" I took a look at the code you pointed. First I don't understand the following check https://github.com/facebookincubator/velox/blob/2c733afd03e82b6d52f83871036409a55e0d7584/velox/vector/fuzzer/VectorFuzzer.cpp#L395-L397 Should not it be vectorSize == 0 || size > 0 to be consistent with the error message?

Anyway, I am not getting you request either: what do you mean with indirection size? I thought that size and vectorSize are needed to be equal, but that's not the case as I reckon from the BaseVector::wrapInDictionary() method. Maybe you mean that here https://github.com/facebookincubator/velox/blob/2c733afd03e82b6d52f83871036409a55e0d7584/velox/vector/fuzzer/VectorFuzzer.cpp#L401-L403 % vectorSize can actually be something smaller or larger?

Can you provide more details please? If you think that this request is too much for a novice, please let me know, so that I'd better shift to other issues :).

Thank you!

randosrandom avatar Sep 17 '22 20:09 randosrandom

Should not it be vectorSize == 0 || size > 0 to be consistent with the error message?

Hi @randosrandom . The assertion is a bit hard to read, but as far as I can tell the error message is correct. This should only fail in the case where the inner vector is empty (vectorSize > 0 is false), and the indirection size is not empty (size == 0 is false).

Anyway, I am not getting you request either: what do you mean with indirection size? I thought that size and vectorSize are needed to be equal, but that's not the case as I reckon from the BaseVector::wrapInDictionary() method.

They don't need to be equal (and most of the time they aren't). Dictionaries are usually used to express cardinality increase or reduction, in which cases the indirection vector is larger or smaller than the inner vector, respectively.

With that said, this is the function that would need to be changed:

https://github.com/facebookincubator/velox/blob/main/velox/vector/fuzzer/VectorFuzzer.cpp#L248-L284

Currently, it generates a vector, and it might call fuzzDictionary() to wrap it in a dictionary of the same size (minus the small slice adjustment). We should somehow ensure that in some cases the inner vector is smaller or larger than the indirection that will be used to wrap it.

pedroerp avatar Oct 07 '22 03:10 pedroerp

@pedroerp I used to add a dictionary-encoded vector with a greater size than its base vector in ArrayContainsTest.dictionaryEncodingElements. This requirement should be similar to it by applying the following change. Is my understanding correct?

 VectorPtr VectorFuzzer::fuzzDictionary(const VectorPtr& vector) {
-  return fuzzDictionary(vector, vector->size());
+  if (coinToss(opts_.nullRatio)) {
+    return fuzzDictionary(vector, (rand<vector_size_t>(rng_) % 3) * vector->size());
+  } else {
+    return fuzzDictionary(vector, vector->size() / ((rand<vector_size_t>(rng_) % 3) + 1));
+  }

duanmeng avatar Sep 08 '23 17:09 duanmeng

Like we discussed on Slack, something like this should work. We would just need to give it a long enough fuzzer run to make sure it won't trigger any dormant bugs in the codebase.

pedroerp avatar Sep 20 '23 23:09 pedroerp