
"Vectorize" ColumnGeneratorCachedByIndex

Open MischaPanch opened this issue 11 months ago • 2 comments

ColumnGeneratorCachedByIndex is the recommended base for new cached column generators, but it can be significantly slower than the discouraged approach of first creating a ColumnGenerator and then adding a cache by wrapping it in IndexCachedColumnGenerator.

The reason is that IndexCachedColumnGenerator finds all non-cached values and processes them at once (i.e., batch-wise), whereas ColumnGeneratorCachedByIndex always loops through all values one at a time. Thus, the initial filling of the cache can be much slower.
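To make the difference concrete, here is a minimal sketch of the two cache-filling strategies. It does not use sensAI's classes; `fill_per_value`, `fill_batched`, and the plain dict cache are hypothetical stand-ins chosen purely for illustration.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence, Tuple


def fill_per_value(
    indices: Sequence[Hashable],
    cache: Dict[Hashable, Any],
    generate_value: Callable[[Hashable], Any],
) -> Tuple[List[Any], int]:
    """Loop over all indices and generate each missing value individually."""
    calls = 0
    for idx in indices:
        if idx not in cache:
            cache[idx] = generate_value(idx)  # one generator call per missing index
            calls += 1
    return [cache[idx] for idx in indices], calls


def fill_batched(
    indices: Sequence[Hashable],
    cache: Dict[Hashable, Any],
    generate_batch: Callable[[List[Hashable]], List[Any]],
) -> Tuple[List[Any], int]:
    """Collect all missing indices first, then generate them in a single call."""
    # dict.fromkeys removes duplicates while preserving order
    missing = [idx for idx in dict.fromkeys(indices) if idx not in cache]
    calls = 0
    if missing:
        values = generate_batch(missing)  # one generator call for the whole batch
        cache.update(zip(missing, values))
        calls = 1
    return [cache[idx] for idx in indices], calls


# Hypothetical usage: squaring stands in for an expensive computation.
per_value_result, per_value_calls = fill_per_value([1, 2, 3], {1: 1}, lambda i: i * i)
batched_result, batched_calls = fill_batched(
    [1, 2, 3], {1: 1}, lambda idxs: [i * i for i in idxs]
)
```

With an empty or mostly empty cache, the first variant makes one generator call per missing index, while the second makes a single call regardless of how many values are missing. This only pays off when the batch function is genuinely vectorized or otherwise cheaper per item.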

Not sure what to do here. One would need to redesign ColumnGeneratorCachedByIndex so that it does not rely on _generate_value, but that's a breaking change. Another way would be to write a new class à la VectorizedColumnGeneratorCachedByIndex, but I honestly feel that batch-wise processing of missing values should be the default behavior.
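As a rough illustration of what the "new class" option could look like, the sketch below defines a batch-first cached generator. Everything in it is an assumption: `VectorizedCachedGeneratorSketch`, its `_generate_values` hook, and the in-memory dict cache are hypothetical and do not reflect sensAI's actual interfaces or persistence layer.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Hashable, List, Sequence


class VectorizedCachedGeneratorSketch(ABC):
    """Hypothetical batch-first cached generator; not part of sensAI."""

    def __init__(self) -> None:
        self._cache: Dict[Hashable, Any] = {}  # stand-in for a real cache backend

    @abstractmethod
    def _generate_values(self, indices: Sequence[Hashable]) -> List[Any]:
        """Compute values for all given indices in one (ideally vectorized) call."""

    def generate(self, indices: Sequence[Hashable]) -> List[Any]:
        # Only indices that are not yet cached are passed to the subclass hook.
        missing = [idx for idx in dict.fromkeys(indices) if idx not in self._cache]
        if missing:
            values = self._generate_values(missing)
            if len(values) != len(missing):
                raise ValueError("_generate_values must return one value per index")
            self._cache.update(zip(missing, values))
        return [self._cache[idx] for idx in indices]


class SquareGenerator(VectorizedCachedGeneratorSketch):
    """Toy subclass that counts how often the batch hook is invoked."""

    def __init__(self) -> None:
        super().__init__()
        self.batch_calls = 0

    def _generate_values(self, indices: Sequence[Hashable]) -> List[Any]:
        self.batch_calls += 1
        return [i * i for i in indices]
```

Because subclasses implement a batch hook rather than a per-value one, existing `_generate_value` implementations would not carry over unchanged, which is the breaking-change concern raised above.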

MischaPanch, Feb 27 '24 11:02