
Re-write predicate involving `in` operator to use disjunction of `==` terms

Open lr4d opened this issue 5 years ago • 3 comments

Problem description

We use the `in` operator internally in predicate parsing, but we could instead rewrite such predicates as a disjunction of `==` terms, e.g. `[[('A', 'in', [1, 4, 9, 13])]]` -> `[[('A', '==', 1)], [('A', '==', 4)], [('A', '==', 9)], [('A', '==', 13)]]`

We could apply this rewrite whenever a user passes predicates involving `in`, before the predicates are evaluated. In the micro-benchmarks below, the rewritten form is as fast as or faster than our current evaluation of `in` predicates.
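As a rough illustration of the rewrite, here is a minimal sketch in plain Python. The helper name `expand_in_predicates` is hypothetical and not part of kartothek. It assumes predicates are in disjunctive normal form (outer list OR-ed, inner lists AND-ed) and that every term is a `(column, operator, value)` tuple.

```python
from itertools import product


def expand_in_predicates(predicates):
    """Rewrite every ``in`` term as a disjunction of ``==`` terms.

    Hypothetical helper for illustration only, not kartothek API.
    ``predicates`` is a list of conjunctions; each conjunction is a list
    of ``(column, operator, value)`` tuples.
    """
    rewritten = []
    for conjunction in predicates:
        # Every term becomes a list of alternatives; terms that do not
        # use ``in`` have exactly one alternative (themselves).
        alternatives = [
            [(col, "==", item) for item in val] if op == "in" else [(col, op, val)]
            for col, op, val in conjunction
        ]
        # The cross product distributes the AND over the newly created ORs.
        rewritten.extend(list(combo) for combo in product(*alternatives))
    return rewritten


print(expand_in_predicates([[("A", "in", [1, 4, 9, 13])]]))
# [[('A', '==', 1)], [('A', '==', 4)], [('A', '==', 9)], [('A', '==', 13)]]
```

Two caveats with this sketch: a conjunction containing several `in` terms expands to the cross product of their values, which can grow quickly, and an `in` term with an empty list drops its conjunction entirely, so the caller would still need to decide how an empty result should be handled.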

Example code (ideally copy-pastable)

from functools import partial
from tempfile import TemporaryDirectory

import numpy as np
import pandas as pd
from storefact import get_store_from_url

from kartothek.io.eager import store_dataframes_as_dataset

# Keep a reference to the temporary directory so it is not cleaned up while the store is in use
tmpdir = TemporaryDirectory()
store_factory = partial(get_store_from_url, f"hfs://{tmpdir.name}")
dataset_uuid = "test"
df = pd.DataFrame({"A": range(10), "B": ["A", "B"] * 5, "C": [np.nan, *range(-10, -1)]})
dm = store_dataframes_as_dataset(
    store=store_factory, dataset_uuid=dataset_uuid, dfs=[df]*100, # partition_on=["A", "B"]
)

store_dataframes_as_dataset(
    store=store_factory, dataset_uuid="part", dfs=[df]*100, partition_on=["A", "B"]
)
from kartothek.io.eager import read_dataset_as_metapartitions
from kartothek.io_components.read import dispatch_metapartitions_from_factory

target = [1, 4, 9, 13]
predicates_in = [[("A", "in", target)]]
predicates_normal = [[("A", "==", n)] for n in target]

from kartothek.core.factory import DatasetFactory
f = DatasetFactory(dataset_uuid=dataset_uuid, store_factory=store_factory)
f_part = DatasetFactory(dataset_uuid="part", store_factory=store_factory)

%timeit dispatch_metapartitions_from_factory(f, predicates=predicates_in)
# 61 µs ± 9.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit dispatch_metapartitions_from_factory(f, predicates=predicates_normal)
# 50.7 µs ± 2.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit dispatch_metapartitions_from_factory(f_part, predicates=predicates_in)
# 51 µs ± 2.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit dispatch_metapartitions_from_factory(f_part, predicates=predicates_normal)
# 50.3 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


lr4d, Jul 28 '20 10:07