Florian Jetter
The current format specification was built with a multi-table dataset in mind and carries a lot of redundancies. This issue should collect requirements for a new `metadata_version=5` specification. xref...
The performance of the `io.cube.test_query` module is concerning: it takes up the majority of the runtime of the entire test suite. While this is a cornerstone of the cube functionality, the...
### Problem description The physical layout and indexing of a dataset strongly impact read performance. Often datasets are designed in such a way as to support a rather specific use case...
Arrow introduces two options which supposedly help with memory conservation. `self_destruct` frees each column as soon as it is converted, which renders the `pa.Table` object useless after the...
### Problem description The initial design for the indices was based on a version of this library where only single-value equality queries could be performed, see [here](https://github.com/JDASoftwareGroup/kartothek/blob/61ce401512e3a46969f1db56e2d2eec2f0c5b334/kartothek/core/dataset.py#L286). This motivated...
### Problem description The index build pipeline `build_dataset_indices__bag` may build indices of incompatible types when building an index for a date-typed column, leaving the dataset in...
### Problem description The hypothesis index tests are causing an OutOfBounds exception, e.g. https://travis-ci.com/JDASoftwareGroup/kartothek/jobs/254799676
We currently only support the deprecated query syntax for deletion scopes. It would be more intuitive to specify the deletion scope using the predicate syntax. Old syntax ``` update_dataset_from_ddf( new_ddf,...
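To illustrate the relationship between the two syntaxes, here is a sketch of how an old dict-style deletion scope could map onto the predicate (DNF) form: a list of OR'd conjunctions, each a list of `(column, op, value)` tuples. The helper name is hypothetical, not part of the kartothek API:

```python
# Hypothetical translation from the deprecated dict-style delete scope to the
# predicate syntax: each dict entry becomes one AND-conjunction of equality
# predicates, and the outer list is OR'd together.
def delete_scope_to_predicates(scope):
    return [[(col, "==", val) for col, val in entry.items()] for entry in scope]

old_scope = [{"date": "2020-01-01", "country": "DE"}]
predicates = delete_scope_to_predicates(old_scope)
# predicates == [[("date", "==", "2020-01-01"), ("country", "==", "DE")]]
```

This also shows why the predicate syntax is more expressive: it admits operators other than equality, which the dict form cannot represent.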
### Problem description The current partition pruning mechanism relies on the `Index.eval_operator` method, which *always* converts the dictionary to an array before evaluating the predicates. This conversion takes up most...
We're using Apache Arrow as the ultimate tool to glue everything together. When writing data, we accept pandas DataFrames, convert them to Arrow tables, and store them as Parquet. When...