
pandas out_flavor for ctable

Open ARF1 opened this issue 10 years ago • 4 comments

Closes #176. Simplifies implementation of #66.

Summary:

  • introduction of an abstraction layer for the "results array"
  • implementation of a numpy specialisation of the abstraction layer
  • implementation of a pandas specialisation of the abstraction layer
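To make the three bullets concrete, here is a minimal sketch of what such an abstraction layer could look like. It is not the code from this PR: `ResultContainer`, `RowMajorResult`, `ColumnMajorResult` and `fetch` are hypothetical names, and plain Python lists stand in for the numpy structured array (row-major) and the pandas dataframe (column-major) so that the example runs without third-party packages.

```python
from abc import ABC, abstractmethod


class ResultContainer(ABC):
    """Hypothetical interface: a pre-sized target that is filled column by column."""

    def __init__(self, names, nrows):
        self.names = list(names)
        self.nrows = nrows

    @abstractmethod
    def set_column(self, name, values):
        """Store all values of one column."""

    @abstractmethod
    def finalize(self):
        """Return the object handed back to the caller."""


class RowMajorResult(ResultContainer):
    """Stand-in for a numpy structured array: one record per row."""

    def __init__(self, names, nrows):
        super().__init__(names, nrows)
        self._rows = [[None] * len(self.names) for _ in range(nrows)]

    def set_column(self, name, values):
        j = self.names.index(name)
        # Strided write: every row record is touched for each column.
        for i, value in enumerate(values):
            self._rows[i][j] = value

    def finalize(self):
        return [tuple(row) for row in self._rows]


class ColumnMajorResult(ResultContainer):
    """Stand-in for a pandas dataframe: one contiguous sequence per column."""

    def __init__(self, names, nrows):
        super().__init__(names, nrows)
        self._cols = {}

    def set_column(self, name, values):
        # Contiguous write: the column is copied in one go.
        self._cols[name] = list(values)

    def finalize(self):
        return self._cols


def fetch(columns, container_cls):
    """Fill a container of the requested flavor from a dict of columns."""
    names = list(columns)
    nrows = len(columns[names[0]]) if names else 0
    out = container_cls(names, nrows)
    for name in names:
        out.set_column(name, columns[name])
    return out.finalize()
```

The point of the design is that the query code (`fetch` here) only talks to the interface, so adding another output flavor means adding another subclass rather than touching the query path.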

This is a quick hack to demonstrate the performance gains possible with an output flavor that uses column-major ordering, here the pandas dataframe.

The architecture would need to be improved, since this implementation suffers a 3-4x performance penalty for db[1]-type queries due to increased Python overhead. For queries returning a larger number of rows, this penalty disappears.
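One way to check how a fixed per-query cost behaves for small and large results is a small `timeit` harness like the sketch below. It does not use bcolz: `DirectTable`, `LayeredTable` and `seconds_per_query` are hypothetical stand-ins that only mimic an extra Python-level layer around a lookup, so the ratios it prints say nothing about the actual numbers in #176.

```python
import timeit


class DirectTable:
    """Hypothetical baseline: returns the requested rows with no extra layer."""

    def __init__(self, values):
        self.values = values

    def __getitem__(self, key):
        return self.values[key]


class _Result:
    """Hypothetical result container that adds Python-level work per query."""

    def __init__(self):
        self.payload = None

    def fill(self, payload):
        self.payload = payload

    def finalize(self):
        return self.payload


class LayeredTable(DirectTable):
    """Same data access, routed through the extra container object."""

    def __getitem__(self, key):
        result = _Result()             # fixed cost, paid once per query
        result.fill(self.values[key])  # cost that grows with the result size
        return result.finalize()


def seconds_per_query(table, key, repeat=5, number=200):
    """Best-of-`repeat` average time of one table[key] lookup."""
    timer = timeit.Timer(lambda: table[key])
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    data = list(range(200_000))
    direct, layered = DirectTable(data), LayeredTable(data)
    for label, key in [("single row", 1), ("100k rows", slice(0, 100_000))]:
        ratio = seconds_per_query(layered, key) / seconds_per_query(direct, key)
        print(f"{label}: layered/direct = {ratio:.2f}")
```

Taking the minimum over several repeats reduces noise from other processes; the single-row case isolates the fixed cost, while the large slice shows whether that cost is amortised.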

Timing results in #176.

ARF1 avatar May 03 '15 14:05 ARF1

Would you mind adding some benchmarks in the 'bench/' directory showing the advantage of this approach? My idea is to set up a speed regression check based on the different benchmarks there. Thanks!

FrancescAlted avatar May 05 '15 17:05 FrancescAlted

@FrancescAlted

Would you mind adding some benchmarks in the 'bench/' directory showing the advantage of this approach?

I would be happy to. I just need to clarify what you are looking for:

This PR (pandas out_flavor) was only intended as a proof of concept; it was not really intended for inclusion in the code base. The architecture of the more general #187 (abstraction layer) is more performant (and easier to read).

Would you like me to provide a sample implementation of a pandas "out_flavor" for the new #187 (abstraction layer) instead, together with a benchmark for it, analogous to bench\getitem.py?

Or would you like a "rawer" benchmark that avoids __getitem__() (and its overhead) and shows only the best possible performance for filling a pandas dataframe, much as bench\pandas-todataframe.py does?

ARF1 avatar May 05 '15 17:05 ARF1

@FrancescAlted On reflection, I probably was not as clear as I could have been: when you speak of "this approach", do you mean

  • the column-major (vs. row-major) result array in isolation or
  • the abstraction layer (in whatever version) plus the pandas out-flavor implementation (vs. the current non-abstracted out flavor)?

ARF1 avatar May 05 '15 20:05 ARF1

What do you want us to do with this pull request?

esc avatar May 23 '15 04:05 esc