
DESIGN: Cheaper DataFrame.append

Open wesm opened this issue 8 years ago • 3 comments

I'm thinking we can come up with a plan to yield a better .append implementation that defers stitching together arrays until it's actually needed for computations.

We can do this by having a virtual pandas::Table interface that consolidates fragmented columns only when they are requested. Will think some more about this.
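To make the deferred-stitching idea concrete, here is a minimal sketch in plain Python. It is not the proposed pandas::Table design: `ChunkedColumn`, its methods, and the `consolidations` counter are hypothetical names made up for illustration, and the stdlib `array` module stands in for a real column buffer. Appends only store a new fragment; the fragments are merged the first time contiguous values are requested.

```python
from array import array


class ChunkedColumn:
    """Hypothetical column that keeps appended fragments and merges them lazily."""

    def __init__(self, typecode="d"):
        self._typecode = typecode
        self._chunks = []          # fragments that have not been stitched together
        self.consolidations = 0    # number of times fragments were actually merged

    def append(self, values):
        # Only the new fragment is copied; existing data is left untouched.
        self._chunks.append(array(self._typecode, values))

    def __len__(self):
        # The length can be answered without consolidating.
        return sum(len(chunk) for chunk in self._chunks)

    def values(self):
        # Consolidate only when contiguous data is requested.
        if len(self._chunks) > 1:
            merged = array(self._typecode)
            for chunk in self._chunks:
                merged.extend(chunk)
            self._chunks = [merged]
            self.consolidations += 1
        return self._chunks[0] if self._chunks else array(self._typecode)


col = ChunkedColumn()
for start in range(0, 9, 3):
    col.append(range(start, start + 3))   # three appends, no stitching yet

n = len(col)                      # answered from the fragments
total = sum(col.values())         # first computation triggers the merge
total_again = sum(col.values())   # already contiguous, nothing is copied again
```

In this sketch, checking the size between appends stays cheap, while anything that needs the actual values pays for one merge of whatever has accumulated since the last request.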

wesm avatar Oct 18 '16 17:10 wesm

The obvious alternative is to allow pandas objects to be backed by dynamic arrays. This is possible now that we require arrays to be 1D and contiguous.

This has the advantage of still using eager evaluation, so you don't need to build machinery for deferred evaluation. Also, you still get predictable performance, even if you inspect the array in between appends. I would guess that looking at DataFrames while they are being appended to piece by piece is pretty common, even if only to check the size.

The downside is that this wouldn't really work with the current interface, because such appends need to be in-place. Also, dynamic arrays reduce speed and increase memory requirements by small constant multiples.

Maybe it would make sense to deprecate DataFrame.append and instead make an alternative DynamicDataFrame (sub?)class that does an in-place append?

shoyer avatar Oct 18 '16 19:10 shoyer

We could definitely have a mutating append and write into resizeable buffers (with growth factor 1.5 or 2). Something we can experiment with.
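As a rough sketch of what a mutating append into a resizeable buffer could look like (plain Python, stdlib `array` standing in for the real storage): `GrowableBuffer`, its parameters, and the `copied` counter are hypothetical and exist only for this example. The buffer is over-allocated, and when it runs out of room the capacity is multiplied by the growth factor, so existing elements are copied only on the occasional resize.

```python
from array import array


class GrowableBuffer:
    """Hypothetical 1D buffer with in-place append and geometric growth."""

    def __init__(self, typecode="d", capacity=4, growth=2.0):
        self._typecode = typecode
        self._growth = growth
        self._data = array(typecode, [0]) * capacity   # pre-allocated storage
        self._size = 0                                 # number of slots in use
        self.copied = 0                                # elements moved by resizes

    def _reserve(self, needed):
        capacity = len(self._data)
        if needed <= capacity:
            return
        # Grow geometrically (by at least one slot) until the request fits.
        while capacity < needed:
            capacity = max(capacity + 1, int(capacity * self._growth))
        new_data = array(self._typecode, [0]) * capacity
        new_data[: self._size] = self._data[: self._size]
        self.copied += self._size
        self._data = new_data

    def append(self, values):
        # Mutates the buffer in place instead of returning a new object.
        values = array(self._typecode, values)
        end = self._size + len(values)
        self._reserve(end)
        self._data[self._size:end] = values
        self._size = end

    def __len__(self):
        return self._size

    def view(self):
        # Zero-copy view of the used part of the buffer.
        return memoryview(self._data)[: self._size]


buf = GrowableBuffer(capacity=4, growth=2.0)
for i in range(1000):
    buf.append([i])            # one element at a time, the worst case for resizing

resize_copies = buf.copied                  # elements moved by all resizes
naive_copies = sum(range(1000))             # cost of recopying everything per append
```

With a growth factor of 2 the elements moved by resizes stay within a small constant multiple of the final length, whereas recopying the whole array on every append grows quadratically. The price is the unused capacity at the end of the buffer, which is the memory overhead mentioned above.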

wesm avatar Oct 20 '16 21:10 wesm

Related to this, I think enlargement via an indexer is a bit too magical / non-transparent / non-performant, unless you have a growable buffer.

Better to be explicit at a small loss of convenience in syntax.

jreback avatar Jan 26 '17 00:01 jreback