TextParse.jl icon indicating copy to clipboard operation
TextParse.jl copied to clipboard

Add support for PooledArray back into the package.

Open davidanthoff opened this issue 7 years ago • 3 comments

This commit removed support for it entirely, not clear to me why. There are some very legit usecases where I think it would be great to get PooledArrays, so I would be in favor of just adding that feature back in. Perfectly fine to have it behind an optional switch.

davidanthoff avatar Nov 01 '18 16:11 davidanthoff

The thinking was promotion to PooledArrays of different Ref types (UInt8 .. UInt64) was too expensive in terms of real csv read performance (you only want to read a CSV first, if you're compiling the function 4 times for every string column, that's bad).

In that context it only makes sense to convert string columns to PooledArrays after the column has been read as a StringVector. At which point you might as well do it outside TextParse (e.g. in JuliaDB)

shashi avatar Nov 01 '18 18:11 shashi

Why not default to UInt32 like CategoricalArrays? That's always a net win in terms of storage, and that suits 99% of use cases. Anyway, that's just an option, people can do something else if they need that.

The advantage of parsing directly to PooledArray is that you don't need to store so many identical strings in memory until you convert the column.

nalimilan avatar Nov 01 '18 18:11 nalimilan

Another thing that would be neat is to be able to specify this separately for different columns.

We are planning/thinking a lot about whole query optimization in Query.jl right now, and I can pretty easily see a scenario where for example something like load("foo.csv") |> @groupby(_.a) |> ... ends up automatically only loading the a column as a PooledArray, but not anything else, based on the analysis of a given query (as a whole, not just an individual query operation). I don't think any of this will appear in the next couple of weeks (or months) in Query.jl, but we have a team here now that is hopefully going to tackle this (pretty big) issue, and so starting to add features like that here might get another good payoff down the road.

davidanthoff avatar Nov 01 '18 19:11 davidanthoff