add kwarg `chunksize` for default data partitioning on write
This PR proposes to introduce automatic partitioning of the provided tables when writing. It follows my findings from benchmarking against PyArrow.
Nowadays, most machines are multithreaded and `Arrow.write()` provides multithreaded writing for partitioned data. However, a user must explicitly partition their data.
Unfortunately, most users do not realize that both their write and subsequent read operations will not be multithreaded without such partitioning (there is an issue to improve the docs).
This PR defaults to partitioning data if it's larger than 64K rows (should be beneficial on most systems) to enable better Arrow.jl performance on both read and write.
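For context, a hedged before/after sketch (the `chunksize` calls show the behavior this PR proposes, not an existing Arrow.jl API; the manual path assumes DataFrames >= 1.5 so `Iterators.partition` yields row chunks):

```julia
using Arrow, DataFrames, Tables

tbl = DataFrame(a = 1:200_000, b = rand(200_000))   # example table, larger than 64K rows

# Current behavior: everything goes into a single record batch unless the user
# partitions explicitly, e.g. by wrapping row chunks with Tables.partitioner:
Arrow.write("data.arrow", Tables.partitioner(Iterators.partition(tbl, 64_000)))

# Proposed behavior (hypothetical -- this is what the PR adds): the plain call
# chunks automatically above the default chunksize, and `chunksize=nothing`
# opts back out to a single record batch:
Arrow.write("data.arrow", tbl)
Arrow.write("data.arrow", tbl; chunksize = nothing)
```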
Implementation:
- the new kwarg is called `chunksize` (maps to PyArrow and should be broadly understood)
- uses a default `chunksize` of 64000 rows, as per `PyArrow.write_feather`
- allows users to opt out by providing `chunksize=nothing`
- ~~partitioning is done via `Iterators.partition(Tables.rows(tbl), chunksize)` for all Tables.jl-compatible sources (checks `Tables.istable`)~~ changed to `Iterators.partition(tbl, chunksize)` to avoid missingness getting lost (eg, for DataFrames); see the sketch below
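A minimal sketch of the intended control flow (illustrative only; the helper name and the row-count guard are mine, not the PR's actual diff):

```julia
using Tables

# Simplified sketch of where the new `chunksize` kwarg would hook into the write path.
function maybe_partition(tbl; chunksize::Union{Integer,Nothing} = 64_000)
    chunksize === nothing && return tbl            # explicit opt-out
    Tables.rowaccess(tbl) || return tbl            # need row iteration to chunk safely
    # hypothetical guard; the real code may determine the row count differently
    nrows = length(Tables.rows(tbl))
    nrows <= chunksize && return tbl               # small tables: keep one record batch
    # each chunk becomes its own record batch, so both writing and later
    # reading can proceed over multiple threads
    return Tables.partitioner(Iterators.partition(tbl, chunksize))
end
```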
Some resources:
- PyArrow `write_feather` docs
- DataFrames.jl introduces an overload for `Iterators.partition` in the 1.5 release
- Arrow.jl author's blog post on partitioning (sidenote: I really enjoy your posts - please write more :-) )
I've also changed the condition for automatic partitioning to require `Tables.rowaccess(tbl) == true`, to avoid accepting column tables that don't provide row iterators.
In addition, I've changed `Iterators.partition(Tables.rows(tbl), chunksize)` to `Iterators.partition(tbl, chunksize)` to avoid the missingness type getting lost (eg, for DataFrames).
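A small repro of why the rows-based partitioning dropped missingness (a sketch of the mechanism, assuming DataFrames >= 1.5 for the table-level `Iterators.partition` overload; not the exact code path in `Arrow.write`):

```julia
using DataFrames, Tables

df = DataFrame(x = [1, 2, missing, 4])         # column eltype is Union{Missing, Int64}

# Chunking the *rows* and rebuilding columns per chunk infers types from the
# values actually present, so a chunk with no `missing` drops the Union:
rows_chunk = first(Iterators.partition(Tables.rows(df), 2))
eltype([row.x for row in rows_chunk])          # Int64 -- missingness lost for this chunk

# Chunking the table itself (DataFrames >= 1.5 defines this overload) keeps
# the declared column eltypes, so every chunk reports the same schema:
df_chunk = first(Iterators.partition(df, 2))
eltype(df_chunk.x)                             # Union{Missing, Int64}
```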
Okay, I was wrong. I misunderstood what the `rowaccess` requirements are -- `Iterators.partition()` still needs to be defined separately.
I've moved back to `Tables.rows` to ensure we get rows out.
I'm not sure what the best solution is here.
By far the simplest option would be to pass the schema down - because we have access to it before `Tables.columns` is called in the arrow construction (that's how we lose the schema, because we "materialize" the chunk as-is).
EDIT:
- The second simplest option would be to add a compat entry for DataFrames > 1.5.0 and use `Iterators.partition` directly (with some safety check that it indeed chunks rows, not columns... in case some unknown type defines it over columns)
Added compat for DataFrames via Extras
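For reference, the compat pinning presumably ends up looking roughly like this in the relevant Project.toml (the exact location depends on how the test/extras setup is wired; the UUID is DataFrames.jl's registered one):

```toml
[extras]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"

[compat]
DataFrames = "1.5"
```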
> By far the simplest option would be to pass the schema down - because we have access to it before `Tables.columns` is called in the arrow construction (that's how we lose the schema, because we "materialize" the chunk as-is)
We could allow users to optionally provide the `Schema` in the `Base.open` constructor of the `Writer` object. If a user makes use of this then we should validate that the actual schema of each chunk matches the expected schema.
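A hypothetical sketch of that idea (the `schema` keyword and the per-chunk validation do not exist in Arrow.jl today; `open(Arrow.Writer, ...)` does):

```julia
using Arrow, DataFrames, Tables

df = DataFrame(a = 1:100, b = [iseven(i) ? missing : string(i) for i in 1:100])
expected = Tables.Schema((:a, :b), (Int64, Union{Missing, String}))

# `schema` is a hypothetical keyword proposed above, not an existing Arrow.jl kwarg.
open(Arrow.Writer, "data.arrow"; schema = expected) do writer
    for chunk in Iterators.partition(df, 25)        # DataFrames >= 1.5 overload
        # the Writer would compare Tables.schema(chunk) against the expected
        # schema here and throw on mismatch, instead of silently writing a
        # record batch with narrower column types
        Arrow.write(writer, chunk)
    end
end
```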