feat(parquet): figure out convention for multi-file parquet writing
Hmmm, I remember us talking about this before (sorry, this issue had slipped my mind when we talked about this last week).
Currently all our `to_csv`/`to_parquet` writers produce a single file (rather than a directory of files). There was an open question as to what the expected behavior should be for backends where single-file output is tricky/inefficient/impossible (backends like Spark or Dask).
Enumerating all the options I can think of:
1. `to_csv`/`to_parquet` always output a single file

We could fall back to pyarrow, or error, for backends where this is tricky. This is the current behavior.
2. `to_csv`/`to_parquet` always output a directory

Backends that only write a single file would instead write a directory containing a single file. This would be a breaking change, but it should work with any backend's native csv/parquet writer. It would, however, make the common case of writing a small csv of results trickier.
3. `to_csv`/`to_parquet` write in the backend-native style

The output style (single file or directory) depends on the backend. This is easy to implement, but different backends would produce different behaviors.
4. `to_csv`/`to_parquet` gain an option to write a directory

These methods would write a single file by default, but would have an option to write a directory instead. :shrug: whether this is a `directory=True` option, or whether it's inferred somehow from the input path. This would mean that users on backends that can only efficiently write to directories may need to opt into this support, but it would make the output type explicit.
t.to_csv("test.csv") # would error for the spark backend? Or fallback to pyarrow?
t.to_csv("test/", directory=True) # output a directory can use pyspark native behavior
5. New methods for writing directories

`to_csv`/`to_parquet` keep their existing behavior, and we add new `to_csv_dir`/`to_parquet_dir` (or much better named) methods for writing directories. Same caveats/questions as 4; a rough sketch of this shape is below.
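To make option 5 concrete, here is a minimal sketch of the user-facing shape it implies. The `to_parquet_dir` name and the `partition_by` keyword are placeholders taken from this discussion, not a settled API:

```python
import ibis

t = ibis.examples.penguins.fetch()

# existing behavior: a single parquet file
t.to_parquet("penguins.parquet")

# hypothetical directory-writing counterpart; the backend is free to emit
# multiple files (and, optionally, hive-style partitions) under the path
t.to_parquet_dir("penguins_dir/")
t.to_parquet_dir("penguins_by_year/", partition_by=["year"])
```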
Right now I'm leaning towards 4. An extra flag seems fine to me, and I like it better than adding another top-level method per file format just to support partitioned writing. cc'ing @gforsyth for a 2nd opinion though, since he may remember what conclusions we came to last time this came up.
Originally posted by @jcrist in https://github.com/ibis-project/ibis/issues/6615#issuecomment-1629759554
Additionally, I've documented (although this is now at least partly out of date) how a few systems handle various parquet partitioning schemes: https://gist.github.com/gforsyth/8dd4ca981b2beed6ef4db80f5e8afbfd
Opening this issue so we have something to track that isn't a comment in a closed PR.
Probably need to write out a taxonomy of what parquet functionality is supported by each backend natively
> Right now I'm leaning towards 4.

I agree this seems reasonable. However, taking the PySpark example, it would be weird if user behavior ends up being driven by the backend. It sounds like a PySpark user would often want to specify `directory=True` to leverage the native path, while locally (e.g. using DuckDB) they may want to leave `directory=False`. It feels smoother that `directory=False` on PySpark would at least work, but the impression a user would have is that they're choosing the output format, when in reality they're also choosing between a native and a pyarrow-based path.
During triage earlier today, we collectively decided that option 5 makes the most sense. By having two separate APIs, we surface the directory option more obviously to users.
The directory option only applies for the write path; for reading, we will still maintain a single API. Furthermore, we will support directory vs. single file options, but not get into more backend-specific behavior (like glob handling); that will still be delegated to the backend.
Finally, it came up that at some point we may need to better support cloud I/O (e.g. via `fsspec`), but this will be put off until we get more users asking for it.
On the topic of unifying behavior across backends, is it weird if some backends require that, e.g., `read_csv()` point to a directory path while other backends can read individual csv files?
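For example, duckdb's `read_csv` takes a single file path and (I believe) also accepts glob patterns, which covers the directory-of-files case, while a Spark-style backend typically expects a directory of part files. A quick sketch with hypothetical paths:

```python
import ibis

con = ibis.duckdb.connect()

# a single csv file
t1 = con.read_csv("results.csv")

# glob over a directory of csv files (duckdb expands the pattern natively)
t2 = con.read_csv("results_dir/*.csv")
```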
I'm looking into this for duckdb, which is the implementation that's missing, since the pyspark case was covered in https://github.com/ibis-project/ibis/pull/9272.
I noticed that for the case of writing hive partitions, if you do something like
```python
>>> penguins = ibis.examples.penguins.fetch()
>>> con = ibis.get_backend(penguins)
>>> con.to_parquet(penguins, "my_dir", partition_by="year")
```
that will create a directory called `my_dir` in the current location, with partition subdirectories inside it (in this case `year=2007`, `year=2008`, `year=2009`), each containing its respective parquet file.
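For illustration, the resulting layout can be inspected like this (assuming the call above succeeded; the file names inside each partition directory are backend-generated):

```python
from pathlib import Path

# walk the partitioned output and print each parquet file
for p in sorted(Path("my_dir").rglob("*.parquet")):
    print(p)
# my_dir/year=2007/<generated-name>.parquet
# my_dir/year=2008/<generated-name>.parquet
# my_dir/year=2009/<generated-name>.parquet
```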
So for hive partitions we are, in a way, already supporting writing to a directory.
Then the question is: do we want a `to_parquet_dir` method (based on this comment https://github.com/ibis-project/ibis/issues/8584#issuecomment-2057992081) just to cover the case where you want to do `con.to_parquet(penguins, "some_dir/myfile.parquet")` (which currently throws an IO Error), instead of modifying the existing `to_parquet()` functionality?
Since we have `to_parquet_dir`, can we use that? Let's avoid stuffing directory support into duckdb's `to_parquet` if we can.
> Since we have to_parquet_dir, can we use that?

We have it only for pyspark (see https://github.com/ibis-project/ibis/blob/fd61f2c34eafcc50d7a8de652222554b45bd2ef6/ibis/backends/pyspark/__init__.py#L1291), but we need to implement it for duckdb.
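To make the discussion concrete, here is a rough sketch of what a duckdb `to_parquet_dir` could look like, built on DuckDB's `COPY ... TO` with `FORMAT PARQUET` (which writes a directory when `PARTITION_BY` or `PER_THREAD_OUTPUT` is used). The method name matches the pyspark one above, but everything else here (the use of `self.compile` and `self.con`, the keyword names) is illustrative rather than a description of the actual backend internals:

```python
def to_parquet_dir(self, expr, path, *, partition_by=None, **kwargs):
    """Write ``expr`` as a directory of parquet files via DuckDB's COPY."""
    query = self.compile(expr)  # SQL text for the expression (illustrative)
    options = ["FORMAT PARQUET"]
    if partition_by is not None:
        cols = ", ".join(
            [partition_by] if isinstance(partition_by, str) else partition_by
        )
        options.append(f"PARTITION_BY ({cols})")
    else:
        # without partitioning, still write multiple files under ``path``
        options.append("PER_THREAD_OUTPUT TRUE")
    self.con.execute(f"COPY ({query}) TO '{path}' ({', '.join(options)})")
```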
I just want to make sure we want to implement it just for the `con.to_parquet(penguins, "some_dir/myfile.parquet")` case. My concern is that hive partitions are already covered by the regular `to_parquet`, which means we would have writing to directories supported via `to_parquet` for hive partitions, but via `to_parquet_dir` for single files in a directory.
If a user wants to pass kwargs to `to_parquet` and get hive-partitioned output, that's fine; we don't need to prevent that behavior. But we could expose those kwargs explicitly in `to_parquet_dir` to make it more obvious that it's an option.
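As a sketch of what exposing those kwargs explicitly might look like (the `to_parquet_dir` signature here is hypothetical; `partition_by` mirrors the existing `to_parquet` keyword, and `con`/`penguins` are as in the example above):

```python
# directory of parquet files, no partitioning
con.to_parquet_dir(penguins, "penguins_dir/")

# hive-partitioned layout, with the partition columns spelled out explicitly
con.to_parquet_dir(penguins, "penguins_by_year/", partition_by=["year"])
```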
We may use `pyarrow.parquet.read_table` (it can read a single file or a directory) for ibis backends that lack native `read_parquet` support.
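A small illustration of that fallback: `pyarrow.parquet.read_table` accepts either a single file or a directory (including a hive-partitioned one), so one code path would cover both output shapes. Paths are hypothetical:

```python
import pyarrow.parquet as pq

# one parquet file
single = pq.read_table("results.parquet")

# a directory, e.g. the hive-partitioned output written above;
# partition columns like `year` are inferred from the directory names
multi = pq.read_table("my_dir/")
```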
Here is another point to consider: do we want to ensure consistency between `to_parquet` and `read_parquet`?