feat(parquet): figure out convention for multi-file parquet writing
Hmmm, I remember us talking about this before (sorry, this issue had slipped my mind when we talked about this last week).
Currently all our `to_csv`/`to_parquet` writers produce a single file (rather than a directory of files). There was an open question as to what the expected behavior should be for backends where single-file output is tricky/inefficient/impossible (backends like Spark or Dask).
Enumerating all the options I can think of:
1. `to_csv`/`to_parquet` always output a single file

We could fall back to pyarrow, or error, for backends where this is tricky. This is the current behavior.
2. `to_csv`/`to_parquet` always output a directory

Backends that only write a single file would instead write a directory containing a single file. This would be a breaking change, but it should work with any backend's native csv/parquet writer. It would, however, make the common case of writing a small csv of results trickier.
3. `to_csv`/`to_parquet` write in the backend-native style

The output style (single file or directory) depends on the backend. This is easy to implement, but different backends would produce different behaviors.
4. `to_csv`/`to_parquet` gain an option to write a directory

These methods would write a single file by default, but would have an option to write a directory instead. :shrug: whether this is a `directory=True` option, or whether it's inferred somehow from the input path. This would mean that users on backends that can only efficiently write to directories may need to opt into this support, but it would make the output type explicit.
t.to_csv("test.csv") # would error for the spark backend? Or fallback to pyarrow?
t.to_csv("test/", directory=True) # output a directory can use pyspark native behavior
5. New methods for writing directories

`to_csv`/`to_parquet` keep their existing behavior, and we add new `to_csv_dir`/`to_parquet_dir` (or much better named) methods for writing directories. Same caveats/questions as 4; a rough sketch of this shape is below.
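To make option 5 concrete, here is a minimal sketch of the user-facing shape it implies. The `to_parquet_dir` name and the `partition_by` keyword are placeholders taken from this discussion, not a settled API:

```python
import ibis

t = ibis.examples.penguins.fetch()

# existing behavior: a single parquet file
t.to_parquet("penguins.parquet")

# hypothetical directory-writing counterpart; the backend is free to emit
# multiple files (and, optionally, hive-style partitions) under the path
t.to_parquet_dir("penguins_dir/")
t.to_parquet_dir("penguins_by_year/", partition_by=["year"])
```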
Right now I'm leaning towards 4. An extra flag seems fine to me, and I like it better than adding another top-level method per file format just to support partitioned writing. cc'ing @gforsyth for a 2nd opinion though, since he may remember what conclusions we came to last time this came up.
Originally posted by @jcrist in https://github.com/ibis-project/ibis/issues/6615#issuecomment-1629759554
Additionally, I've documented (although this is now at least partly out of date) how a few systems handle various parquet partitioning schemes: https://gist.github.com/gforsyth/8dd4ca981b2beed6ef4db80f5e8afbfd
Opening this issue so we have something to track that isn't a comment in a closed PR.
Probably need to write out a taxonomy of what parquet functionality is supported by each backend natively
> Right now I'm leaning towards 4.

I agree this seems reasonable. However, taking the PySpark example, it would be weird if user behavior ends up being driven by the backend. It sounds like a PySpark user would often want to specify `directory=True` to leverage the native path, while locally (e.g. using DuckDB) they may want to leave `directory=False`. It feels smoother that `directory=False` on PySpark would at least work, but the impression a user would have is that they're choosing the output format, when in reality they're also choosing between a native and a pyarrow-based path.
During triage earlier today, we collectively decided that option 5 makes the most sense. By having two separate APIs, we surface the directory option more obviously to users.
The directory option only applies for the write path; for reading, we will still maintain a single API. Furthermore, we will support directory vs. single file options, but not get into more backend-specific behavior (like glob handling); that will still be delegated to the backend.
Finally, it came up that at some point we may need to better support cloud I/O (e.g. via `fsspec`), but this will be put off until we get more users asking for it.
On the topic of unifying behavior across backends, is it weird if some backends require that, e.g., `read_csv()` point to a directory path while other backends can read individual csv files?
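For example, duckdb's `read_csv` takes a single file path and (I believe) also accepts glob patterns, which covers the directory-of-files case, while a Spark-style backend typically expects a directory of part files. A quick sketch with hypothetical paths:

```python
import ibis

con = ibis.duckdb.connect()

# a single csv file
t1 = con.read_csv("results.csv")

# glob over a directory of csv files (duckdb expands the pattern natively)
t2 = con.read_csv("results_dir/*.csv")
```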
I'm looking into this for duckdb, which is the implementation that's missing, since the pyspark case was covered in https://github.com/ibis-project/ibis/pull/9272.
I noticed that for the case of writing hive partitions, if you do something like
```python
>>> penguins = ibis.examples.penguins.fetch()
>>> con = ibis.get_backend(penguins)
>>> con.to_parquet(penguins, "my_dir", partition_by="year")
```
that will create a directory called `my_dir` in the current location, with partition subdirectories inside it (in this case `year=2007`, `year=2008`, `year=2009`), each containing its respective parquet file.
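For illustration, the resulting layout can be inspected like this (assuming the call above succeeded; the file names inside each partition directory are backend-generated):

```python
from pathlib import Path

# walk the partitioned output and print each parquet file
for p in sorted(Path("my_dir").rglob("*.parquet")):
    print(p)
# my_dir/year=2007/<generated-name>.parquet
# my_dir/year=2008/<generated-name>.parquet
# my_dir/year=2009/<generated-name>.parquet
```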
So for hive partitions we are, in a way, already supporting writing to a directory.
Then the question is: do we want a `to_parquet_dir` method (based on this comment https://github.com/ibis-project/ibis/issues/8584#issuecomment-2057992081) just to cover the case where you want to do `con.to_parquet(penguins, "some_dir/myfile.parquet")` (which currently throws an IO Error), instead of modifying the existing `to_parquet()` functionality?
Since we have `to_parquet_dir`, can we use that? Let's avoid stuffing directory support into duckdb's `to_parquet` if we can.
> Since we have to_parquet_dir, can we use that?

We have it only for pyspark (see https://github.com/ibis-project/ibis/blob/fd61f2c34eafcc50d7a8de652222554b45bd2ef6/ibis/backends/pyspark/__init__.py#L1291), but we need to implement it for duckdb.
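To make the discussion concrete, here is a rough sketch of what a duckdb `to_parquet_dir` could look like, built on DuckDB's `COPY ... TO` with `FORMAT PARQUET` (which writes a directory when `PARTITION_BY` or `PER_THREAD_OUTPUT` is used). The method name matches the pyspark one above, but everything else here (the use of `self.compile` and `self.con`, the keyword names) is illustrative rather than a description of the actual backend internals:

```python
def to_parquet_dir(self, expr, path, *, partition_by=None, **kwargs):
    """Write ``expr`` as a directory of parquet files via DuckDB's COPY."""
    query = self.compile(expr)  # SQL text for the expression (illustrative)
    options = ["FORMAT PARQUET"]
    if partition_by is not None:
        cols = ", ".join(
            [partition_by] if isinstance(partition_by, str) else partition_by
        )
        options.append(f"PARTITION_BY ({cols})")
    else:
        # without partitioning, still write multiple files under ``path``
        options.append("PER_THREAD_OUTPUT TRUE")
    self.con.execute(f"COPY ({query}) TO '{path}' ({', '.join(options)})")
```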
I just want to make sure we want to implement it just for the `con.to_parquet(penguins, "some_dir/myfile.parquet")` case. My concern is that hive partitions are already covered by the regular `to_parquet`, which means we would have writing to directories supported via `to_parquet` for hive partitions, but via `to_parquet_dir` for single files in a directory.
If a user wants to pass kwargs to `to_parquet` and get hive-partitioned output, that's fine; we don't need to prevent that behavior. But we could expose those kwargs explicitly in `to_parquet_dir` to make it more obvious that it's an option.
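As a sketch of what exposing those kwargs explicitly might look like (the `to_parquet_dir` signature here is hypothetical; `partition_by` mirrors the existing `to_parquet` keyword, and `con`/`penguins` are as in the example above):

```python
# directory of parquet files, no partitioning
con.to_parquet_dir(penguins, "penguins_dir/")

# hive-partitioned layout, with the partition columns spelled out explicitly
con.to_parquet_dir(penguins, "penguins_by_year/", partition_by=["year"])
```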
We may use `pyarrow.parquet.read_table` (it can read a single file or a directory) for ibis backends that lack native `read_parquet` support.
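A small illustration of that fallback: `pyarrow.parquet.read_table` accepts either a single file or a directory (including a hive-partitioned one), so one code path would cover both output shapes. Paths are hypothetical:

```python
import pyarrow.parquet as pq

# one parquet file
single = pq.read_table("results.parquet")

# a directory, e.g. the hive-partitioned output written above;
# partition columns like `year` are inferred from the directory names
multi = pq.read_table("my_dir/")
```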
Here is another point to consider: do we want to ensure consistency between `to_parquet` and `read_parquet`?