Should Xarray have a read_csv method?

Open jhamman opened this issue 3 years ago • 4 comments

Is your feature request related to a problem?

Most users of Xarray/Pandas start with an IO call of some sort. In Xarray, open_dataset(..., engine=engine) provides an extensible interface to more complex backends (NetCDF, Zarr, GRIB, etc.). For tabular data types, we have traditionally pointed users to Pandas. While this works for users who are comfortable with Pandas, it is an added hurdle for users getting started with Xarray.

Describe the solution you'd like

It should be easy and obvious how a user can get a CSV (or other tabular data) into Xarray. Ideally, we don't force the user to use a third-party library.

Describe alternatives you've considered

I can think of three possible solutions:

  1. We expose a new function read_csv; it may do something like this:

         import pandas as pd
         import xarray as xr

         def read_csv(filepath_or_buffer, **kwargs):
             # Parse with pandas, then convert the DataFrame to a Dataset.
             df = pd.read_csv(filepath_or_buffer, **kwargs)
             ds = xr.Dataset.from_dataframe(df)
             return ds

  2. We develop a storage backend to support reading CSV-like data:

         ds = xr.open_dataset(filepath, engine='csv')

  3. We copy (1) as an example and put it in Xarray's documentation, explicitly showing how you would use Pandas to produce a Dataset from a CSV.
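
As a rough illustration of option (1), the sketch below runs such a helper on a small in-memory CSV. The column names and values are made up for the example, and passing index_col is an assumption about typical usage: the DataFrame index becomes the Dataset dimension, so without it the dimension would be the default integer index.

```python
import io

import pandas as pd
import xarray as xr


def read_csv(filepath_or_buffer, **kwargs):
    """Hypothetical helper: read a CSV with pandas and wrap it in a Dataset."""
    df = pd.read_csv(filepath_or_buffer, **kwargs)
    return xr.Dataset.from_dataframe(df)


# Made-up sample data, for illustration only.
csv_text = (
    "time,temperature,pressure\n"
    "0,280.5,1000.0\n"
    "1,281.0,998.5\n"
    "2,279.8,1001.2\n"
)

# index_col makes "time" the dimension of the resulting Dataset.
ds = read_csv(io.StringIO(csv_text), index_col="time")
print(ds)
```

Each remaining column becomes a data variable along the "time" dimension.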

jhamman, Sep 22 '22 21:09

> Ideally, we don't force the user to use a third party library.

Option (2) would make more sense once we (hopefully) make pandas an optional dependency too.

I think (1) and (3) are complementary - we already have pseudocode implementations of open_mfdataset in the docs for example.

(2) seems like overkill unless there is specific functionality you are imagining this enabling?

TomNicholas, Sep 23 '22 14:09

(1) and (3) sound good to me.

For (2), one thing that comes to mind is that all the decode_* options don't make any sense for csv files?

dcherian, Sep 28 '22 16:09

Agree with (1) or (3). I do (1) a lot, no harm in adding it to xarray.

I could also imagine (2) with options for backends (e.g. pandas as one option). But I would vote against developing our own.
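
For reference, a minimal sketch of what option (2) with pandas as the parsing backend might look like, using Xarray's BackendEntrypoint plugin interface. The class name CsvBackendEntrypoint and the read_csv_kwargs parameter are hypothetical, not an existing Xarray API. The decode_* options are simply not declared, since they have no meaning for CSV.

```python
import os
import tempfile

import pandas as pd
import xarray as xr
from xarray.backends import BackendEntrypoint


class CsvBackendEntrypoint(BackendEntrypoint):
    """Hypothetical CSV backend that delegates parsing to pandas."""

    description = "Open CSV files with pandas (illustrative sketch)"
    # Only these parameters are declared; no decode_* options are accepted.
    open_dataset_parameters = ["filename_or_obj", "drop_variables", "read_csv_kwargs"]

    def open_dataset(self, filename_or_obj, *, drop_variables=None, read_csv_kwargs=None):
        # read_csv_kwargs is an assumed name for forwarding options to pandas.
        df = pd.read_csv(filename_or_obj, **(read_csv_kwargs or {}))
        ds = xr.Dataset.from_dataframe(df)
        if drop_variables:
            ds = ds.drop_vars(drop_variables, errors="ignore")
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).lower().endswith(".csv")


# Write a made-up CSV to a temporary file and open it through the backend.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    with open(path, "w") as f:
        f.write("time,temperature\n0,280.5\n1,281.0\n")
    # The entrypoint class is passed directly, so no plugin registration is needed.
    ds = xr.open_dataset(
        path,
        engine=CsvBackendEntrypoint,
        read_csv_kwargs={"index_col": "time"},
    )

print(ds)
```

A real backend would presumably be registered under a name such as engine='csv' through the xarray.backends entry point group rather than passed as a class.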

max-sixty, Sep 28 '22 17:09

In another issue (can't find the number), there was discussion of moving away from specialized read/write methods and only using open_dataset (maybe a Dataset.read?) and ds.write with engine arguments. Given that, option (2) would make sense.

However, new users who are familiar with pandas might be more comfortable with an xr.read_csv.

On a side note: there is also dask.dataframe.read_csv.

headtr1ck, Oct 02 '22 07:10

The FAQ's table and CSV section both currently imply that something like option (2) already exists, until you get to the code and find that it's in fact option (3) that's implemented.

I'm only a user, but I see value in option (2) for standardization to the open_dataset method and for the flexibility of choosing pandas or dask backends. Sure, no decode_* arguments are applicable, but no one should expect to do anything with attributes from a file format without attributes.

Personally I'd also favor option (2) because I had a thought to extend it for reading CSV files with bespoke standards used at NASA for headers that actually do provide NetCDF-like attributes (e.g. ICARTT, SeaBASS).
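
To make that idea concrete, here is a minimal sketch of reading attributes from a CSV header. The "# key: value" header convention is invented for this example and is not the ICARTT or SeaBASS format; the function name and sample values are likewise hypothetical.

```python
import io

import pandas as pd
import xarray as xr


def read_csv_with_header_attrs(text, comment="#", **kwargs):
    """Hypothetical reader: '# key: value' header lines become Dataset attrs."""
    attrs = {}
    for line in text.splitlines():
        if not line.startswith(comment):
            break  # the header block ends at the first non-comment line
        key, sep, value = line[len(comment):].partition(":")
        if sep:
            attrs[key.strip()] = value.strip()
    # pandas skips fully commented lines when parsing the table itself.
    df = pd.read_csv(io.StringIO(text), comment=comment, **kwargs)
    ds = xr.Dataset.from_dataframe(df)
    ds.attrs.update(attrs)
    return ds


# Made-up sample in the invented header convention.
sample = (
    "# title: Example cruise\n"
    "# institution: Made-up Lab\n"
    "depth,chlorophyll\n"
    "5,0.31\n"
    "10,0.42\n"
)
ds = read_csv_with_header_attrs(sample, index_col="depth")
print(ds.attrs)
```

A format-specific engine could apply the same pattern with a parser for the real header standard.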

Anyway, good discussion! Hope it doesn't fall too far down the list.

itcarroll, Jun 13 '23 01:06