xarray-sql Distributed Execution on Beam

Open alxmrs opened this issue 1 year ago • 6 comments

Figure out a way to distribute all layers of SQL execution #10 on Apache Beam.

Feb 17 '24 11:02 alxmrs

Dataframes: https://beam.apache.org/documentation/dsls/dataframes/overview/
Xarray: Xarray-Beam

Feb 18 '24 09:02 alxmrs

Beam's DataFrames library supports multi-indexes.

https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html

This alone makes Beam worth exploring sooner rather than later.

Mar 12 '24 13:03 alxmrs

Interesting!

Mar 12 '24 14:03 cisaacstern

Some general thoughts on this issue in no particular order:

I think this would make a good 0.1 release
Users can call xarray_sql.beam.read_xarray()
Beam supports the pandas API with small differences, most of which stem from the fact that PCollections are unordered.
Unknown: is there a from_map-like interface to make implementing this easy?
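For reference, here is a minimal sketch of what a from_map-like pattern could look like, using only NumPy and pandas. The helper names (`block_slices`, `block_to_frame`) are hypothetical and not part of xarray-sql or Beam. The idea is to enumerate the chunks of a regular grid as dicts of slices, then map each one to a long-format DataFrame independently. In Beam, the same shape could presumably be expressed as `beam.Create(blocks) | beam.Map(...)`, though that is untested here.

```python
import itertools

import numpy as np
import pandas as pd


def block_slices(shape, chunks):
    """Yield one dict of dim -> slice for every chunk of a regular grid."""
    ranges = {
        dim: [
            slice(start, min(start + chunks[dim], size))
            for start in range(0, size, chunks[dim])
        ]
        for dim, size in shape.items()
    }
    for combo in itertools.product(*ranges.values()):
        yield dict(zip(ranges.keys(), combo))


def block_to_frame(data, dims, block):
    """Flatten one block of an N-d array into a long-format DataFrame."""
    # Index rows by their global integer position along each dimension.
    index = pd.MultiIndex.from_product(
        [range(block[d].start, block[d].stop) for d in dims],
        names=list(dims),
    )
    values = data[tuple(block[d] for d in dims)].ravel()
    return pd.DataFrame({"value": values}, index=index)


# Toy stand-in for a dataset variable: 4 time steps x 3 latitudes.
dims = ("time", "lat")
data = np.arange(12).reshape(4, 3)

blocks = list(block_slices(dict(zip(dims, data.shape)), {"time": 2, "lat": 3}))

# The "map" step: each block is converted on its own, so the work could be
# handed to any distributed runner that can map a function over a collection.
frames = [block_to_frame(data, dims, block) for block in blocks]
```

Because each block carries its own slices, no step depends on the order in which blocks are processed, which matters given that PCollections are unordered.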

Mar 12 '24 15:03 alxmrs

This may not be feasible after all. It looks like HDF5 is intentionally not supported because it is a random-access format. I think Xarray would share this characteristic, too.

https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/dataframe/io.html

Maybe this warrants the creation of an xarray-beam-like library for pandas or dask? Can a pd.(Multi)Index mimic an xbeam key?
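A rough sketch of one way to look at the MultiIndex question, assuming the index levels hold global integer positions rather than coordinate labels. An xbeam key records a chunk's offset along each dimension; under that assumption, the same offsets can be recovered from a chunk's MultiIndex as the minimum of each level. The function name `offsets_from_index` is hypothetical.

```python
import pandas as pd


def offsets_from_index(index: pd.MultiIndex) -> dict:
    """Recover per-dimension chunk offsets from a positional MultiIndex.

    Assumes every level stores global integer positions, so the smallest
    value on each level is where the chunk starts along that dimension.
    """
    return {
        name: int(index.get_level_values(name).min())
        for name in index.names
    }


# A chunk covering time positions 2..3 and lat positions 0..2.
index = pd.MultiIndex.from_product(
    [range(2, 4), range(0, 3)], names=["time", "lat"]
)
chunk = pd.DataFrame({"value": range(6)}, index=index)

offsets = offsets_from_index(chunk.index)

# A hashable form of the offsets, usable as the key of a (key, value) pair.
hashable_key = tuple(sorted(offsets.items()))
```

This only covers the offsets half of a key; which variables a chunk holds would have to be tracked separately, for example via the DataFrame's columns.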

Mar 12 '24 18:03 alxmrs

A core question to answer: do we really need random access?

Mar 13 '24 02:03 alxmrs

I think Ray is a better, easier fit (see #68). Closing.

Sep 27 '25 19:09 alxmrs
