xarray-sql icon indicating copy to clipboard operation
xarray-sql copied to clipboard

Distributed Execution on Beam

Open alxmrs opened this issue 1 year ago • 6 comments

Figure out a way to distribute all layers of SQL execution #10 on Apache Beam.

alxmrs avatar Feb 17 '24 11:02 alxmrs

Dataframes: https://beam.apache.org/documentation/dsls/dataframes/overview/ Xarray: Xarray-Beam

alxmrs avatar Feb 18 '24 09:02 alxmrs

Beam's dataframes library supports multi indexes.

https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html

This alone makes beam worthy of an exploration sooner rather than later.

alxmrs avatar Mar 12 '24 13:03 alxmrs

Interesting!

cisaacstern avatar Mar 12 '24 14:03 cisaacstern

Some general thoughts on this issue in no particular order:

  • I think this would make a good 0.1 release
  • Users can call xarray_sql.beam.read_xarray()
  • Beam supports the Pandas API, with small differences, primarily specific to the fact that PCollections are unordered.
  • Unknown: is there a from_map-like interface to make implementing this easy?

alxmrs avatar Mar 12 '24 15:03 alxmrs

This may not be feasible after all. It looks like hdf5 is intentionally not supported because it is a random access format. I think Xarray would follow this characteristic, too.

https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/dataframe/io.html

Maybe this warrants the creation of an xarray-beam-like library for pandas or dask? Can a pd.(multi)index mimic an xbeam key?

alxmrs avatar Mar 12 '24 18:03 alxmrs

A core question to answer: do we really need random access?

alxmrs avatar Mar 13 '24 02:03 alxmrs

I think Ray is a better, easier fit (see #68). Closing.

alxmrs avatar Sep 27 '25 19:09 alxmrs