xarray-sql
xarray-sql copied to clipboard
Distributed Execution on Beam
Figure out a way to distribute all layers of SQL execution #10 on Apache Beam.
Dataframes: https://beam.apache.org/documentation/dsls/dataframes/overview/ Xarray: Xarray-Beam
Beam's dataframes library supports multi indexes.
https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html
This alone makes beam worthy of an exploration sooner rather than later.
Interesting!
Some general thoughts on this issue in no particular order:
- I think this would make a good 0.1 release
- Users can call xarray_sql.beam.read_xarray()
- Beam supports the Pandas API, with small differences, primarily specific to the fact that PCollections are unordered.
- Unknown: is there a
from_map-like interface to make implementing this easy?
This may not be feasible after all. It looks like hdf5 is intentionally not supported because it is a random access format. I think Xarray would follow this characteristic, too.
https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/dataframe/io.html
Maybe this warrants the creation of an xarray-beam-like library for pandas or dask? Can a pd.(multi)index mimic an xbeam key?
A core question to answer: do we really need random access?
I think Ray is a better, easier fit (see #68). Closing.