Zarrquet: Create a reference-based filesystem to read virtual Parquet from Zarr.

alxmrs opened this issue 1 year ago • 26 comments

https://arrow.apache.org/docs/python/filesystems.html

Like Kerchunk, it should be possible to create a virtual parquet dataset from an existing Zarr store. Here, it would also provide a table-like view of a dense array — like the rest of qarray.

Ideally, this could be persisted in a file system without creating extra space (just metadata). Then, one could use tools like Pandas, DuckDB, Apache Arrow, etc., to query raster data as tables, maybe even outside of Python.

My bet here is that the LOC to prototype this is going to be small. Getting it performant is another matter. I wonder if this would be a justification for BigQuery or another similarly high-powered query engine? BQ reads Parquet!
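To make the goal concrete, here is a minimal sketch of the intended end-user experience, assuming a virtual Parquet view already exists on disk; the file name and column names below are made up:

```python
# A minimal sketch of the end-user experience, assuming a virtual Parquet view
# (here a hypothetical file "era5.zarrquet.parquet") already exists.
import duckdb
import pandas as pd

# DuckDB plans from the Parquet metadata and only reads the column chunks it
# needs, which is what makes a metadata-only "virtual" file attractive.
con = duckdb.connect()
df = con.execute(
    """
    SELECT time, lat, lon, AVG(temperature) AS mean_t
    FROM read_parquet('era5.zarrquet.parquet')
    GROUP BY time, lat, lon
    LIMIT 10
    """
).df()

# The same file would be readable by pandas/pyarrow without any Zarr tooling.
df2 = pd.read_parquet('era5.zarrquet.parquet', columns=['time', 'temperature'])
```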

alxmrs avatar Jan 23 '24 07:01 alxmrs

cc: @cisaacstern

alxmrs avatar Jan 23 '24 07:01 alxmrs

I’ve been doing some background reading on parquet. Here are some useful things I’ve learned:

  • Arrow is the in-memory counterpart to the Parquet file format.
  • The project is 11 or so years old and comes from the Hadoop ecosystem (i.e., it is well battle-tested).
  • A core feature is column-oriented chunks: the format groups rows together abstractly (row groups), but stores each column of a set of rows in a logical, if not physical, grouping (column chunks can span one or more files).
    • It seems like Zarr chunks would map onto parquet well iff they can be interpreted as column chunks. If I have the right mental model, I think unravelled Zarr would be compatible with parquet (see the sketch after this list).
  • Martin’s fastparquet is a good reference to understand the file format nitty-gritty: https://github.com/dask/fastparquet
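To make the "unravelled Zarr" idea concrete, here is a small copy-based sketch (not the virtual version) of flattening a dense dataset into a columnar table; the dataset and column names are synthetic:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import xarray as xr

# A toy dense dataset; in a Zarr store, the array would be split into chunks,
# and each chunk would unravel to a contiguous block of rows.
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(4, 3, 2))},
    coords={
        "time": pd.date_range("2024-01-01", periods=4),
        "lat": [10.0, 20.0, 30.0],
        "lon": [100.0, 110.0],
    },
)

# Unravel: coordinates become index columns, each variable becomes a data column.
table = pa.Table.from_pandas(ds.to_dataframe().reset_index())
pq.write_table(table, "unravelled.parquet")
```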

alxmrs avatar Mar 17 '24 08:03 alxmrs

The thrift definition is especially helpful to read.

https://github.com/dask/fastparquet/blob/main/fastparquet/parquet.thrift

Some initial thoughts:

  • Is it possible that a Zarr chunk in a bucket could be the file path of a column chunk? (See the sketch after this list.)
  • If we can make calculating the metadata optional, maybe we could get a slower, but still readable, Zarrquet.
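Here is a sketch of inspecting that metadata with fastparquet, using an ordinary (non-virtual) Parquet file as a stand-in; `example.parquet` is just a placeholder path:

```python
# Poking at the thrift-level metadata with fastparquet to see where the
# per-column-chunk `file_path` field (from parquet.thrift) lives.
from fastparquet import ParquetFile

pf = ParquetFile("example.parquet")
fmd = pf.fmd  # the FileMetaData thrift object

for rg in fmd.row_groups:
    for col in rg.columns:
        # `file_path` is optional in the spec; if it could point at a Zarr
        # chunk object in a bucket, that would be the "zero-copy" hook.
        print(col.file_path, col.meta_data.data_page_offset,
              col.meta_data.total_compressed_size)
```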

alxmrs avatar Mar 17 '24 09:03 alxmrs

Reading the Data Page V1 algorithm in these sources (https://github.com/dask/fastparquet/blob/main/fastparquet/core.py), I think most Zarr data could be represented with Parquet's PLAIN data page encoding.
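A small sketch of why the PLAIN encoding looks like a good fit: for fixed-width numeric types it is just the values laid out back to back in little-endian order, which is also what a decompressed C-order chunk of numbers looks like (the file name here is arbitrary):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

chunk = np.arange(6, dtype="<f8")   # stand-in for a decompressed Zarr chunk
plain_bytes = chunk.tobytes()       # PLAIN encoding of a DOUBLE column is these raw bytes

# Round-trip through a real Parquet file as a sanity check that the values
# survive unchanged when dictionary encoding and compression are disabled.
pq.write_table(pa.table({"x": chunk}), "plain.parquet",
               use_dictionary=False, compression="NONE")
assert pq.read_table("plain.parquet")["x"].to_numpy().tobytes() == plain_bytes
```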

alxmrs avatar Mar 17 '24 09:03 alxmrs

I've just started chatting with @martindurant in a side channel (Martin, I'm happy to bring the discussion here). He has a few thoughts on the matter:

  • First, he's been interested in this kind of feature for a long time (https://github.com/fsspec/kerchunk/issues/66). This doesn't surprise me at all :)
  • Feather is probably a better fit for mapping to Zarr than parquet, but parquet is more ubiquitous for representing tables
  • It is quite possible to represent Zarr as parquet, say, without nested columns or other corner case features.

I'm really happy to have a vote of confidence on this approach from a domain expert!

alxmrs avatar Mar 18 '24 05:03 alxmrs

I've discussed how to surface Zarrquet to databases with Martin a bit more. Here's the best path forward from the discussion:

  • Create a PyArrow or fsspec FileSystem to read Zarrquet in Python (a skeleton follows this list)
    • Test it on DuckDB's Python interface
  • For all other DBs, we'll use this Python client as an adaptor to a Parquet file interface in a proxy server.
    • TBD if this would only act as a local server or something friendly to distributed reads.
    • Good examples to follow (thanks Martin): https://github.com/bluesky/tiled, WebHDFS.
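A skeleton of what the first bullet might look like, assuming the hard part (assembling the virtual Parquet bytes) exists elsewhere; `ZarrquetFileSystem` and the `zarrquet://` protocol are hypothetical names:

```python
# A hypothetical fsspec filesystem that serves virtual Parquet bytes,
# registered with DuckDB's Python client. The byte-assembly logic is elided.
import io
import duckdb
import fsspec
from fsspec.spec import AbstractFileSystem


class ZarrquetFileSystem(AbstractFileSystem):
    protocol = "zarrquet"

    def _open(self, path, mode="rb", **kwargs):
        # In the real thing, this would stitch together thrift metadata plus
        # byte ranges fetched lazily from the underlying Zarr store.
        data = self._assemble_virtual_parquet(path)
        return io.BytesIO(data)

    def _assemble_virtual_parquet(self, path):
        raise NotImplementedError("placeholder for the Zarr -> Parquet view")


fsspec.register_implementation("zarrquet", ZarrquetFileSystem)

con = duckdb.connect()
con.register_filesystem(fsspec.filesystem("zarrquet"))
# con.execute("SELECT * FROM read_parquet('zarrquet://bucket/store.zarr') LIMIT 5")
```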

alxmrs avatar Mar 27 '24 05:03 alxmrs

So, hi all, let me jump in here, because I think we are going in the same direction...

In Zarrquet we would store each chunk as a row, and bands as columns, with the xarray data encoded. So the Parquet schema is roughly Zarr-compliant, and the content of the bands is Zarr arrays.

Or, in the virtual approach, you are thinking of creating a "virtual" Parquet file with the contents of a GeoZarr. And then, with BigQuery using an external table (Parquet virtual tables), you end up creating a virtual BigQuery table connected to a virtual Parquet file that is backed by GeoZarr stored in Cloud Storage?

jatorre avatar Apr 09 '24 14:04 jatorre

Yes, this is totally doable: make a python file-like where the buffers are fetched from zarr chunks, and the parquet thrift metadata is made dynamically at runtime (or, optionally, stored elsewhere, which might be the easier way to start).
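A rough sketch of such a file-like object, assuming the thrift footer bytes and an offset-to-chunk mapping have been computed elsewhere (both hypothetical here); only the read path is shown:

```python
import io


class VirtualParquetFile(io.RawIOBase):
    def __init__(self, footer: bytes, chunk_ranges, zarr_store, total_size: int):
        # chunk_ranges: list of (start, stop, zarr_key) covering the data region
        self.footer = footer
        self.chunk_ranges = chunk_ranges
        self.store = zarr_store          # e.g. an fsspec mapper over the Zarr store
        self.size = total_size
        self.pos = 0

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:  # SEEK_END -- parquet readers seek here first to find the footer
            self.pos = self.size + offset
        return self.pos

    def read(self, n=-1):
        start = self.pos
        stop = self.size if n < 0 else start + n
        if start >= self.size - len(self.footer):
            # Request falls in the footer region: serve pre-built thrift bytes.
            off = start - (self.size - len(self.footer))
            out = self.footer[off:off + (stop - start)]
        else:
            # Otherwise map the byte range onto the underlying Zarr chunk(s).
            out = self._read_from_chunks(start, stop)
        self.pos += len(out)
        return out

    def _read_from_chunks(self, start, stop):
        raise NotImplementedError("fetch + decompress the covering Zarr chunks")
```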

BigQuery using an external table

How does one persuade bigquery to read from a python file? We can get pyarrow or even duckdb to do this, but ...

with the xarray data encoded.

Meaning simple things like gzip/zstd compression?

martindurant avatar Apr 09 '24 15:04 martindurant

Maybe you are thinking to host an HTTP server which pipes through virtual parquet bytes?

martindurant avatar Apr 09 '24 15:04 martindurant

you end up creating a virtual BigQuery table connected to a virtual Parquet file that is backed by GeoZarr stored in Cloud Storage?

Yes, this is the end game.

Maybe you are thinking to host an HTTP server which pipes through virtual parquet bytes?

Yes — and the big unknown for me is the cloud architecture question. How do you build a proxy GCS bucket? I was imagining an elastic load balancer would play a key role.

alxmrs avatar Apr 09 '24 15:04 alxmrs

I am thinking of something less virtual; what about the other, non-dynamic way?

In Zarrquet we would store each chunk as a row, and bands as columns, with the xarray data encoded. So the Parquet schema is roughly Zarr-compliant, and the content of the bands is Zarr arrays.

As a static file.

jatorre avatar Apr 09 '24 17:04 jatorre

This idea has a lot of merit. We did something similar over here: https://github.com/google/weather-tools/blob/main/weather_mv/README.md

The best place for the static approach (copying Zarr into Parquet) would be as an output sink for weather-mv.

In fact, this repo was inspired by this weather tool. The goal here is to find a way to provide a similar experience to the one offered by weather-mv bq without doubling storage or waiting for a large ETL job to finish.

alxmrs avatar Apr 10 '24 04:04 alxmrs

So the Parquet schema is roughly Zarr-compliant, and the content of the bands is Zarr arrays.

Wait a second! I re-read this — I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file?

I have begun to think this is possible! It looks like there is room in Parquet’s thrift format to point to Zarr array blobs as the backing data.

I have some notes about this around here: https://github.com/alxmrs/xarray-sql/issues/4#issuecomment-2002371311

alxmrs avatar Apr 10 '24 04:04 alxmrs

Wait a second! I re-read this — I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file?

You can store the thrift serialised objects in file(s) rather than make them on the fly; but no, parquet does not let you point to binary buffers in some other file. It does in theory allow for each "column chunk" (thrift header + repetition/definition + buffer) to be in a different file, but in practice every parquet file except _metadata has a complete thrift footer and whole row groups.
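For reference, the closest existing practice is a multi-file dataset with a consolidated `_metadata` footer, which fastparquet can write and read back; a small sketch (paths and column names made up):

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"band": ["t2m"] * 4, "value": [1.0, 2.0, 3.0, 4.0]})

# file_scheme="hive" writes one or more data files plus _metadata/_common_metadata.
fastparquet.write("dataset_dir", df, file_scheme="hive", partition_on=["band"])

# Readers can plan the whole dataset from _metadata alone, then open the
# individual data files for the row groups they actually need.
pf = fastparquet.ParquetFile("dataset_dir")
print(pf.columns, len(pf.row_groups))
```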

martindurant avatar Apr 10 '24 13:04 martindurant

Also, I was chatting about this yesterday: it is totally possible to store N-D arrays in parquet despite its columnar nature: a 3D array would be list(list(number)), and the required rep/def levels would be easy to generate, as they would be regularly repeating. In V2 pages, the buffer is compressed separately from the rep/defs, so streaming data from chunks would work fine. The question is, do any parquet frameworks expect this sort of data?
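To make the list(list(number)) mapping concrete, a small sketch with pyarrow, where each row holds one 2-D slab of a toy 3-D chunk:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

chunk = np.arange(24, dtype="float64").reshape(2, 3, 4)   # a toy 3-D chunk

# One row per outermost index; two levels of list nesting per row.
nested = pa.array(chunk.tolist(), type=pa.list_(pa.list_(pa.float64())))
pq.write_table(pa.table({"temperature": nested}), "nested.parquet")

# Reading it back, each row is a 3x4 nested list; the regular shape means the
# repetition/definition levels are perfectly periodic, as noted above.
back = pq.read_table("nested.parquet")["temperature"].to_pylist()
assert np.array(back).shape == (2, 3, 4)
```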

martindurant avatar Apr 10 '24 13:04 martindurant

Also worth noting (sorry for the spamming): fastparquet allows you to directly instantiate and serialise thrift objects. I don't believe arrow has any such ability. python-thrift also allows this, but is much slower.

martindurant avatar Apr 10 '24 13:04 martindurant

I am a bit lost here since I thought we would be able to store the arrays as arrow.

Not the same but geoarrow is doing this for vector data. I assumed using nested arrays in arrow was going to be no problem.

jatorre avatar Apr 10 '24 13:04 jatorre

I am a bit lost here since I thought we would be able to store the arrays as arrow.

What do you mean by "arrow" here? Parquet is not arrow; feather2 essentially is.

Not the same but geoarrow is doing this for vector data. I assumed using nested arrays in arrow was going to be no problem.

Yes, the parquet format can store nested data, which would include geo structures (like list of points -> polygon, for each row). But this is presumably not what you have in zarr, right? Zarr does not have a concept of variable-length structures in each array position, except for string, JSON and pickle encoding (i.e., python objects, not what you want)

So: please show me some example zarr data you'd like to view, and I'll see if I can make a python file-like to hand to pyarrow. But I'll be using fastparquet to make the thrift, since I don't believe there's any other way.

martindurant avatar Apr 10 '24 14:04 martindurant

parquet does not let you point to binary buffers in some other file. It does in theory allow for each "column chunk" (thrift header + repetition/definition + buffer) to be in a different file, but in practice every parquet file except _metadata has a complete thrift footer and whole row groups.

Thanks Martin. That saves me from having to discover this first hand! Do you know of any format that would be a good fit, that allows for extra file storage? Feather, Avro, ORC, maybe?

The question is, do any parquet frameworks expect this sort of data?

Yes, this is really interesting! I agree, though, I don't see a use case for it for me in the near term.

I assumed using nested arrays in arrow was going to be no problem.

In a larger context, I see this work as largely experimental. It very well may not pan out as I find more information! Take these plans with a grain of salt.

alxmrs avatar Apr 13 '24 08:04 alxmrs

Do you know of any format that would be a good fit, that allows for extra file storage?

Sorry, no

martindurant avatar Apr 18 '24 19:04 martindurant

Given https://github.com/zarr-developers/zarr-specs/issues/287, it looks like future versions of Zarr will allow an fsspec interface beyond Python. It seems like a proxy server won’t be required.

alxmrs avatar May 26 '24 11:05 alxmrs

Pathway to Postgres integration: https://github.com/paradedb/paradedb/tree/dev/pg_lakehouse#overview

alxmrs avatar Jun 27 '24 21:06 alxmrs

You want postgres to query zarr data chunks through a dynamic, ephemeral parquet intermediate representation? Actually, teaching DuckDB about zarr might be simpler :)

martindurant avatar Jun 28 '24 00:06 martindurant

@TomNicholas I just saw your tweet related to “headless data” and Pangeo. https://x.com/TEGNicholasCode/status/1813299299201294583

I wanted to reply: I think something like “Zarrquet” could be a good fit for the SQL interoperability box for pangeo.

alxmrs avatar Jul 17 '24 18:07 alxmrs

Thanks for tagging me @alxmrs - I think this also relates closely to something @manzt and I were talking about at SciPy. Basically using SQL tools to analyze Zarr stores on a chunk-level

TomNicholas avatar Jul 17 '24 18:07 TomNicholas

but parquet is more ubiquitous

AFAIK the objective of Carto's RaQuet is to bring everything under the Parquet umbrella because of Open Table Formats (DeltaLake, Iceberg), so that users of Databricks and the like can manage their data in one location. OTFs elevate the level of abstraction (with respect to [geo]parquet). Now that "geo" is a new data type in Iceberg v3, I'm expecting DeltaLake to follow suit. Then I'm expecting OTFs will eventually partition/cluster parquet files based on the geo data types for query optimization.

For me, the dream would be to be able to carry out combined raster/vector analysis.

ymoisan avatar May 13 '25 13:05 ymoisan