[Epic] S3 bulk exports (`COPY TO ... S3`)
## Product outcome
Users want to connect their streaming data to their batch data warehouses (Redshift, Snowflake, BigQuery, Databricks, etc.). Long term, we aim to build direct connections to all major batch data warehouses. In the short term, however, all of these systems have well-understood paths for continually consuming data from S3, so they should be reasonably supportable as long as we're able to sink the data to S3.
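As one example of such a well-understood path, here is a hedged sketch of what consuming the exported files could look like on the Snowflake side; the stage, storage integration, table, and prefix names are hypothetical:

```sql
-- Hypothetical Snowflake-side setup: an external stage over the S3 prefix we sink to...
CREATE STAGE mz_exports
  URL = 's3://my-bucket/exports/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = PARQUET);

-- ...and a load into a table (one-off here; Snowpipe can drive this continually).
COPY INTO analytics.my_view_snapshot
  FROM @mz_exports/my_view/
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```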
### Shape of data being SINKed
User interviews have told us that users generally want an up-to-date "snapshot" of their materialized view. The changelog generally has less value to them, beyond using it to keep their external snapshot up to date.
## Milestones
- The initial milestone we have aligned on is a single-shot `COPY TO S3` command (see the sketch below).
- Future milestone options:
  - Build out scheduling support for snapshotting a materialized view or query to S3 on some cadence (benefit: it is specifically what users have expressed wanting; drawback: it doesn't scale).
  - Build out scheduling support for emitting the changelog for some time period to S3 on some cadence, and give users a playbook for merging that changelog into their existing data warehouse snapshot.
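For concreteness, a minimal sketch of what the single-shot command could look like, assuming the `COPY ... TO 's3://...'` form with an AWS connection; the connection name, role ARN, and bucket path are illustrative, not final syntax:

```sql
-- Illustrative only: an AWS connection Materialize can use to write to the bucket.
CREATE CONNECTION aws_conn TO AWS (
    ASSUME ROLE ARN = 'arn:aws:iam::123456789012:role/materialize-s3-export'
);

-- One-shot export of the current snapshot of a materialized view to S3.
COPY my_materialized_view TO 's3://my-bucket/exports/my_materialized_view/'
WITH (
    AWS CONNECTION = aws_conn,
    FORMAT = 'parquet'
);
```

A downstream warehouse can then pick up the files from that prefix on whatever cadence it likes, which is what makes S3 a reasonable common denominator for the warehouses listed above.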
## Discovery
### Decision log
- 15 May 2024. Released into Private Preview.
- 15 March 2024. Slipped the release date to April 5 due to lack of staffing.
- 4 January 2024. Moved work items to GitHub task lists.
- 24 December 2023. Split "private preview" and "public preview" work items. Added cloud#8242 to the "Public Preview Work Items" list.
- 30 November 2023. Added proposed dogfooding opportunity to work items.
- 15 November 2023. Aligned on the initial path forward being a single-shot `COPY TO S3` command, to build the muscle for writing data to S3.
- 9 November 2023. Fleshed out work items and pushed the release date out to January.
- 7 November 2023. Aligned on 12/8 as the target release date for Private Preview.
### Public Preview Work Items
- [ ] Ensure cancellation with Ctrl-C works properly
- [ ] https://github.com/MaterializeInc/cloud/issues/8261
- [ ] https://github.com/MaterializeInc/operational-analytics/issues/4
- [ ] https://github.com/MaterializeInc/cloud/issues/8242
- [ ] https://github.com/MaterializeInc/console/issues/1224
- [ ] https://github.com/MaterializeInc/console/issues/1440
- [ ] copy-to-s3: Tune the file-size and buffer-ratio parameters; add test cases to measure by how much max-file-size is exceeded on large copies
- [ ] https://github.com/MaterializeInc/materialize/issues/26509
### Public Preview bug tracker
- [ ] https://github.com/MaterializeInc/materialize/issues/27101
- [ ] https://github.com/MaterializeInc/materialize/issues/26963
- [ ] https://github.com/MaterializeInc/materialize/issues/27126
### GA Work Items
- [ ] https://github.com/MaterializeInc/materialize/issues/26164
### Private Preview Work Items
- [x] Initial exploration + prototype @moulimukherjee
- [ ] https://github.com/MaterializeInc/materialize/issues/23055
- [ ] https://github.com/MaterializeInc/materialize/pull/23556
- [ ] https://github.com/MaterializeInc/materialize/issues/24217
- [ ] https://github.com/MaterializeInc/materialize/pull/24355
- [ ] https://github.com/MaterializeInc/materialize/pull/24562
- [ ] https://github.com/MaterializeInc/materialize/pull/25042
- [ ] https://github.com/MaterializeInc/materialize/pull/25087
- [ ] https://github.com/MaterializeInc/materialize/pull/25521
- [ ] https://github.com/MaterializeInc/materialize/pull/25574
- [ ] https://github.com/MaterializeInc/materialize/pull/25647
- [ ] https://github.com/MaterializeInc/materialize/pull/25785
- [ ] https://github.com/MaterializeInc/materialize/issues/25857
- [ ] https://github.com/MaterializeInc/materialize/issues/26218
- [ ] https://github.com/MaterializeInc/materialize/issues/25835
- [ ] https://github.com/MaterializeInc/materialize/issues/26403
- [ ] https://github.com/MaterializeInc/materialize/issues/26491
- [ ] https://github.com/MaterializeInc/materialize/issues/26490
- [ ] https://github.com/MaterializeInc/materialize/issues/26533
- [x] storage/copy-to-s3: Implement COPY <obj> TO for views, mat-views, and source objects
- [x] Parquet support
- [ ] Promote AWS connections to Public Preview
- [x] Cleanup TODOs
- [ ] https://github.com/MaterializeInc/materialize/issues/26740
- [x] copy-to-s3: Parquet nested types (map, array, list)
- [ ] https://github.com/MaterializeInc/materialize/issues/26884
### Discussion
We will need to determine whether these are one-shot sinks, streaming sinks, or both, and tackle those as separate work items. The clearest path and likely first step is one-shot sinks, which makes this dependent on the #6997 follow-up.
Linking in a dogfooding opportunity: https://github.com/MaterializeInc/cloud/issues/8197#issuecomment-1835403813
One wrinkle in that opportunity: it requires support for copying arbitrary queries, and I've heard rumors that we're only planning to support copying tables, matviews, and sources directly. @hlburak @moulimukherjee — is that true? Do y'all have more context on how hard it would be to support arbitrary queries?
> we're only planning to support copying tables, matviews, and sources directly
@benesch That's the plan for the first version. I think the idea is to be able to copy out any object in persist, but @moulimukherjee can speak to that better. We have a meeting with @jpepin next week where we can dive into how we'd use this to replace the catalog exporter.
The only wrinkle I see is that we'll need to be able to use arbitrary queries in COPY TO S3 in order to add column metadata and such, e.g. stamping each row with mz_now() and the environment ID.
The examples you give -- mz_now() and environment id -- might be fields we want to incorporate into file naming conventions anyway ... not sure if that's helpful or makes a mess of reconstructing the dataset downstream of S3.
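To make that wrinkle concrete, a hedged sketch of what a query-based export carrying those metadata columns could look like; `mz_now()` and `mz_environment_id()` are the fields mentioned above, while the casts, aliases, connection name, and bucket path are illustrative assumptions:

```sql
-- Illustrative only: export an arbitrary query rather than a bare object,
-- stamping each row with the export timestamp and the environment that produced it.
COPY (
    SELECT
        v.*,
        mz_now()::text        AS exported_at,   -- cast for portability in the output files
        mz_environment_id()   AS environment_id
    FROM my_view AS v
) TO 's3://my-bucket/exports/my_view/'
WITH (
    AWS CONNECTION = aws_conn,
    FORMAT = 'parquet'
);
```

As noted above, the same fields could instead be pushed into the object key / file-naming convention rather than into the data itself.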
I'll add details here later today for another dogfooding use case we've discussed with @matthelm: getting transformed data from our operational data warehouse into S3 and then into Snowflake.