Matt Corley

38 comments by Matt Corley

@kevinjqliu @Fokko Where would something like the Iceberg Spark `create_changelog_view` procedure fit in this roadmap? Is that something that might be tackled as part of the other procedures under table...

@kevinjqliu alas it's not as simple for iceberg because of the need to do field id-based projection to handle schema evolution. Somewhat relatedly: from what I remember, and assuming nothing...

Still, an api like `Table.as_of(snapshot_id/timestamp) -> Snapshot` would be useful, even if reading requires then passing the correct arguments to `Table.scan`. In general it should be easier for pyiceberg users...
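A minimal sketch of the kind of lookup such an `as_of` helper might perform, written in plain Python rather than against pyiceberg. The `Snapshot` dataclass, the `as_of` function, and its keyword arguments are all hypothetical names for illustration. It also simplifies by treating the table's snapshots as a flat list and ignoring branches and the snapshot log.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not the pyiceberg class)."""
    snapshot_id: int
    timestamp_ms: int


def as_of(
    snapshots: Sequence[Snapshot],
    *,
    snapshot_id: Optional[int] = None,
    timestamp_ms: Optional[int] = None,
) -> Snapshot:
    """Resolve a snapshot by id, or the latest one at or before a timestamp."""
    # Require exactly one of the two selectors.
    if (snapshot_id is None) == (timestamp_ms is None):
        raise ValueError("pass exactly one of snapshot_id or timestamp_ms")

    if snapshot_id is not None:
        for snap in snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap
        raise ValueError(f"no snapshot with id {snapshot_id}")

    # Sort by commit time, then find the last snapshot at or before timestamp_ms.
    ordered = sorted(snapshots, key=lambda s: s.timestamp_ms)
    idx = bisect_right([s.timestamp_ms for s in ordered], timestamp_ms)
    if idx == 0:
        raise ValueError(f"no snapshot at or before {timestamp_ms}")
    return ordered[idx - 1]
```

The resolved snapshot's id would then still have to be passed on to the scan, as the comment above notes.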

> Moreover, multiple different snapshots can also be committed between two consecutive metadata json files.

In what situations would that occur? In my (possibly incorrect) mental model of how...

Sounds like the ask here is for similar functionality in duckdb as was implemented in polars scan_iceberg. This also relates to the previously discussed PyArrow Dataset protocol -- not sure if...

To work well with some of the larger data use cases where folks are using PySpark today, I think this would need to play well with pyarrow streaming read/write functionality, so...

This would really help us out, where we use Hadoop catalog for unit testing PySpark code, and are increasingly encountering cases where we want to test code that uses both...

@Fokko We do a setup similar to this for integration tests, but the ability to write faster unit tests that depend only on a temp directory fixture in pytest has...

I think there's still some confusion here, since there are two possible interpretations of "represent extending the API to allow same commit semantics like the java":

- **Interpretation 1:** allow...

I think which blob storage to use in Azure should be a choice for the folks deploying the warehouse and not something that needs to be decided by iceberg sdks...