Unexpected context used in Scan.PlanFiles
Apache Iceberg version
main (development)
Please describe the bug 🐞
Scan.PlanFiles takes a context argument. This creates the expectation that this context is actually used internally for downloading objects from S3. However this is not the case: the context is not propagated through the IO abstraction and the implementation actually uses a stored context (previously stored from Catalog.LoadTable).
This breaks the case where Tables are cached across requests (to avoid hitting catalog and download/parse table metadata on each request), as setting a context with timeout on LoadTable results in getting "context cancelled" errors in any further request.
The reproducer below just uses LoadTable from a separate function and cancels the context with timeout for resource cleanup (as per recommended practice), then PlanFiles fails with "context canceled".
func TestLoadTableWithTimeout(t *testing.T) {
ctx := context.Background()
cat, err := GetCatalog(ctx)
require.NoError(t, err)
tbl, err := loadTableWithTimeout(ctx, cat, "db.test_table")
require.NoError(t, err)
// the following fails, because PlanFiles does not really use passed context
// error: Received unexpected error:
// could not open manifest file: operation error S3: GetObject, context canceled
_, err = tbl.Scan().PlanFiles(ctx)
require.NoError(t, err)
}
func loadTableWithTimeout(ctx context.Context, cat catalog.Catalog, tblName string) (*table.Table, error) {
ctxWithTimeout, cancelFn := context.WithTimeout(ctx, 1*time.Minute)
defer cancelFn()
return cat.LoadTable(ctxWithTimeout, catalog.ToIdentifier(tblName), nil)
}
We use the Glue catalog, but any implementation that uses io.LoadFS ultimately stores the context in IO implementation.
This is a significant issue, and unfortunately, there isn't a simple fix.
The current behavior is problematic: we rely on context in two separate places. Cancelling the context used by the catalog effectively renders the table unusable. On the other hand, cancelling the context used by PlanFiles only gives the illusion of cleaning up resources - it doesn't actually stop any ongoing I/O operations.
This is the crux of the problem. From a user perspective, if the context passed to PlanFiles is cancelled, all I/O operations should stop immediately. To achieve that, we need to ensure that I/O is bound to the closest context—in this case, the one passed to PlanFiles.
We’re faced with two options for addressing the current behavior:
- Decouple the catalog context from later I/O entirely, and shift the responsibility to the PlanFiles context.
- Combine both contexts - one from the catalog and one for
PlanFiles- and propagate them together.
Whichever path we choose, one thing is clear: all I/O operations must respect the context passed to PlanFiles(ctx). To support that, we should stop passing pre-constructed I/O objects and instead construct them ad hoc within PlanFiles, using the provided context
@zeroshade plz take a look on my draft attempt: https://github.com/apache/iceberg-go/pull/452
There is couple minor backward incompatible API changes:
table.New*API now requires ctx and ioFunc instead io objectsFS() io.IOnowFS(ctx context.Context) (io.IO, error) { return t.fsF(ctx) }, this is honestly a biggest change
Thanks for taking a stab at it. I'll take a look at it tomorrow, just wanna give my initial thoughts here:
We’re faced with two options for addressing the current behavior:
Decouple the catalog context from later I/O entirely, and shift the responsibility to the PlanFiles context.
Combine both contexts - one from the catalog and one for PlanFiles - and propagate them together.
My preference would be to decouple the catalog context from later I/O entirely. Ideally, the preference would be for the catalog to not store a context at all.
Agree with this. The context in the catalog should only be used for I/O operations during table or catalog construction. The current PR proposals reflect this.