ducklake icon indicating copy to clipboard operation
ducklake copied to clipboard

NULL shared_ptr dereference internal exception caused by table_stats `table_id` equal to a `view_id`

Open kposborne opened this issue 4 months ago • 5 comments

Got this error when attempting to query the information_schema.tables using ducklake.

There was a row in the ducklake_table_stats table, which had a table_id equal to a view_id in the ducklake_view table. After manually deleting that row from the table_stats table, ducklake worked again. Looks like this is because it assumes any table_id in ducklake_table_stats corresponds to a DuckLakeTableEntry with non-null field_data, but instead in this case it was actually a DuckLakeViewEntry

Stacktrace:

INTERNAL Error: Attempted to dereference shared_ptr that is NULL!

Stack Trace:

/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(+0x8a7446) [0x7ff102dfe446]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(+0x8a74f4) [0x7ff102dfe4f4]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(+0x8a8a41) [0x7ff102dffa41]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(+0xf790e) [0x7ff10264e90e]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(_ZN6duckdb15DuckLakeCatalog20LoadStatsForSnapshotERNS_19DuckLakeTransactionENS_16DuckLakeSnapshotERNS_18DuckLakeCatalogSetE+0x15d) [0x7ff10288e6bd]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(_ZN6duckdb15DuckLakeCatalog19GetStatsForSnapshotERNS_19DuckLakeTransactionENS_16DuckLakeSnapshotE+0xa9) [0x7ff102896929]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(_ZN6duckdb15DuckLakeCatalog13GetTableStatsERNS_19DuckLakeTransactionENS_16DuckLakeSnapshotENS_10TableIndexE+0x22) [0x7ff102896c02]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(_ZN6duckdb15DuckLakeCatalog13GetTableStatsERNS_19DuckLakeTransactionENS_10TableIndexE+0x40) [0x7ff102896ca0]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(_ZN6duckdb18DuckLakeTableEntry13GetTableStatsERNS_19DuckLakeTransactionE+0x5a) [0x7ff102919c8a]
/tmp/.duckdb/extensions/v1.3.2/linux_amd64/ducklake.duckdb_extension(_ZN6duckdb18DuckLakeTableEntry14GetStorageInfoERNS_13ClientContextE+0x31) [0x7ff102919d21]
/app/.venv/lib/python3.12/site-packages/duckdb/duckdb.cpython-312-x86_64-linux-gnu.so(_ZN6duckdb20DuckDBTablesFunctionERNS_13ClientContextERNS_18TableFunctionInputERNS_9DataChunkE+0xa3) [0x7ff1beadaa03]
/app/.venv/lib/python3.12/site-packages/duckdb/duckdb.cpython-312-x86_64-linux-gnu.so(_ZNK6duckdb17PhysicalTableScan7GetDataERNS_16ExecutionContextERNS_9DataChunkERNS_19OperatorSourceInputE+0x4c) [0x7ff1be7254bc]
/app/.venv/lib/python3.12/site-packages/duckdb/duckdb.cpython-312-x86_64-linux-gnu.so(_ZN6duckdb16PipelineExecutor15FetchFromSourceERNS_9DataChunkE+0x7d) [0x7ff1becbec6d]
/app/.venv/lib/python3.12/site-packages/duckdb/duckdb.cpython-312-x86_64-linux-gnu.so(_ZN6duckdb16PipelineExecutor7ExecuteEm+0xb8) [0x7ff1becc94c8]
/app/.venv/lib/python3.12/site-packages/duckdb/duckdb.cpython-312-x86_64-linux-gnu.so(_ZN6duckdb12PipelineTask11ExecuteTaskENS_17TaskExecutionModeE+0xd2) [0x7ff1becc9802]
/app/.venv/lib/python3.12/site-packages/duckdb/duckdb.cpython-312-x86_64-linux-gnu.so(_ZN6duckdb12ExecutorTask7ExecuteENS_17TaskExecutionModeE+0xd6) [0x7ff1becc0796]
/app/.venv/lib/python3.12/site-packages/duckdb/duckdb.cpython-312-x86_64-linux-gnu.so(_ZN6duckdb13TaskScheduler14ExecuteForeverEPSt6atomicIbE+0x142) [0x7ff1becc2ed2]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3) [0x7ff2420a64a3]
/lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x7ff2444471f5]
/lib/x86_64-linux-gnu/libc.so.6(__clone+0x40) [0x7ff2444c6b00]

This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
For more information, see https://duckdb.org/docs/stable/dev/internal_errors

kposborne avatar Jul 26 '25 00:07 kposborne

Thanks for the report!

Identifiers should be unique globally, i.e. if a given id refers to a view it should not also refer to a table. We could improve the error message here but the DuckLake is in an invalid state.

Did you generate this database manually, or was this generated by the DuckLake extension itself? If the latter, could you try to create a reproducer?

Mytherin avatar Jul 28 '25 08:07 Mytherin

Got it, all changes to this DB were through the DuckLake extension (using Postgres). I am not sure how it got into this state or how to reproduce it yet, but if I figure it out I'll make sure to report back.

kposborne avatar Jul 28 '25 17:07 kposborne

Thanks for opening this issue in the DuckLake issue tracker! To resolve this issue, our team needs a reproducible example. This includes:

  • A source code snippet which reproduces the issue.
  • The snippet should be self-contained, i.e., it should contain all imports and should use relative paths instead of hard coded paths (please avoid /Users/JohnDoe/...).
  • A lot of issues can be reproduced with plain SQL code executed in the DuckDB command line client. If you can provide such an example, it greatly simplifies the reproduction process and likely results in a faster fix.
  • If the script needs additional data, please share the data as a CSV, JSON, or Parquet file. Unfortunately, we cannot fix issues that can only be reproduced with a confidential data set. Support contracts allow sharing confidential data with the core DuckDB team under NDA.

For more detailed guidelines on how to create reproducible examples, please visit Stack Overflow's “Minimal, Reproducible Example” page.

duckdblabs-bot avatar Aug 04 '25 07:08 duckdblabs-bot

I work with @kposborne. We still do not have a repro code snippet, but wanted to share some more contextual information based on recent investigation. (TL;DR - we've observed the catalog corruption several times, and it always seems to have been caused by concurrent view creation and insertion into a table leaving behind a faulty table stat entry pointing to the view instead of the table; however, we have not been able to reproduce this for some time now.)

We've observed this issue at least 4 times, and in each case, there was an errand row in the ducklake_table_stats table that had a value for table_id that corresponded to a view_id value in the ducklake_views table. With this faulty row, it seems that no query against the DuckLake would succeed, even a simple SHOW TABLES would immediately result in an error. Depending on the nature of the of the query, the error we get is typically in either the form of the INTERNAL Error: Attempted to dereference shared_ptr that is NULL! shown in the original post, or a Fatal Python error: Segmentation fault error. In each of those instances, removing the faulty row restored normal operations for the DuckLake.

Digging deeper into one of the occurrences, I found that the view involved in the error was linked to an entry in the ducklake_snapshot table via the ducklake_view table's begin_snapshot field:

Row in `ducklake_view`:
┌─────────┬──────────────────────┬────────────────┬──────────────┬───────────┬─────────────────────┬─────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────┐
│ view_id │      view_uuid       │ begin_snapshot │ end_snapshot │ schema_id │      view_name      │ dialect │                                                        sql                                                         │ column_aliases │
│  int64  │         uuid         │     int64      │    int64     │   int64   │       varchar       │ varchar │                                                      varchar                                                       │    varchar     │
├─────────┼──────────────────────┼────────────────┼──────────────┼───────────┼─────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────┤
│  10337  │ f95734b9-fdd4-4c06…  │     22706      │     NULL     │     1     │ read_route_closures │ duckdb  │ SELECT d.* EXCLUDE (_ascend_block_uuid), c.partition_uuid AS _ascend_partition_uuid FROM ascend_metadata_deploym…  │                │
└─────────┴──────────────────────┴────────────────┴──────────────┴───────────┴─────────────────────┴─────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────┘


 Row in `ducklake_snapshot`: 
┌─────────────┬────────────────────────────┬────────────────┬─────────────────┬──────────────┐
│ snapshot_id │       snapshot_time        │ schema_version │ next_catalog_id │ next_file_id │
│    int64    │  timestamp with time zone  │     int64      │      int64      │    int64     │
├─────────────┼────────────────────────────┼────────────────┼─────────────────┼──────────────┤
│    22706    │ 2025-07-25 12:00:54.051-07 │     17575      │      10338      │     7523     │
└─────────────┴────────────────────────────┴────────────────┴─────────────────┴──────────────┘

In the postgres database's logs, I see the following error that happened shortly after the time of the creation of this snapshot, showing a primary key violation on that same snapshot_id value

2025-07-25 19:01:10 UTC:192.168.63.239(58846):asc_a_kn7147_piubpo_wq6px3_e@asc_a_kn7147_piubpo:[29312]:ERROR:  duplicate key value violates unique constraint "ducklake_snapshot_pkey"

2025-07-25 19:01:10 UTC:192.168.63.239(58846):asc_a_kn7147_piubpo_wq6px3_e@asc_a_kn7147_piubpo:[29312]:DETAIL:  Key (snapshot_id)=(22706) already exists.

2025-07-25 19:01:10 UTC:192.168.63.239(58846):asc_a_kn7147_piubpo_wq6px3_e@asc_a_kn7147_piubpo:[29312]:CONTEXT:  COPY ducklake_snapshot, line 1

2025-07-25 19:01:10 UTC:192.168.63.239(58846):asc_a_kn7147_piubpo_wq6px3_e@asc_a_kn7147_piubpo:[29312]:STATEMENT:  COPY "deployment_development"."ducklake_snapshot" FROM STDIN (FORMAT BINARY)

It seems to be a plausible deduction that what happened here is that the view read_route_closures was created around the same time as some other data operation that triggered in snapshot creation (table creation or insertion), resulting in both operations being assigned the same snapshot_id of 22706. The view committed first (snapshot time of 2025-07-25 12:00:54.051-07), followed shortly thereafter by a commit attempt of the other operation that led to the primary key violation in the Postgres database (log timestamp of 2025-07-25 19:01:10 UTC, so roughly 15 seconds later).

Based on the comments on another Issue in the Ducklake repo, it seems likely that an internal retry in the Ducklake code would then kick in and retry the operation with a new snapshot_id value; if, then, somehow this retry logic failed to properly clean up or update the table_stat entry, that could explain how the corruption of the database came about. (Obviously a bunch of assumptions/hypothesizing here)

Nonetheless, seeing that all instances of catalog corruption seemed to involve concurrent creation of a view and addition of table data, I spent some time last week trying to recreate the error above in a standalone script. However, while I am able to reliably trigger the duplicate key value violates unique constraint "ducklake_snapshot_pkey" error that was observed in the Postgres logs, none of these occurrences have so far resulted in a catalog corruption.

Also noteworthy: we have not seen the catalog corruption occur within our application since at least Monday 8/4 despite active attempts to reproduce it. Perhaps the underlying issue has been resolved in the meantime.

h2o1 avatar Aug 12 '25 08:08 h2o1

Thanks for the detailed explanation @h2o1! I have to say this is a difficult one, as it seems it can't be reproduced reliably. Since you posted this issue 3 weeks ago, there was a minor change in making the retry option global (see https://github.com/duckdb/ducklake/pull/340), but i am not sure whether this affects the issue or not. So far we can maybe keep this open and then close it in some weeks if the issue seems to be resolved. FYI v0.3 will be released together with Duckdb 1.4 on the 9th of September

guillesd avatar Aug 13 '25 10:08 guillesd