
Automatic Version update is a breaking change - should either be a major version change, or not be an exception

grochmal opened this issue 3 months ago · 7 comments

What happens?

This is a migration issue. To reproduce it, two clients are required: one on DuckLake 0.2 and another on DuckLake 0.3.

We have been using DuckLake since its release, but the following code in src/storage/ducklake_initializer.cpp is currently breaking our workflow.

			if (version == "0.1") {
				metadata_manager.MigrateV01();
				version = "0.2";
			}
			if (version == "0.2") {
				metadata_manager.MigrateV02();
				version = "0.3";
			}
			if (version == "0.3-dev1") {
				metadata_manager.MigrateV02(true);
				version = "0.3";
			}
			if (version != "0.3") {
				throw NotImplementedException("Only DuckLake versions 0.1, 0.2, 0.3-dev1 and 0.3 are supported");
			}

Having the first v0.3 client that connects to a DuckLake upgrade the lake to the latest metadata version is a good idea. But could the last if issue a warning rather than throw an exception?

How it breaks our workflow: our setup

We use DuckLake as our lake, accessed through three different clients:

  • the duckdb binary (on both Linux and macOS)
  • the Python client through duckdb
  • the R client through duckdb

The releases of each client happened at different times, which made our DuckLake inaccessible from some of our tools. The releases happened as follows:

  • DuckLake 0.3 was released on 2025-09-17
  • The Python client updated on 2025-09-16 (a day earlier, nicely done!)
  • The R client updated on 2025-09-18 (and the binary package only became available today, 2025-09-19)

How it broke the workflow: chronology of the break

One of our developers updated his duckdb binary on the afternoon of 2025-09-17 and connected to the DuckLake. This connection triggered the migration from 0.2 to 0.3. After this connection all our pipelines stopped working because of the code above - specifically because of the previous version of that code, which was:

			if (version != "0.2") {
				throw NotImplementedException("Only DuckLake versions 0.1 and 0.2 are supported");
			}

That is OK-ish. We are using an experimental lake format (DuckLake), so we should expect trouble when we are not on the latest version of the library.

To solve the issue I rushed and rebuilt all our pipelines. But then a bigger issue appeared: all Python and bash pipelines worked, whilst the R pipelines failed.

It took two days and a lot of hacks to get the R pipelines working again - we needed to wait for R's duckdb 1.4.0 to be released, which takes a long time given how slow CRAN is. All because one developer connected with the latest duckdb binary.

Because the releases are not simultaneous, an update by one client can break another client for a long time, since that second client keeps shipping the old version for several days to come.

Metadata migration is hence a breaking change

And it should therefore come with a major version bump. That is one way to see this.


Addendum.

Another way to see it is that the real issue is the exception thrown on a DuckLake version mismatch in:

			if (version != "0.2") {
				throw NotImplementedException("Only DuckLake versions 0.1 and 0.2 are supported");
			}

This piece of code is the only thing that kept the DuckLake 0.2 client from working with the DuckLake 0.3 metadata database. The DuckLakeMetadataManager::MigrateV02 function does not make any breaking changes, so a DuckLake 0.2 client should still be able to use the lake.

Any DuckLake 0.2 client can use a DuckLake 0.3 metadata database, at least for reading. One could perform the check above only on write, and downgrade the exception on version mismatch to a warning.

Either that, or treat the release of a new DuckLake version as a breaking change.

To Reproduce

Time of the occurrence

On 2025-09-18 no R duckdb client could connect to any DuckLake that had been connected to by a CLI or Python client.

This is because the Python and CLI clients had updated the DuckLake to 0.3, but the R client supporting DuckLake 0.3 had not yet been released.
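A minimal sketch of the trigger from Python (the Postgres connection string and S3 path are placeholders mirroring the ones used in the workaround below):

# Sketch: any duckdb >= 1.4.0 client (DuckLake 0.3) migrates the catalog on ATTACH.
# Assumes the ducklake and postgres extensions are installed/autoloaded and that the
# Postgres/S3 credentials are already configured (e.g. via CREATE SECRET).
import duckdb

con = duckdb.connect()
con.execute(
    "ATTACH 'ducklake:postgres:dbname=ducklake_catalog' AS lake "
    "(DATA_PATH 's3://our-bucket/path');"
)
# After this ATTACH, any duckdb <= 1.3.x client fails to attach with:
#   Not implemented Error: Only DuckLake versions 0.1 and 0.2 are supported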

OS:

Linux + MacOS

DuckDB Version:

1.3.3, updated to 1.4.0

DuckLake Version:

0.2, updated to 0.3

DuckDB Client:

CLI + Python + R

Hardware:

x86_64 and arm64

Full Name:

Michal Grochmal

Affiliation:

CoSyne Therapeutics

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • [x] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • [x] Yes, I have

grochmal commented on Sep 19 '25

This is a very fair point @grochmal. We are going to take a look

guillesd commented on Sep 22 '25

Thanks @guillesd, I appreciate that this is a very annoying report given that it is about backward compatibility for a tool (DuckLake) that is only 3 months old. I just hope it eventually gets fixed.

For anyone who reached this GitHub issue with a similar problem, here is the workaround we built at CoSyne.

TL;DR: We copied the required data for each of our pipelines into separate plain duckdb databases. Since each such DB is much smaller than the complete DuckLake, this was feasible for us.

Assumptions about the workaround

Since our DuckLake sits between data collection pipelines and data graphing (experiment) pipelines, we have a rather simple architecture. Namely, the following things are true for us, and these assumptions are what make the workaround described below possible.

  • No pipeline both reads and writes to the DuckLake; a pipeline either writes its output to the DuckLake or reads its inputs from the DuckLake
  • Pipelines that write data to the DuckLake run occasionally. More exactly, when a data source updates we run the pipeline. This almost always involves changes to the data collection pipeline code, so these pipelines are never reproducible.
  • Pipelines that read from the DuckLake are reproducible experiments, and their outputs are tiny amounts of data, often graphs. These are the only pipelines that require freezing of code dependencies (e.g. freezing the duckdb version)

Thanks to these assumptions we could build a workaround for the DuckLake version migration (this GitHub issue). Our data collection pipelines and local development against the DuckLake deal with the version migration issue by forcing all users to always upgrade to the newest version of duckdb. The only data collection pipeline we have written in R can wait a few days for the R duckdb library to be updated on CRAN.

The workaround

We forced our data graphing pipelines not to connect to the DuckLake and rebuilt a DuckLake snapshot functionality using plain duckdb. Since every data graphing pipeline knows which tables it requires, we just copy the required tables into a plain duckdb database. Namely:

# Inside our "freeze" helper: copy this pipeline's tables into a Parquet export.
frozen_path = os.path.join("s3://our-bucket/path", "frozen", f"{uuid.uuid4().hex}")
duckdb_con.execute("USE memory;")
for table in tables:  # the DuckLake tables this pipeline reads
    duckdb_con.execute(f"CREATE TABLE {table} AS SELECT * FROM {constants.DB_NAME}.{table};")
duckdb_con.execute(f"EXPORT DATABASE '{frozen_path}' (FORMAT PARQUET);")
return frozen_path  # so it can be used by the pipeline

We found that no single pipeline requires so much data that we need out-of-core processing, so using :memory: was fine. One could adapt this with a temporary file to get an out-of-core data load, as sketched below.
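A minimal sketch of that adaptation, using an on-disk scratch database (the function name freeze_tables_on_disk and the paths are hypothetical, and the Postgres/S3 credentials are assumed to be configured already):

import os
import tempfile
import uuid

import duckdb


def freeze_tables_on_disk(tables, frozen_root="s3://our-bucket/path/frozen"):
    # Same freeze step as above, but through an on-disk scratch database so large
    # tables spill to disk instead of living entirely in :memory:.
    frozen_path = os.path.join(frozen_root, uuid.uuid4().hex)
    scratch_db = os.path.join(tempfile.mkdtemp(), "scratch.duckdb")
    con = duckdb.connect(scratch_db)
    con.execute("ATTACH 'ducklake:postgres:dbname=ducklake_catalog' AS lake (DATA_PATH 's3://our-bucket/path');")
    for table in tables:
        con.execute(f"CREATE TABLE {table} AS SELECT * FROM lake.{table};")
    con.execute(f"EXPORT DATABASE '{frozen_path}' (FORMAT PARQUET);")
    con.close()
    return frozen_path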

Then during the pipeline execution we perform:

if frozen:  # get from environment variables
    duckdb_con.execute(
        f"""
        IMPORT DATABASE '{frozen}'
        """
    )
else:
    # use the full ducklake
    duckdb_con.execute(
        f"""
        ATTACH 'ducklake:postgres:dbname=ducklake_catalog' AS {constants.DB_NAME} (DATA_PATH 's3://our-bucket/path');
        USE {constants.DB_NAME};
        """
    )

Since DuckLake behaves exactly like a duckdb database, all code works the same against the DuckLake or against the fake ("frozen") lake.

There is a massive limitation in using :memory:. One could do better with temporary files, as sketched above (feel free to adapt this workaround if you need to!). Yet my hope is that by the time we (at CoSyne) hit :memory: limits, DuckLake will have some form of backward schema compatibility, and then we can just use DuckLake snapshots for reproducibility.
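If that happens, pinning a snapshot could look roughly like this (a sketch assuming DuckLake's documented snapshots() catalog function and the AT (VERSION => ...) time-travel clause; some_table is a placeholder):

import duckdb

con = duckdb.connect()
con.execute("ATTACH 'ducklake:postgres:dbname=ducklake_catalog' AS lake (DATA_PATH 's3://our-bucket/path');")

# Record the snapshot id once, when the experiment is defined ...
snapshot_id = con.execute("SELECT max(snapshot_id) FROM lake.snapshots()").fetchone()[0]

# ... and always read that pinned snapshot afterwards.
rows = con.execute(f"SELECT * FROM lake.some_table AT (VERSION => {snapshot_id})").fetchall()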

grochmal commented on Sep 25 '25

Not sure if this is related but when using dbt-duckdb I get

Runtime Error
  Not implemented Error: Only DuckLake versions 0.1 and 0.2 are supported

when attaching DuckLake in the dbt profiles.yml

    lake:
      type: duckdb
      extensions:
        - ducklake
        - postgres
      secrets:
        - type: postgres
          name: pg_metadata
          host: ...
          port: ...
          user: ...
          password: ...
          database: ...
        - type: s3
          provider: config
          key_id: ...
          secret: ...
          region: ...
      attach:
        - path: "ducklake:postgres:"
          alias: my_ducklake
          options:
            meta_secret: pg_metadata
            data_path: s3://bucket-name
      database: my_ducklake

elisevansbbfc commented on Oct 02 '25

It seems unrelated to me. This error does not come from DuckLake (I checked). What could be happening is that you are using DuckLake 0.3 and the dbt adapter does not support it, but I am unsure since I am not super familiar with the dbt-duckdb implementation.

guillesd commented on Oct 03 '25

> Not sure if this is related but when using dbt-duckdb I get
>
>     Runtime Error
>       Not implemented Error: Only DuckLake versions 0.1 and 0.2 are supported
>
> when attaching DuckLake in the dbt profiles.yml

What is the duckdb version you're using?

dbt-duckdb seems to just require duckdb>=1.0.0 in its requirements. If you are using a duckdb lower than 1.4.0 and the lake has ever been used by any duckdb 1.4.0 client, then upgrading to 1.4.0 will solve the issue.

I have not gone deep into dbt-duckdb, but the code seems to decide how to use ducklake based on the credentials you add to profiles.yml. It then builds a dbt environment and uses duckdb (the Python package) to connect with the credentials path directly.

I may be reading the code wrong, but in short: a profiles.yml with the following snippet

>       attach:
>         - path: "ducklake:postgres:"
>           alias: my_ducklake
>           options:
>             meta_secret: pg_metadata
>             data_path: s3://bucket-name

Executes as:

import duckdb

conn = duckdb.connect(":memory:", read_only=False)
conn.execute("INSTALL postgres;")
conn.execute("INSTALL ducklake;")
conn.execute("CREATE SECRET <postgres stuff>;")  # placeholder for the Postgres secret
conn.execute("ATTACH 'ducklake:postgres:' AS my_ducklake;")
conn.execute("USE my_ducklake;")

So TL;DR:

  • This is using the C++ ducklake extension all right, just two layers deep. Hence it is very likely you are suffering from the same problem as this issue. (Very likely, not 100% certain.)
  • Force whatever venv/Docker image/environment you're using to install and use duckdb>=1.4.0; that should temporarily solve your problem.

grochmal commented on Oct 07 '25

Hi, forcing my venv to use duckdb>=1.4.0 works! Thanks and sorry for hijacking this issue lol

elisevansbbfc commented on Oct 07 '25

Hi. We're also affected by this issue. Our use case is:

  • Two larger applications, using DuckDB.NET to read and write data to a shared DuckLake.
  • We're looking at allowing smaller Python notebooks to connect to the DuckLake directly so that analysts can explore and analyse data without going through the main applications.

There is a significant lag between when Python DuckDB releases come out and when DuckDB.NET gets updated. We have managed the migration from DuckLake 0.2 to DuckLake 0.3 for now, but we're just thinking ahead to when the next DuckLake spec comes out.

I've seen the MIGRATE_IF_REQUIRED option when attaching, and this seems like a step in the right direction. We've mitigated the risk of unexpected migrations by always setting MIGRATE_IF_REQUIRED false when we connect to the DuckLake in all of our clients. It's tolerable for us to error and then fall back to an older client if the spec is old, rather than have an unexpected migration occur and potentially lock out our other apps.
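For reference, the attach we use looks roughly like this (a sketch; the connection string and paths are placeholders, and the exact option syntax may differ between DuckLake versions):

import duckdb

con = duckdb.connect()
# MIGRATE_IF_REQUIRED false: error out instead of silently upgrading the catalog.
con.execute(
    "ATTACH 'ducklake:postgres:dbname=ducklake_catalog' AS lake "
    "(DATA_PATH 's3://our-bucket/path', MIGRATE_IF_REQUIRED false);"
)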

Would it make sense to be able to set the default behaviour for MIGRATE_IF_REQUIRED and persist it in the catalog metadata? That way, we can avoid the risk of someone attaching with a newer client and forgetting to set it, causing an unexpected migration.

Secondly, is it possible to roll back a DuckLake migration if needed?

AlphaSheep commented on Nov 27 '25