
DuckDB file keeps growing on data overwrite

Open jhmenke opened this issue 2 years ago • 13 comments

What happens?

I am connecting to DuckDB from Julia using DuckDB.jl

In an initial run, a lot of data is crawled and stored in different tables. On subsequent runs, only the new data is added to the tables. I have verified the table sizes to be correct: the tables have the same number of rows when there is no new data available. However, the DuckDB file size gets bigger on every execution (it almost doubles on the second run, and so on). SHOW TABLES does not show any new tables, and I make sure to delete any views.

I am unsure if this is wrong usage on my end, a bug in DuckDB.jl, or a bug in DuckDB. Here is all the code I use to interact with the database. I have tried updating existing tables with two different functions (df_add_to_table_drop_previous_duplicates and add_to_table_drop_previous_duplicates), but the file keeps growing. The only code interacting with DuckDB I didn't post is the table initialization with CREATE TABLE IF NOT EXISTS.

To Reproduce

using DataFrames, DuckDB, DBInterface, Dates, Random


function generate_random_dataframe(n::Int)
    rng = MersenneTwister(1234)  # Set the random number generator seed for reproducibility
    
    # Generate random float columns
    float_columns = [rand(rng, Float64, n) for _ in 1:100]
    
    # Generate random string columns
    string_columns = [rand(rng, ["foo", "bar", "baz"], n) for _ in 1:2]
        
    # Generate random DateTime column
    start_date = Date(2000, 1, 1)
    date_column = [start_date + Dates.Day(i - 1) for i in 1:n]
    
    # Create the DataFrame
    df = DataFrame()
    df.date = date_column
    df.identifier .= "id1"
    for i in 1:20
        df[!, Symbol("float_$i")] = float_columns[i]
    end
    for i in 1:2
        df[!, Symbol("string_$i")] = string_columns[i]
    end
    
    return df
end

function init_duckdb_at_location(path::String)
    conn = DBInterface.connect(DuckDB.DB, path)
    return conn
end

function df_add_to_table_drop_previous_duplicates(df::DataFrame, conn::DuckDB.DB, table_name::String)
    df_prev = read_table_into_dataframe(conn, table_name)
    println("size prev: ")
    println(size(df_prev))
    df_new = vcat(df_prev, df)
    unique!(df_new, [:identifier, :date], keep=:last)
    println("size new: " )
    println(size(df_new))
    DBInterface.execute(conn, "DELETE FROM $table_name;")
    DuckDB.register_data_frame(conn, df_new, "new_data")
    DBInterface.execute(conn, "INSERT INTO $table_name SELECT * FROM new_data;")
    DuckDB.unregister_data_frame(conn, "new_data")
end

function add_to_table_drop_previous_duplicates(df::DataFrame, conn::DuckDB.DB, table_name::String)
    for col in names(df)
        df[!, col] = replace(df[!, col], NaN => missing)
    end
    dropmissing!(df)

    DuckDB.register_data_frame(conn, df, "new_data")

    print("table size before: ")
    println(size(read_table_into_dataframe(conn, table_name)))

    DBInterface.execute(conn, """
        DELETE FROM $table_name 
        WHERE EXISTS (
        SELECT 1 FROM new_data 
        WHERE $table_name.identifier = new_data.identifier AND $table_name.date = new_data.date
        );    
    """)
    print("table size between: ")
    println(size(read_table_into_dataframe(conn, table_name)))

    DBInterface.execute(conn, "INSERT INTO $table_name SELECT * FROM new_data")
    print("table size after: ")
    println(size(read_table_into_dataframe(conn, table_name)))

    DuckDB.unregister_data_frame(conn, "new_data")
end

function read_table_into_dataframe(db::DuckDB.DB, table_name::String = "data")
    df = DataFrame(DBInterface.execute(db, "SELECT * FROM $table_name"))
    return df
end

function test()
    conn = init_duckdb_at_location("bugdb.db")
    DuckDB.execute(conn, """
        CREATE TABLE IF NOT EXISTS test_data (
            date DATE,
            identifier VARCHAR(255),
            float_1 FLOAT,
            float_2 FLOAT,
            float_3 FLOAT,
            float_4 FLOAT,
            float_5 FLOAT,
            float_6 FLOAT,
            float_7 FLOAT,
            float_8 FLOAT,
            float_9 FLOAT,
            float_10 FLOAT,
            float_11 FLOAT,
            float_12 FLOAT,
            float_13 FLOAT,
            float_14 FLOAT,
            float_15 FLOAT,
            float_16 FLOAT,
            float_17 FLOAT,
            float_18 FLOAT,
            float_19 FLOAT,
            float_20 FLOAT,
            string_1 VARCHAR(255),
            string_2 VARCHAR(255)
        )
    """)
    println(DataFrame(DBInterface.execute(conn, "SHOW TABLES")))


    df = generate_random_dataframe(100000)
    display(first(df, 5))

    # first method:
    df_add_to_table_drop_previous_duplicates(df, conn, "test_data")
    df_add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # size at this point:  17.8 MB
    # for i in 1:10
    #     df_add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # end
    # size after 10 overwrites:  107.2 MB

    # second method:
    # initial size after first fill: 12.3 kB
    # add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # size at this point:  17.8 MB
    # for i in 1:10
    #     add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # end
    # size after 10 overwrites:  107.2 MB
end

test()

OS:

MacOS And Linux

DuckDB Version:

(automatically installed with client)

DuckDB Client:

Julia 0.8.1

Full Name:

Jan-Hendrik Menke

Affiliation:

private project

Have you tried this on the latest master branch?

  • [ ] I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] I agree

jhmenke avatar Jul 15 '23 07:07 jhmenke

Confirmed as happening with DuckDB CLI too

pbowyer avatar Sep 04 '23 17:09 pbowyer

Hey, great that you were able to reproduce this with the CLI, could you please provide the SQL script you ran there that showed the issue? Thanks

hannes avatar Sep 11 '23 10:09 hannes

I was able to confirm this too. It happens if I DELETE a bunch of rows and then attempt to re-insert the same rows into the same tables.

nettofarah avatar Sep 25 '23 21:09 nettofarah

Not sure if it's the same issue, but posting it here.

DuckDB version 0.9.0 (also tried 0.9.1), Python 3.11.5, macOS 13.5.2 Ventura

df.shape #(1166561, 30)

I save the dataframe to a file via the following execute command.

with duckdb.connect(dbfile) as conn:
    conn.execute(f"CREATE OR REPLACE TABLE {df_name} AS SELECT * FROM {df_name}")

DB file sizes across runs with the same df:

  • after the very first run: 211.3 MB
  • after the first re-run: 422.9 MB
  • after further re-runs: 423.1 MB, 422.9 MB, 423.1 MB, 423.6 MB

Creating a new .db file with the same dataframe gives 211.3 MB after the very first run again.

I would expect the db file size to remain constant at 211.3 MB, since the SQL command is a REPLACE. So how is the file size increasing with repetition?

vinaychuri avatar Oct 21 '23 03:10 vinaychuri

I was having this issue with Windows 10, Python 3.10 and DuckDB 0.8. I fixed it by upgrading to DuckDB 0.9.1.

Prussian1870 avatar Nov 06 '23 19:11 Prussian1870

> I was having this issue with Windows 10, Python 3.10 and DuckDB 0.8. I fixed it by upgrading to DuckDB 0.9.1.

I just updated the DuckDB package and reran the script in the original post, but I still get the same behaviour.

jhmenke avatar Nov 07 '23 17:11 jhmenke

I'm running Microsoft Windows [Version 10.0.19045.3570]. Could be a bug with Julia.

Prussian1870 avatar Nov 07 '23 19:11 Prussian1870

This might be related: https://github.com/duckdb/duckdb/discussions/9665, but in that case it appears that upgrading to v0.9.2 fixed it for me (python, windows)

NSchrading avatar Nov 16 '23 20:11 NSchrading

Hey, chiming in here: I'm seeing similar behavior on a project that archives small JSON files in large quantities. I opened a repo with Python code and some sample data to reproduce the issue; details on running the code and its context are there.

In summary, batch-ingesting files appears to lead to a larger DuckDB file than one complete ingest. More batches lead to larger sizes, although the final number of files imported is constant. I have yet to try whether EXPORTing the database and re-importing it leads to better compression.

[chart: resulting DuckDB file size grows with the number of ingestion batches, even though the total number of files imported is constant]

jzavala-gonzalez avatar Jan 25 '24 18:01 jzavala-gonzalez

@jzavala-gonzalez thanks for putting so much effort into reproducing this! We'll look into this.

szarnyasg avatar Jan 25 '24 18:01 szarnyasg

Found a quick workaround that seems to work for "resetting" the file size: Just export and import the database again.

In Python:

import os
import duckdb

# Export database
con = duckdb.connect('my_database.db') # has big file size
con.execute("export database 'my_database';")
con.close()

# Remove database
os.remove('my_database.db')

# Re-import database
con = duckdb.connect('my_database.db')
con.execute("import database 'my_database';")
con.close()

jzavala-gonzalez avatar Jan 25 '24 23:01 jzavala-gonzalez

This still happens with DuckDB 1.0.

skinkie avatar Jun 24 '24 08:06 skinkie

@hannes @szarnyasg

[skinkie@archlinux netex]$ /home/skinkie/Sources/duckdb/build/duckdb delijn-test.duckdb 
v1.0.1-dev1830 1826262171
Enter ".help" for usage hints.
D CREATE TABLE stop_times AS SELECT * FROM read_csv('/tmp/stop_times.txt', delim=',', header=true, auto_detect=true, columns = {"trip_id": "VARCHAR", "arrival_time": "VARCHAR", "departure_time": "VARCHAR", "stop_id": "VARCHAR", "stop_sequence": "INTEGER", "stop_headsign": "VARCHAR", "pickup_type": "INTEGER", "drop_off_type": "INTEGER", "shape_dist_traveled": "FLOAT"});
100% ▕████████████████████████████████████████████████████████████▏ 
D 

[skinkie@archlinux netex]$ ls -l delijn-test.duckdb 
-rw-r--r-- 1 skinkie skinkie 225193984 Jun 24 11:14 delijn-test.duckdb

[skinkie@archlinux netex]$ /home/skinkie/Sources/duckdb/build/duckdb delijn-test.duckdb 
v1.0.1-dev1830 1826262171
Enter ".help" for usage hints.
D DROP TABLE IF EXISTS stop_times;
D CREATE TABLE stop_times AS SELECT * FROM read_csv('/tmp/stop_times.txt', delim=',', header=true, auto_detect=true, columns = {"trip_id": "VARCHAR", "arrival_time": "VARCHAR", "departure_time": "VARCHAR", "stop_id": "VARCHAR", "stop_sequence": "INTEGER", "stop_headsign": "VARCHAR", "pickup_type": "INTEGER", "drop_off_type": "INTEGER", "shape_dist_traveled": "FLOAT"});
D 

[skinkie@archlinux netex]$ ls -l delijn-test.duckdb 
-rw-r--r-- 1 skinkie skinkie 341848064 Jun 24 11:14 delijn-test.duckdb

skinkie avatar Jun 24 '24 09:06 skinkie

@skinkie thanks for the example! Two questions:

  1. did you try running CHECKPOINT after dropping the table?
  2. can you please attach stop_times.txt?

szarnyasg avatar Aug 06 '24 08:08 szarnyasg

@szarnyasg I think we can confirm that this issue is resolved, both with and without CHECKPOINT after dropping the table.

Regarding your questions:

cd /tmp
wget https://gtfs.ovapi.nl/openov-nl/gtfs-openov-nl.zip
unzip gtfs-openov-nl.zip stop_times.txt
(venv) [skinkie@archlinux build]$ rm /tmp/test.duckdb 
(venv) [skinkie@archlinux build]$ ./duckdb /tmp/test.duckdb
v1.0.1-dev3841 bfeac16a91
Enter ".help" for usage hints.
D CREATE TABLE stop_times AS SELECT * FROM read_csv('/tmp/stop_times.txt', delim=',', header=true, auto_detect=true, columns = {"trip_id": "VARCHAR", "stop_sequence": "INTEGER", "stop_id": "VARCHAR", "stop_headsign": "VARCHAR", "arrival_time": "VARCHAR", "departure_time": "VARCHAR", "pickup_type": "INTEGER", "drop_off_type": "INTEGER", "timepoint": "INTEGER", "shape_dist_traveled": "FLOAT", "fare_units_traveled": "FLOAT"});
100% ▕████████████████████████████████████████████████████████████▏ 
D 

(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb 
-rw-r--r-- 1 skinkie skinkie 475017216 Aug  6 13:41 /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ls -l /tmp/stop_times.txt 
-rw-r--r-- 1 skinkie skinkie 901975145 Aug  5 17:07 /tmp/stop_times.txt
(venv) [skinkie@archlinux build]$ ./duckdb /tmp/test.duckdb
v1.0.1-dev3841 bfeac16a91
Enter ".help" for usage hints.
D drop table stop_times;
D checkpoint;
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
D 

(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb 
-rw-r--r-- 1 skinkie skinkie 274432 Aug  6 13:42 /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb 
-rw-r--r-- 1 skinkie skinkie 274432 Aug  6 13:42 /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ./duckdb /tmp/test.duckdb
v1.0.1-dev3841 bfeac16a91
Enter ".help" for usage hints.
D CREATE TABLE stop_times AS SELECT * FROM read_csv('/tmp/stop_times.txt', delim=',', header=true, auto_detect=true, columns = {"trip_id": "VARCHAR", "stop_sequence": "INTEGER", "stop_id": "VARCHAR", "stop_headsign": "VARCHAR", "arrival_time": "VARCHAR", "departure_time": "VARCHAR", "pickup_type": "INTEGER", "drop_off_type": "INTEGER", "timepoint": "INTEGER", "shape_dist_traveled": "FLOAT", "fare_units_traveled": "FLOAT"});
100% ▕████████████████████████████████████████████████████████████▏ 
D 

(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb 
-rw-r--r-- 1 skinkie skinkie 474755072 Aug  6 13:50 /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ./duckdb /tmp/test.duckdb
v1.0.1-dev3841 bfeac16a91
Enter ".help" for usage hints.
D drop table stop_times;
D 

(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb 
-rw-r--r-- 1 skinkie skinkie 274432 Aug  6 13:51 /tmp/test.duckdb

skinkie avatar Aug 06 '24 11:08 skinkie

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] avatar Nov 05 '24 00:11 github-actions[bot]

This issue was closed because it has been stale for 30 days with no activity.

github-actions[bot] avatar Dec 05 '24 00:12 github-actions[bot]