DuckDB file keeps growing on data overwrite
What happens?
I am connecting to DuckDB from Julia using DuckDB.jl
In an initial run, a lot of data is crawled and stored in different tables. On subsequent runs, only the new data is added to the tables. I have verified that the table sizes are correct: the tables have the same number of rows when no new data is available. However, the DuckDB file size gets bigger on every new execution (it almost doubles on the second run, and so on). SHOW TABLES does not show any new tables, and I make sure to delete any views.
I am unsure whether this is wrong usage on my end, a bug in DuckDB.jl, or a bug in DuckDB. Below is all the code I use to interact with the database. I have tried updating existing tables with two different functions (df_add_to_table_drop_previous_duplicates and add_to_table_drop_previous_duplicates), but the file keeps growing. The only DuckDB-related code I did not post is the table initialization with CREATE TABLE IF NOT EXISTS.
To Reproduce
using DataFrames, DuckDB, DBInterface, Dates, Random
function generate_random_dataframe(n::Int)
    rng = MersenneTwister(1234)  # Set the random number generator seed for reproducibility
    # Generate random float columns
    float_columns = [rand(rng, Float64, n) for _ in 1:100]
    # Generate random string columns
    string_columns = [rand(rng, ["foo", "bar", "baz"], n) for _ in 1:2]
    # Generate random DateTime column
    start_date = Date(2000, 1, 1)
    date_column = [start_date + Dates.Day(i - 1) for i in 1:n]
    # Create the DataFrame
    df = DataFrame()
    df.date = date_column
    df.identifier .= "id1"
    for i in 1:20
        df[!, Symbol("float_$i")] = float_columns[i]
    end
    for i in 1:2
        df[!, Symbol("string_$i")] = string_columns[i]
    end
    return df
end
function init_duckdb_at_location(path::String)
    conn = DBInterface.connect(DuckDB.DB, path)
    return conn
end
function df_add_to_table_drop_previous_duplicates(df::DataFrame, conn::DuckDB.DB, table_name::String)
    df_prev = read_table_into_dataframe(conn, table_name)
    println("size prev: ")
    println(size(df_prev))
    df_new = vcat(df_prev, df)
    unique!(df_new, [:identifier, :date], keep=:last)
    println("size new: ")
    println(size(df_new))
    DBInterface.execute(conn, "DELETE FROM $table_name;")
    DuckDB.register_data_frame(conn, df_new, "new_data")
    DBInterface.execute(conn, "INSERT INTO $table_name SELECT * FROM new_data;")
    DuckDB.unregister_data_frame(conn, "new_data")
end
function add_to_table_drop_previous_duplicates(df::DataFrame, conn::DuckDB.DB, table_name::String)
    for col in names(df)
        df[!, col] = replace(df[!, col], NaN => missing)
    end
    dropmissing!(df)
    DuckDB.register_data_frame(conn, df, "new_data")
    print("table size before: ")
    println(size(read_table_into_dataframe(conn, table_name)))
    DBInterface.execute(conn, """
        DELETE FROM $table_name
        WHERE EXISTS (
            SELECT 1 FROM new_data
            WHERE $table_name.identifier = new_data.identifier AND $table_name.date = new_data.date
        );
    """)
    print("table size between: ")
    println(size(read_table_into_dataframe(conn, table_name)))
    DBInterface.execute(conn, "INSERT INTO $table_name SELECT * FROM new_data")
    print("table size after: ")
    println(size(read_table_into_dataframe(conn, table_name)))
    DuckDB.unregister_data_frame(conn, "new_data")
end
function read_table_into_dataframe(db::DuckDB.DB, table_name::String = "data")
    df = DataFrame(DBInterface.execute(db, "SELECT * FROM $table_name"))
    return df
end
function test()
    conn = init_duckdb_at_location("bugdb.db")
    DuckDB.execute(conn, """
        CREATE TABLE IF NOT EXISTS test_data (
            date DATE,
            identifier VARCHAR(255),
            float_1 FLOAT,
            float_2 FLOAT,
            float_3 FLOAT,
            float_4 FLOAT,
            float_5 FLOAT,
            float_6 FLOAT,
            float_7 FLOAT,
            float_8 FLOAT,
            float_9 FLOAT,
            float_10 FLOAT,
            float_11 FLOAT,
            float_12 FLOAT,
            float_13 FLOAT,
            float_14 FLOAT,
            float_15 FLOAT,
            float_16 FLOAT,
            float_17 FLOAT,
            float_18 FLOAT,
            float_19 FLOAT,
            float_20 FLOAT,
            string_1 VARCHAR(255),
            string_2 VARCHAR(255)
        )
    """)
    println(DataFrame(DBInterface.execute(conn, "SHOW TABLES")))
    df = generate_random_dataframe(100000)
    display(first(df, 5))
    # first method:
    df_add_to_table_drop_previous_duplicates(df, conn, "test_data")
    df_add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # size at this point: 17.8 MB
    # for i in 1:10
    #     df_add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # end
    # size after 10 overwrites: 107.2 MB
    # second method:
    # initial size after first fill: 12.3 kB
    # add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # size at this point: 17.8 MB
    # for i in 1:10
    #     add_to_table_drop_previous_duplicates(df, conn, "test_data")
    # end
    # size after 10 overwrites: 107.2 MB
end
test()
OS:
MacOS And Linux
DuckDB Version:
(automatically installed with client)
DuckDB Client:
Julia 0.8.1
Full Name:
Jan-Hendrik Menke
Affiliation:
private project
Have you tried this on the latest master branch?
- [ ] I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] I agree
Confirmed as happening with DuckDB CLI too
Hey, great that you were able to reproduce this with the CLI. Could you please provide the SQL script you ran there that showed the issue? Thanks!
I was able to confirm this too. It happens if I DELETE a bunch of rows and then attempt to re-insert the same rows into the same tables.
Not sure if it's the same issue, but posting it here.
duckdb version '0.9.0' also tried with '0.9.1' Python version 'Python 3.11.5' MacOS 13.5.2 Ventura
df.shape
#(1166561, 30)
I save the dataframe to file via the following execute command.
with duckdb.connect(dbfile) as conn:
    conn.execute(f"CREATE OR REPLACE TABLE {df_name} AS SELECT * FROM {df_name}")
DB file sizes across runs:
- after the very first run: 211.3 MB
- after re-run of the same df: 422.9 MB
- after re-run of the same df: 423.1 MB
- after re-run of the same df: 422.9 MB
- after re-run of the same df: 423.1 MB
- after re-run of the same df: 423.6 MB
Now create a new .db file connection with the same dataframe:
- new DB file size after the very first run: 211.3 MB
I would expect the db file size to remain constant at 211.3 MB, since the SQL command is a REPLACE. So how is the file size increasing with repetition?
I was having this issue with Windows 10, Python 3.10, and DuckDB 0.8. I fixed it by upgrading to DuckDB 0.9.1.
I just updated the DuckDB package and reran the script in the original post, but I still get the same behaviour.
I'm running Microsoft Windows [Version 10.0.19045.3570]. Could be a bug with the Julia client.
This might be related: https://github.com/duckdb/duckdb/discussions/9665, but in that case it appears that upgrading to v0.9.2 fixed it for me (python, windows)
Hey, chiming in here: I'm seeing similar behavior on a project that archives small JSON files in large quantities. I opened a repo with Python code and some sample data to reproduce the issue; details on running the code and on the context are there.
In summary, batch-ingesting files appears to lead to a larger duckdb file size compared to one complete ingest. More batches lead to larger sizes, although the final number of files imported is constant. I have yet to try whether EXPORTing the database and re-importing it leads to better compression.
@jzavala-gonzalez thanks for putting in so much effort to reproducing this! We'll look into this.
Found a quick workaround that seems to work for "resetting" the file size: just export and re-import the database.
In Python:
import os
import duckdb
# Export database
con = duckdb.connect('my_database.db') # has big file size
con.execute("export database 'my_database';")
con.close()
# Remove database
os.remove('my_database.db')
# Re-import database
con = duckdb.connect('my_database.db')
con.execute("import database 'my_database';")
con.close()
This still happens with DuckDB 1.0.
@hannes @szarnyasg
[skinkie@archlinux netex]$ /home/skinkie/Sources/duckdb/build/duckdb delijn-test.duckdb
v1.0.1-dev1830 1826262171
Enter ".help" for usage hints.
D CREATE TABLE stop_times AS SELECT * FROM read_csv('/tmp/stop_times.txt', delim=',', header=true, auto_detect=true, columns = {"trip_id": "VARCHAR", "arrival_time": "VARCHAR", "departure_time": "VARCHAR", "stop_id": "VARCHAR", "stop_sequence": "INTEGER", "stop_headsign": "VARCHAR", "pickup_type": "INTEGER", "drop_off_type": "INTEGER", "shape_dist_traveled": "FLOAT"});
100% ▕████████████████████████████████████████████████████████████▏
D
[skinkie@archlinux netex]$ ls -l delijn-test.duckdb
-rw-r--r-- 1 skinkie skinkie 225193984 Jun 24 11:14 delijn-test.duckdb
[skinkie@archlinux netex]$ /home/skinkie/Sources/duckdb/build/duckdb delijn-test.duckdb
v1.0.1-dev1830 1826262171
Enter ".help" for usage hints.
D DROP TABLE IF EXISTS stop_times;
D CREATE TABLE stop_times AS SELECT * FROM read_csv('/tmp/stop_times.txt', delim=',', header=true, auto_detect=true, columns = {"trip_id": "VARCHAR", "arrival_time": "VARCHAR", "departure_time": "VARCHAR", "stop_id": "VARCHAR", "stop_sequence": "INTEGER", "stop_headsign": "VARCHAR", "pickup_type": "INTEGER", "drop_off_type": "INTEGER", "shape_dist_traveled": "FLOAT"});
D
[skinkie@archlinux netex]$ ls -l delijn-test.duckdb
-rw-r--r-- 1 skinkie skinkie 341848064 Jun 24 11:14 delijn-test.duckdb
@skinkie thanks for the example! Two questions:
- did you try running CHECKPOINT after dropping the table?
- can you please attach stop_times.txt?
@szarnyasg
I think we can confirm that this issue is now resolved, both with and without running CHECKPOINT after dropping the table.
The data can be fetched with:
cd /tmp
wget https://gtfs.ovapi.nl/openov-nl/gtfs-openov-nl.zip
unzip gtfs-openov-nl.zip stop_times.txt
(venv) [skinkie@archlinux build]$ rm /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ./duckdb /tmp/test.duckdb
v1.0.1-dev3841 bfeac16a91
Enter ".help" for usage hints.
D CREATE TABLE stop_times AS SELECT * FROM read_csv('/tmp/stop_times.txt', delim=',', header=true, auto_detect=true, columns = {"trip_id": "VARCHAR", "stop_sequence": "INTEGER", "stop_id": "VARCHAR", "stop_headsign": "VARCHAR", "arrival_time": "VARCHAR", "departure_time": "VARCHAR", "pickup_type": "INTEGER", "drop_off_type": "INTEGER", "timepoint": "INTEGER", "shape_dist_traveled": "FLOAT", "fare_units_traveled": "FLOAT"});
100% ▕████████████████████████████████████████████████████████████▏
D
(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb
-rw-r--r-- 1 skinkie skinkie 475017216 Aug 6 13:41 /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ls -l /tmp/stop_times.txt
-rw-r--r-- 1 skinkie skinkie 901975145 Aug 5 17:07 /tmp/stop_times.txt
(venv) [skinkie@archlinux build]$ ./duckdb /tmp/test.duckdb
v1.0.1-dev3841 bfeac16a91
Enter ".help" for usage hints.
D drop table stop_times;
D checkpoint;
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows │
└─────────┘
D
(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb
-rw-r--r-- 1 skinkie skinkie 274432 Aug 6 13:42 /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb
-rw-r--r-- 1 skinkie skinkie 274432 Aug 6 13:42 /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ./duckdb /tmp/test.duckdb
v1.0.1-dev3841 bfeac16a91
Enter ".help" for usage hints.
D CREATE TABLE stop_times AS SELECT * FROM read_csv('/tmp/stop_times.txt', delim=',', header=true, auto_detect=true, columns = {"trip_id": "VARCHAR", "stop_sequence": "INTEGER", "stop_id": "VARCHAR", "stop_headsign": "VARCHAR", "arrival_time": "VARCHAR", "departure_time": "VARCHAR", "pickup_type": "INTEGER", "drop_off_type": "INTEGER", "timepoint": "INTEGER", "shape_dist_traveled": "FLOAT", "fare_units_traveled": "FLOAT"});
100% ▕████████████████████████████████████████████████████████████▏
D
(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb
-rw-r--r-- 1 skinkie skinkie 474755072 Aug 6 13:50 /tmp/test.duckdb
(venv) [skinkie@archlinux build]$ ./duckdb /tmp/test.duckdb
v1.0.1-dev3841 bfeac16a91
Enter ".help" for usage hints.
D drop table stop_times;
D
(venv) [skinkie@archlinux build]$ ls -l /tmp/test.duckdb
-rw-r--r-- 1 skinkie skinkie 274432 Aug 6 13:51 /tmp/test.duckdb
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stale for 30 days with no activity.