Table size calculation using DATA_LENGTH in information schema is naive and massively overstates the size of tables
DBeaver displays a huge size for Dolt tables (70GB). I tried this query and it reports a huge size too:
SELECT TABLE_NAME AS `Table`,
ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)`
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = "test"
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
However, on disk my Docker volume only takes 113MB.
Example:
CREATE TABLE IF NOT EXISTS tasks (
task_id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(255) NOT NULL,
start_date DATE,
due_date DATE,
status TINYINT NOT NULL,
priority TINYINT NOT NULL,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO tasks(title, start_date, due_date, status, priority, description, created_at)
VALUES('task', NULL, NULL, 1, 2, NULL, '2023-09-07 13:54:13');
INSERT INTO tasks(title, status, priority) SELECT title, status, priority FROM tasks LIMIT 1000000;
SELECT COUNT(1) FROM tasks;
I have approximately 1,000,000 rows.
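(The INSERT ... SELECT above only copies rows that already exist in tasks, so reaching roughly a million rows takes repeated runs; a minimal sketch of that step, assuming it starts from the single seed row:)
-- Each run copies every existing row back into tasks, roughly doubling the
-- table; starting from one seed row, about 20 runs reach ~1,000,000 rows.
INSERT INTO tasks(title, status, priority)
SELECT title, status, priority FROM tasks LIMIT 1000000;
SELECT COUNT(1) FROM tasks;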
We have an existing issue for this:
https://github.com/dolthub/dolt/issues/5202
And there is a bigger project to collect table stats so we can populate these columns:
https://github.com/dolthub/dolt/issues/6161
For now, the best way to see how big your database is would be to check its size on disk directly (for example, by running du -sh on the data directory or checking the size of the Docker volume).
We should look into why you get a number at all when you run the above query. My expectation would be that those columns are NULL and the query would fail.
Hi @nmoreaud, thank you for using Dolt and filing this issue. We'd love to learn more about your use case too. Feel free to email me or swing by our Discord if you'd like to share.
So here is the code for DATA_LENGTH:
https://github.com/dolthub/dolt/blob/cf5516a4ca8712beac8c04043711f1c5bc5f37ae/go/libraries/doltcore/sqle/tables.go#L396
And here is a comment in the code for INDEX_LENGTH saying it is not supported:
https://github.com/dolthub/go-mysql-server/blob/7898f495601614b0b1b42c1038c1ab545a0c2bae/sql/information_schema/information_schema.go#L1752
So, I think what is going on here is that DATA_LENGTH ignores compression and massively overestimates the size of the table: the code just takes the byte size of the columns and multiplies it by the number of rows.
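A quick way to see the overestimate directly is to look at the implied bytes per row; a sketch, assuming TABLE_ROWS is populated for the table:
-- The implied per-row size reflects the schema's column byte sizes (presumably
-- a VARCHAR(255) counted at or near full width), not the bytes actually stored.
SELECT TABLE_NAME,
       TABLE_ROWS,
       DATA_LENGTH,
       DATA_LENGTH / NULLIF(TABLE_ROWS, 0) AS implied_bytes_per_row
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'tasks';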
Another user just reported this. We're discussing whether to simply multiply our current result by a fixed "compression factor" to give a more reasonable estimate.
I had a call with a customer today who was also somewhat alarmed to see a simple table reportedly taking up 10GB. Tools like DBeaver show this information fairly prominently, so it makes sense that customers are concerned when they see such large sizes reported.
@nmoreaud Thanks for reporting this issue. We've just merged a change to how we estimate table size on disk; the estimate should now land in the right ballpark rather than being massively overstated. This change is staged to go out in the next Dolt release, 1.35.13.
Today, we only have the schema information and the number of rows to estimate the table's size on disk, so our accuracy is limited: the data on disk can vary a great deal, particularly when the schema uses variable-length fields such as VARCHAR or TEXT. Longer term, the table statistics project (https://github.com/dolthub/dolt/issues/6161) will enable us to produce much more accurate table size estimates.
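As a hypothetical illustration of why variable-length columns limit accuracy (the tables and values here are made up): the two tables below have identical schemas and row counts, so an estimate based only on the schema and the row count treats them the same, even though the second stores orders of magnitude more bytes per row.
CREATE TABLE short_notes (id INT PRIMARY KEY, note VARCHAR(2000));
INSERT INTO short_notes VALUES (1, 'ok'), (2, 'done');
CREATE TABLE long_notes (id INT PRIMARY KEY, note VARCHAR(2000));
INSERT INTO long_notes VALUES (1, REPEAT('x', 2000)), (2, REPEAT('y', 2000));
-- Same schema, same row count, very different on-disk footprint; an estimate
-- that only sees the schema and the row count cannot tell these apart.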
I'm resolving this issue since the massively overestimated table sizes should be more accurate now, and we'll continue tracking the longer-term work with statistics in the issue linked above. Please let us know if you run into any other issues or have any other feedback for us!
This change has been released in Dolt 1.35.13.