Automatic column summaries of the table component should have an upper row count limit
Describe the bug
I'm using marimo to work with large datasets of about 100 million rows, and the "run cell" button sometimes becomes unresponsive. After some investigation, I found that during these unresponsive periods the kernel is busy running the get_column_summaries function, which is requested by the table component in the frontend.
Calculating summaries for 100 million rows takes a huge amount of time, and the unresponsive behavior will confuse users. I think marimo should impose a hard limit on the column summary feature. Maybe 1 million rows is a good number?
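Here is a rough sketch of the kind of guard I have in mind. The names `MAX_ROWS_FOR_SUMMARIES` and `summaries_or_none` and the `summarize` callback are hypothetical and only illustrate the idea; this is not marimo's actual code.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical cap: 1 million is the number suggested above, not a marimo setting.
MAX_ROWS_FOR_SUMMARIES = 1_000_000


def summaries_or_none(
    num_rows: int,
    columns: Sequence[str],
    summarize: Callable[[str], dict],
    max_rows: int = MAX_ROWS_FOR_SUMMARIES,
) -> Optional[Dict[str, dict]]:
    """Return per-column summaries, or None when the table exceeds max_rows."""
    if num_rows > max_rows:
        # Skip the expensive pass entirely so the kernel stays responsive.
        return None
    # summarize stands in for whatever computes one column's summary.
    return {name: summarize(name) for name in columns}
```

With a guard like this, the frontend would simply get no summaries for very large tables and could render the table without them.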
Alternatively, we could run the column summaries in a background thread, but I think that would be a huge change, and thread-safety issues would be hard to deal with.
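For comparison, a minimal sketch of the background-thread idea using the standard library's `concurrent.futures`. The `submit_summaries` helper and the `summarize` callback are hypothetical names, not marimo APIs, and the sketch deliberately ignores the hard part: `summarize` must be safe to call while the main thread keeps using the same data.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Sequence

# A single worker so that at most one summary job runs at a time.
_executor = ThreadPoolExecutor(max_workers=1)


def submit_summaries(
    columns: Sequence[str],
    summarize: Callable[[str], dict],
) -> "Future[Dict[str, dict]]":
    """Start computing summaries off the calling thread and return a Future."""
    # Copy the column names so later changes by the caller do not affect the job.
    snapshot = list(columns)
    return _executor.submit(
        lambda: {name: summarize(name) for name in snapshot}
    )
```

The caller would check `future.done()` or call `future.result()` later instead of blocking. Note that in CPython, CPU-bound pure-Python work in the worker thread still competes for the GIL, so this alone may not keep the kernel fully responsive.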
Environment
{
  "marimo": "0.7.14",
  "OS": "Linux",
  "OS Version": "5.15.0-52-generic",
  "Processor": "x86_64",
  "Python Version": "3.10.12",
  "Binaries": {
    "Browser": "--",
    "Node": "--"
  },
  "Requirements": {
    "click": "8.1.7",
    "importlib-resources": "missing",
    "jedi": "0.19.1",
    "markdown": "3.6",
    "pymdown-extensions": "10.9",
    "pygments": "2.18.0",
    "tomlkit": "0.13.0",
    "uvicorn": "0.30.3",
    "starlette": "0.38.2",
    "websockets": "12.0",
    "typing-extensions": "4.12.2",
    "ruff": "0.5.5"
  }
}
Code to reproduce
No response
Thanks for reporting, @mutongx. I like your suggestion of limiting column summaries to 1m rows or less. If you don't mind, I'd definitely appreciate a PR.
@akshayka Sure! I will take a look at it.
Thank you!