
Automatic column summaries of the table component should have an upper row count limit

Open mutongx opened this issue 1 year ago • 3 comments

Describe the bug

I'm using marimo to work on large datasets with 100 million rows, and the "run cell" button is sometimes unresponsive. After some investigation, I found that during the unresponsive periods the kernel is busy running the get_column_summaries function, which is requested by the table component in the frontend.

Calculating summaries for 100 million rows takes a huge amount of time, and the unresponsive behavior will confuse users. I think marimo should impose a hard limit on the column summary feature. Maybe 1 million rows is a good number?
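A minimal sketch of the kind of guard I have in mind. The names `MAX_SUMMARY_ROWS` and `summaries_or_none` are made up for illustration and are not part of marimo; the real check would live wherever get_column_summaries is handled.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical cap taken from the "1 million" suggestion; not a marimo setting.
MAX_SUMMARY_ROWS = 1_000_000


def summaries_or_none(
    num_rows: int,
    compute: Callable[[], T],
    max_rows: int = MAX_SUMMARY_ROWS,
) -> Optional[T]:
    """Run `compute` only when the table is small enough; otherwise skip it."""
    if num_rows > max_rows:
        # Too many rows: return nothing so the caller can show the table
        # without summaries instead of blocking the kernel.
        return None
    return compute()


# Small table: the summary is computed as usual.
small = summaries_or_none(3, lambda: {"min": 1, "max": 3})

# Huge table: `compute` is never called, so no time is spent on it.
huge = summaries_or_none(100_000_000, lambda: {"min": 1, "max": 3})
```

Passing the computation as a callable means the expensive work is skipped entirely when the row count is over the limit, rather than being started and then discarded.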

Alternatively, we could run the column summary in a background thread. But I think that would be a huge change, and thread-safety issues would be hard to deal with.
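For reference, a rough sketch of the background-thread idea using only the standard library. `submit_summary` is a hypothetical helper, not marimo code, and the sketch does nothing about thread safety of the underlying data. CPU-bound pure-Python work would also still compete for the GIL, so this is only an illustration of the shape of the change.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict

# A single worker so that at most one summary runs at a time (an assumption).
_executor = ThreadPoolExecutor(max_workers=1)


def submit_summary(compute: Callable[[], Dict[str, float]]) -> Future:
    """Schedule `compute` on the worker thread and return immediately."""
    return _executor.submit(compute)


values = [1.0, 2.0, 3.0]

# The calling thread is free to handle other requests while this runs.
future = submit_summary(lambda: {"mean": sum(values) / len(values)})

# Later, collect the result; the timeout avoids waiting forever.
result = future.result(timeout=5)
```

The caller gets a `Future` back right away, so a request like "run cell" would not have to wait for the summary to finish.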

Environment

{
  "marimo": "0.7.14",
  "OS": "Linux",
  "OS Version": "5.15.0-52-generic",
  "Processor": "x86_64",
  "Python Version": "3.10.12",
  "Binaries": {
    "Browser": "--",
    "Node": "--"
  },
  "Requirements": {
    "click": "8.1.7",
    "importlib-resources": "missing",
    "jedi": "0.19.1",
    "markdown": "3.6",
    "pymdown-extensions": "10.9",
    "pygments": "2.18.0",
    "tomlkit": "0.13.0",
    "uvicorn": "0.30.3",
    "starlette": "0.38.2",
    "websockets": "12.0",
    "typing-extensions": "4.12.2",
    "ruff": "0.5.5"
  }
}

Code to reproduce

No response

mutongx avatar Aug 08 '24 06:08 mutongx

Thanks for reporting, @mutongx. I like your suggestion of limiting column summaries to 1m rows or less. If you don't mind, I'd definitely appreciate a PR.

akshayka avatar Aug 08 '24 06:08 akshayka

@akshayka Sure! I will take a look at it.

mutongx avatar Aug 08 '24 08:08 mutongx

Thank you!

akshayka avatar Aug 08 '24 17:08 akshayka