visidata
visidata copied to clipboard
Summarizing column (describe-sheet) is much slower (+300%) with typed date column
When summarizing all columns (Ctrl+I / describe-sheet) on a large CSV file the computation is much more slower when a date column is typed as date compared basic summarizing.
This regression is not seen with other typed column (integer)
the test is done on a 2M file with 9 columns that can be downloaded here https://github.com/datablist/sample-csv-files/raw/main/files/people/people-2000000.zip
once uncompressed, this will produce people-2000000.csv
here is a first script to describe all columns with only the 1st column typed as integer
#!vd -p
{"sheet": null, "col": null, "row": null, "longname": "open-file", "input": "people-2000000.csv", "keystrokes": "o", "comment": null}
{"sheet": "people-2000000", "col": "Index", "row": "", "longname": "type-int", "input": "", "keystrokes": "#", "comment": "set type of current column to int"}
{"sheet": "people-2000000", "col": "", "row": "", "longname": "describe-sheet", "input": "", "keystrokes": "Shift+I", "comment": "open Describe Sheet with descriptive statistics for all visible columns"}
which takes 22 to run on my computer
$ time vd -b -p describe_i.vdj
saul.pw/VisiData v3.0.2
opening describe_i.vdj as vdj
Support VisiData: https://github.com/sponsors/saulpw
opening people-2000000.csv as csv
set type of current column to int
open Describe Sheet with descriptive statistics for all visible columns
replay complete
real 0m22,657s
user 0m21,951s
sys 0m0,717s
and a second script with only the date type ("Date of birth" column)
#!vd -p
{"sheet": null, "col": null, "row": null, "longname": "open-file", "input": "people-2000000.csv", "keystrokes": "o", "comment": null}
{"sheet": "people-2000000", "col": "Date of birth", "row": "", "longname": "type-date", "input": "", "keystrokes": "@", "comment": "set type of current column to date"}
{"sheet": "people-2000000", "col": "", "row": "", "longname": "describe-sheet", "input": "", "keystrokes": "Shift+I", "comment": "open Describe Sheet with descriptive statistics for all visible columns"}
and this one takes 1min28 seconds to run
describe_d.vdj
saul.pw/VisiData v3.0.2
Support VisiData: https://github.com/sponsors/saulpw
opening describe_d.vdj as vdj
opening people-2000000.csv as csv
set type of current column to date
open Describe Sheet with descriptive statistics for all visible columns
replay complete
real 1m28,849s
user 1m28,088s
sys 0m0,774s
88 seconds is exactly 4 times longer compared to 22 seconds
there is no difference in the computation time between typing the 1st column as integer or leaving it
One can expect that typing a column as a date would not takes 300% longer for summarizing the data
This is tested with visidata v3.0.2 and Python 3.11.4-4 on Ubuntu 23.10