
Suggest performance improvements here

kiedanski opened this issue 2 years ago • 7 comments

Hi, I really like the idea behind visidata and I would love to be able to use it as my main tool for everyday tasks (as an ML Engineer).

Unfortunately, for most datasets, calculating simple statistics about the data takes a lot longer than simply opening an IPython terminal and running the query there. As a consequence, I find myself starting to use it, only to stop after a while because what I wanted to do takes too long in VisiData, even though I know it is quite fast in pure IPython.

Is it possible to imagine that VisiData can achieve the same speed as the IPython console in the future?

kiedanski avatar Aug 02 '21 16:08 kiedanski

Hi @danski3456, this is not a very specific request. It would be useful if you could share some specific examples of where there is a speed difference.

You've already filed https://github.com/saulpw/visidata/issues/1050, which I take as one such example. This has the attention of the VisiData developers, but it would be very helpful if there were a test dataset.

VisiData doesn't claim to be the fastest, but it does claim to be responsive. It provides visual feedback as it calculates.

Having said that, I believe there is interest in giving VisiData great performance. Any help you can provide showing where VisiData takes significantly longer than IPython would be welcome.

frosencrantz avatar Aug 08 '21 23:08 frosencrantz

Thanks for the feedback, @danski3456. VisiData's performance can certainly be improved in many ways, but @frosencrantz is right: we need specific use cases to optimize. We also have to take into account what "a lot" means. A good rule of thumb is that if something takes 10x as long as in IPython, it's probably worth optimizing; 2x as long, maybe not.

Severe performance differences (the 10x category) merit their own separate issues. But I suggest we turn this thread into a collection of lesser performance issues. There may be an easy way to mitigate them, perhaps just a command to add to your visidatarc.

The ideal performance issue would include:

  1. a case where VisiData takes N seconds and IPython takes N/10 seconds
  2. a .vdj commandlog of the actions taken (the last command should be the slow one)
  3. the actual Python code that takes N/10 seconds to do the same (a minimal timing sketch follows this list)
  4. link to an actual dataset to test with
  5. number of rows in the dataset
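
For item 3, a minimal timing harness might look something like the following sketch (the file name, column, and operation are placeholders for whatever the slow command actually does):

import time

import pandas as pd

start = time.perf_counter()
df = pd.read_csv("dataset.csv")        # placeholder file
df = df.sort_values("some_column")     # placeholder for the slow operation
print(f"pandas equivalent: {time.perf_counter() - start:.2f}s for {len(df)} rows")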

Next time I go on an optimization spree, I'll pick some items from this thread. And of course, just seeing these issues and the areas they're in plants the seed to give them some attention. So please, @danski3456 and anyone else, feel free to share use cases. Take a few minutes to fill out the template above and your wish might come true :genie:

Also, continue to file "interface freeze" issues like #1036. Freezing for only a couple of seconds already feels pretty terrible, but if you ever have to kill -9 vd then please do file a separate issue. That's a bug, not a performance issue.

saulpw avatar Sep 16 '21 05:09 saulpw

So I found out about VD earlier today and played with it a bit and it's a really nifty thing for small(ish) datasets, but for bigger ones it starts to falter. Let's take for example: https://www.kaggle.com/kaggle/us-baby-names?select=StateNames.csv

That's a ~150MB dataset. Let's say I load it with vd and want to sort by the "Count" column. Loading itself is quite slow:

$ time vd --batch StateNames.csv
opening StateNames.csv as csv
Success is a poor teacher.
vd --batch StateNames.csv  31.11s user 1.20s system 100% cpu 32.281 total

Compared with pandas:

$ time python -c 'import pandas as pd; df = pd.read_csv("StateNames.csv")'
python -c 'import pandas as pd; df = pd.read_csv("StateNames.csv")'  1.50s user 0.59s system 131% cpu 1.587 total

I recorded an "open file, sort by Count" session and the replay ran like this:

$ time vd --batch -p StateNames.vd
opening StateNames.vd as vd
"StateNames.csv"
opening StateNames.csv as csv
I wonder what they'll do next!
set type of current column to int
sort descending by current column; replace any existing sort criteria 
replay complete
vd --batch -p StateNames.vd  38.82s user 0.96s system 100% cpu 39.773 total

Something similar in pandas:

$ time python -c 'import pandas as pd; df = pd.read_csv("StateNames.csv"); print(df.sort_values("Count").tail(1))'
              Id    Name  Year Gender State  Count
3772924  3772925  Robert  1947      M    NY  10023
python -c   1.95s user 0.57s system 121% cpu 2.073 total

And we are in 20:1 speed-difference territory compared with direct pandas usage. Obviously the interactivity of vd is awesome, but if these use cases could be improved, that would be great.

Edit: I noticed I can do $ vd -f pandas StateNames.csv, which basically makes vd performance on par with pandas (guessing it uses a different backend). Now the question is: can I use the pandas "backend" for new sheets I create (say, when I merge/append multiple parquet sheets into one big one)?

sochotnicky avatar Sep 28 '21 15:09 sochotnicky

I suspect the "append" performance, at least, is impacted by the following part: https://github.com/saulpw/visidata/blob/666e30329878b74678d24c300e1c50ceb7aa0666/visidata/join.py#L262

There the code iterates row by row. With pandas, at least, it is possible to concat whole DataFrames quickly: https://pandas.pydata.org/docs/reference/api/pandas.concat.html

I'm not sure how easy it would be to optimize this case (merging 2+ pandas-backed sheets).
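
To give a feel for the gap being described, here is a rough synthetic micro-benchmark (not VisiData's actual code path, and the sizes are arbitrary) comparing a whole-frame pandas.concat with row-by-row accumulation:

import time

import numpy as np
import pandas as pd

# two synthetic 500k-row frames
n = 500_000
df1 = pd.DataFrame({"a": np.arange(n), "b": np.random.rand(n)})
df2 = pd.DataFrame({"a": np.arange(n), "b": np.random.rand(n)})

# whole-frame concat: one vectorized copy
t0 = time.perf_counter()
combined = pd.concat([df1, df2], ignore_index=True)
print(f"pd.concat:  {time.perf_counter() - t0:.3f}s for {len(combined)} rows")

# row-by-row accumulation, loosely analogous to iterating over every row
t0 = time.perf_counter()
rows = []
for df in (df1, df2):
    for row in df.itertuples(index=False):
        rows.append(row)
slow = pd.DataFrame(rows, columns=df1.columns)
print(f"row-by-row: {time.perf_counter() - t0:.3f}s for {len(slow)} rows")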

sochotnicky avatar Sep 28 '21 17:09 sochotnicky

From experience in previous optimization passes, simple iteration over every row is not that slow. For comparison you can try g" and see how fast that is.

If you are interested in digging further into this, you can turn on profiling with Ctrl+_ (toggle-profile), run the slow command, and then go to the Threads Sheet with Ctrl+T and press Enter on the slow thread (probably the last one) to look at the profiling results.
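
If it's easier to capture numbers outside VisiData, a standalone pure-Python reproduction can also be profiled with the standard-library cProfile and pstats. A sketch, where slow_operation is a placeholder for the pandas equivalent of the slow command:

import cProfile
import pstats

import pandas as pd

def slow_operation():
    # placeholder for the pure-pandas equivalent of the slow VisiData command
    return pd.read_csv("StateNames.csv").sort_values("Count")

cProfile.run("slow_operation()", "slow.prof")
pstats.Stats("slow.prof").sort_stats("cumulative").print_stats(20)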

saulpw avatar Sep 28 '21 17:09 saulpw

I was talking about a merge of roughly 0.5 million rows. To give you an idea, here are the profiling results: [profiler screenshot]

That's 2.5 minutes to merge compared with a few seconds using direct pandas.concat

sochotnicky avatar Sep 28 '21 17:09 sochotnicky

From #1050:

For categorical columns (~100 unique items), the frequency table has acceptable performance (not fast, but not terribly slow). When trying to use the frequency table for numerical attributes, however, the performance degrades to the point that I need to close the application. Running value_counts using pandas from an IPython terminal is almost instantaneous.
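
For reference, the IPython comparison described above is presumably along these lines (the column names are hypothetical, and numeric columns usually need binning before counting):

import pandas as pd

df = pd.read_csv("dataset.csv")   # hypothetical dataset

# categorical column: roughly what VisiData's frequency table computes
print(df["category_col"].value_counts())

# numeric column: bin first, otherwise every distinct value gets its own row
print(pd.cut(df["numeric_col"], bins=20).value_counts().sort_index())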

saulpw avatar Nov 17 '21 03:11 saulpw

From #2003:

Given a parquet file with pathologically large (~2 GB) string values, generated as follows:

import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array(["data"*(2**28)] * 2, type=pa.large_string())
data = [arr]*2
batch = pa.RecordBatch.from_arrays(data, ['field_1', 'field_2'])
table = pa.Table.from_batches([batch])

pq.write_table(table, 'pyarrow_2GB_large_string.parquet', compression=None)

VisiData takes 20 seconds to display any data. Also, switching between rows or columns takes between 8 and 14 seconds, depending on whether the columns are set to display the length of the string or the value of the string, respectively.

Note that it takes DuckDB over 2 minutes to display any output, so VisiData is already winning. Polars takes 20 seconds, so it is equivalent to VisiData.

I imagine it should be possible to cache the view such that moving the cursor remains instantaneous. Also, I wonder if it's necessary to load the whole file, or if it's possible to query just enough of the file to display on the screen, which would be significantly less than 2 GB.

Further investigation:

Using a Polars LazyFrame and requesting only the first 10 characters of each column, we can reduce the time to display data to 10 seconds. So there might be something to gain here. The sample file above is only two rows and two columns. It would be interesting to see how the time saving scales to larger numbers of rows and columns:

In [1]: import polars as pl
In [2]: df = pl.scan_parquet('pyarrow_2GB_large_string.parquet')
In [3]: %time df.select([pl.col('field_1').str.slice(0, 10), pl.col('field_2').str.slice(0, 10)]).collect()
CPU times: user 5.2 s, sys: 4.93 s, total: 10.1 s
Wall time: 5.54 s
Out[3]:
shape: (2, 2)
┌────────────┬────────────┐
│ field_1    ┆ field_2    │
│ ---        ┆ ---        │
│ str        ┆ str        │
╞════════════╪════════════╡
│ datadatada ┆ datadatada │
│ datadatada ┆ datadatada │
└────────────┴────────────┘

The same trick in DuckDB reduces the time to display data to only 3 seconds:

select left(field_1, 10), left(field_2, 10) from 'pyarrow_2GB_large_string.parquet';

I'm not sure how to do this directly with pyarrow, however.
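
One possibility, as an untested sketch: pyarrow's compute module has utf8_slice_codeunits, though unlike the Polars and DuckDB examples above it can only truncate after pq.read_table has materialized the full strings in memory, so it would not avoid the initial multi-gigabyte read:

import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("pyarrow_2GB_large_string.parquet")

# truncate each string column to its first 10 code units for display
for name in table.column_names:
    preview = pc.utf8_slice_codeunits(table[name], start=0, stop=10)
    print(name, preview.to_pylist())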

daviewales avatar Sep 02 '23 13:09 daviewales

I noted a performance regression moving between two dev versions.

The earlier dev version was installed to test the command palette. Today, I installed a version that addresses #192. I wonder if the extra try statements or the getch in 3e7b3f3 are causing the slowdown? Sorting a column of a ~2GB csv file now takes 21s where it used to take 16s.

reagle avatar Nov 15 '23 15:11 reagle

@reagle Unfortunately I can't seem to repro the case you're talking about. Would you be able to use git bisect to detect when the performance degrades for you?

saulpw avatar Nov 30 '23 04:11 saulpw

@sochotnicky and @kiedanski So I looked into these a bit, and I think that if we want to seriously address performance issues with larger data, we'd want to change the architecture and/or use something like pandas or DuckDB more proactively. I don't think it's worth it to optimize what we've got at the moment, although if there are cases where it hangs indefinitely and you have to kill the app, we might still look into those.

Thanks to everyone who participated in this thread. I'm going to close this one because I don't see anything else I want to take action on. I'm also not sure the "sticky performance thread" worked too well for us in practice. Please file new, individual issues with concrete use cases for perceived performance problems in the future.

saulpw avatar Nov 30 '23 04:11 saulpw