visidata
[tsv] decrease tsv buffer size for slow input streams
Right now, when a slow stream is piped into vd with the tsv loader, it takes a long time before you see any data. This is caused by https://github.com/saulpw/visidata/commit/08900503ce638461b6d380c4fe0ec7bad13bac6d
Example of a slow stream:
for p in $(seq 0 100); do echo -e "a\tb"; sleep 1; done | vd -f tsv
It takes about 100 seconds to show anything (that is, nothing appears until the stream has finished). So I modified the buffering to flush the buffer every second. This shows results quickly for slow streams, while fast streams still get the 10% speed increase of https://github.com/saulpw/visidata/commit/08900503ce638461b6d380c4fe0ec7bad13bac6d, because the buffer size adapts itself to fast streams.
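To illustrate the idea (this is not the actual patch), a minimal sketch of time-based flushing could look like the following. The names `batched_by_time`, `flush_interval`, and `clock` are made up for this example and are not visidata APIs.

```python
import time


def batched_by_time(lines, flush_interval=1.0, clock=time.monotonic):
    """Group lines into batches, flushing once flush_interval seconds have passed.

    Hypothetical helper for illustration only; not taken from visidata.
    """
    batch = []
    last_flush = clock()
    for line in lines:
        batch.append(line)
        now = clock()
        # A slow stream crosses the interval after very few lines, so batches
        # stay small; a fast stream packs many lines into each batch.
        if now - last_flush >= flush_interval:
            yield batch
            batch = []
            last_flush = now
    # Emit whatever is left once the input is exhausted.
    if batch:
        yield batch
```

Note that in this sketch the clock is only checked when a line arrives, so lines already buffered still wait for the next line (or end of input) if the stream stalls completely.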
- [x] If contributing a core loader, the loader checklist was referenced.
- [x] If registering an external plugin, the plugin checklist was referenced.
It's possible, but it will probably be more of a hassle:
- os.read returns bytes, so the output will have to be converted with iterdecode
- os.read completely bypasses all the magic done in path.py:Path (I'm thinking of reading compressed files)
- I'm afraid it won't work on Windows for e.g. stdin: https://stackoverflow.com/questions/323829/how-to-find-out-if-there-is-data-to-be-read-from-stdin-on-windows-in-python
I'll try some things out and get back to you, but I don't think it will be an elegant solution.
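For reference, the bytes-to-text conversion from the first point could be sketched roughly like this, using `codecs.iterdecode` from the standard library. The helper name `iter_text_chunks` and its parameters are hypothetical, and this says nothing about the compressed-file or Windows concerns.

```python
import codecs
import os


def iter_text_chunks(fd, chunk_size=65536, encoding="utf-8"):
    """Yield decoded text read from file descriptor fd with os.read.

    Hypothetical helper for illustration only; not part of visidata.
    """
    def raw_chunks():
        while True:
            # os.read returns whatever is available, up to chunk_size bytes,
            # and an empty bytes object at end of file.
            chunk = os.read(fd, chunk_size)
            if not chunk:
                return
            yield chunk

    # iterdecode uses an incremental decoder, so a multi-byte character
    # split across two chunks is still decoded correctly.
    yield from codecs.iterdecode(raw_chunks(), encoding)
```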
It's as expected:
import gzip
o = open("t.tsv.gz", "rb")
g = gzip.open(o)
print(o.fileno(), g.fileno())
gives:
3 3
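Both objects share the same file descriptor, so os.read on it would return the raw compressed bytes rather than the decompressed data. So I can't use os.read for compressed files. I don't see any feasible solution with os.read.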
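A small self-contained check of what that means in practice, as a sketch: the function name `raw_vs_decoded` is made up, and it writes its own temporary file instead of using t.tsv.gz.

```python
import gzip
import os
import tempfile


def raw_vs_decoded(data):
    """Return (bytes from os.read on the gzip fileno, bytes from gzip's own read)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "t.tsv.gz")
        with gzip.open(path, "wb") as f:
            f.write(data)
        with gzip.open(path, "rb") as g:
            # g.fileno() is the descriptor of the underlying compressed file,
            # so os.read bypasses the decompression layer entirely.
            raw = os.read(g.fileno(), 2)
        with gzip.open(path, "rb") as g:
            decoded = g.read()
    return raw, decoded


raw, decoded = raw_vs_decoded(b"a\tb\n")
print(raw)      # the first two bytes of the gzip header, not "a\t"
print(decoded)  # the original tab-separated line
```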
I fixed the tests and added documentation.
For performance tests, I did the following. Does it suffice, or do you want something integrated into the automated tests?
$ git checkout tsv
$ cat open-and-quit.vdj
#!vd -p
{"longname": "open-file", "input": "t2.tsv", "keystrokes": "o"}
{"sheet": "t2", "col": "", "row": "", "longname": "quit-all", "input": "", "keystrokes": "", "comment": ""}
$ >t.tsv; >t2.tsv; for P in $(seq 0 10000); do echo -e "a\tb\tc" >> t.tsv; done; for P in $(seq 0 2000); do cat t.tsv >> t2.tsv; done; du -hs t2.tsv
115M t2.tsv
$ time vd --play open-and-quit.vdj t2.tsv
real 3m15.688s
user 3m12.871s
sys 0m4.354s
$ git checkout stable
$ time vd --play open-and-quit.vdj t2.tsv
real 3m17.077s
user 3m14.395s
sys 0m4.155s
Thank you so much @SuRaMoN! This should now be on develop.