iopipe Possible sample apps for iopipe

Possible sample apps for iopipe

Open jondegenhardt opened this issue 7 years ago • 53 comments

Just saw the video of the Boston meetup where you presented this library. It sounds like a natural fit for the TSV tools I've open-sourced: tsv-utils-dlang.

They're not exciting apps, but they are real apps doing real work. They are already fast. Run-time is dominated by IO. They mostly use byLine for input. They might be easy to try with your library, in part because each is a standalone app rather than single large application. It'd be quite impressive if plugging in your library sped them them up in a meaningful way.

The best opportunity for improvement might be the csv2tsv tool. It's easily the slowest of the tools. I haven't tracked down root cause, but it's almost certainly due to byte-by-byte output. The tsv-select tool, which is essentially cut with field re-ordering, is just a hair slower than GNU cut (on the latest LDC release it's faster than the benchmark I published). Improving that would be meaningful.

If you're interested in trying this I'd be game to help you with it.

Dec 18 '16 23:12 jondegenhardt

Hi Jon,

I remember you from the silicon valley meetup that I remotely presented at! and maybe we met at dconf 2016? In any case, I would love to work with you on getting iopipe to enhance your utilities. At the moment, I am strapped for time, but I will contact you about this after the new year. The nice thing about collaborating is that your tools should become much faster, and that iopipe will have some real usage to play with, so win-win! I like how thorough your unit tests are, it should help with making sure iopipe supports unicode fully.

A question for you, what OSes do you support with these tools? I have not written a Windows-supported low-level stream class, but that shouldn't be too hard (I need to do it anyway).

Dec 23 '16 15:12 schveiguy

Hi Steve,

Sounds great. I'll be traveling until after the new year anyway, so that timing works for me as well. I didn't know if you would remember from the Silicon Valley meetup, but I did ask questions! Not from dconf, I've never been to dconf.

As to OSes - I use and test them on linux and OS X. I don't know if anyone has ever tried using them on Windows. The only portability issue on my radar is whether newline is being handled correctly in all cases. Frankly, I've found it difficult to tell from the D documentation how to ensure newline is handled in a portable way.

Should be beneficial to both projects that there are natural benchmarks for these tools. It'd be easy to generate before and after iopipe comparison. That'd be useful if you propose a phobos addition, or if you present iopipe at dconf or elsewhere. There are also some natural external comparison points, both with traditional unix tools and also with similar functionality available in other toolkits.

Beneficial from my perspective as well. A thesis I have is that D should make writing reasonable performant software without engaging in low level programming. Having these tools compete well with other tools bolsters that case.

Dec 23 '16 19:12 jondegenhardt

Hi @jondegenhardt, been a while since I looked at this.

I just created a fork of tsv-utils, and updated tsv-select to use iopipes, which wasn't very difficult. I added a dependency to iopipe. You'll have to clone it locally, as I have not registered with code.dlang.org yet.

I can't find how you run your benchmarks, so I figured maybe you can help with this? Given my tests with straightforward phobos byLine vs. iopipe, this should be a significant improvement.

I have not altered the output mechanism yet, but that shouldn't be too difficult. Since I haven't done any testing or optimization exactly on the output part, I wanted to see the effect of changing the input system first.

Edit: the branch is here: https://github.com/schveiguy/tsv-utils-dlang/tree/tryiopipe

Feb 25 '17 02:02 schveiguy

Fantastic! I'll give it a try this weekend (or even tonight). Won't take very long to try it out. It'll be quite impressive if tsv-select improves, it's already fast.

Also - Do you want to know how I run the benchmark? It's straightforward. Main reason I haven't published it is that I use rather large data files. They are public machine learning data sets, so I can provide the locations.

Feb 25 '17 02:02 jondegenhardt

I should have an initial timing test shortly. There's a bug though - Haven't investigated fully, but it appears the newline is being retained. stdio.File.byLine has a template parameter, keepTerminator, that defaults to No, so the newline gets removed. Looks like that's not happening.

You can see this by building with dub in the root directory (dub run -- --compiler=ldc2. Then 'cd tsv-select', and 'make test-nobuild'. This runs tests against the executable and diffs against output. If you look at the diffs, you'll see that every run that asked for the last field on the line has an extra blank line, suggesting it was written out with an extra newline appended.

Feb 25 '17 03:02 jondegenhardt

It's slower with iopipe. Baseline: 4.22 sec; iopipe: 5.66 sec

I made sure I was building with the same flags. Is iopipe decoding to dchar? That would make a difference. My baseline code avoids that, because with utf-8 input TSV delimiters are always ascii. There is decoding to dchar when necessary, for example, case insensitive string compare in tsv-filter. But tsv-select only needs to identify delimiter characters (newline and tab).

Feb 25 '17 04:02 jondegenhardt

Tests are run against the HEPMASS data set from the UCI machine learning repository: http://archive.ics.uci.edu/ml/datasets/HEPMASS. The file is the all_train.csv.gz file from here: http://archive.ics.uci.edu/ml/machine-learning-databases/00347/

Unzip it. This file doesn't use CSV escapes, so you can run against the csv as:

$time tsv-select -d , -f 1,8,19 all_train.csv > /dev/null

Or, convert to TSV first with csv2tsv (csv2tsv all_train.csv > all_train.tsv) and run:

$time tsv-select -f 1,8,19 all_train.tsv

For consistent runs, warm-up any disk cache by running something wc -l against the target file first.

Feb 25 '17 05:02 jondegenhardt

Thanks. I was expecting an easy win, but should have known it would be more challenging :)

Many reasons I can think of for the poor performance -- I only tested on dmd, so my numbers might be skewed; phobos byLine may have improved in the meantime; some unforeseen interaction between the data set and the code; cache or inlining issues; etc. I'm sure I can figure out what is happening and speed it up.

Is iopipe decoding to dchar?

No, the byLine encodes the given dchar, and looks specifically for that sequence. It has special paths for one element, so it's not checking sequences in the hot code. But maybe still room for improvement.

it appears the newline is being retained

Yes, iopipe byline is not the same thing as Phobos byline. It simply regulates the release of data to the downstream pipe by looking for a delimiter. It can't remove the delimiter because that's in the stream. It allows interesting things like always keeping 3 lines in context for instance.

I will make a range adapter that removes the newline.

Feb 25 '17 13:02 schveiguy

Fall 2015, just as I started programming in D, there was a significant byLine speedup. Version 2.068.0. PR: https://github.com/dlang/phobos/pull/3089 Forum: http://forum.dlang.org/thread/[email protected]

Feb 25 '17 19:02 jondegenhardt

My tests were done after that. After dconf 2016 actually.

Feb 25 '17 20:02 schveiguy

Some update: using the latest compiler/phobos, iopipe is still faster with dmd with the test I was running (which wasn't nearly as large as yours, only about 400k lines), but only by a tiny bit. I'm downloading the test file you have now to see if I see the same results with dmd on that. If that still sees the same results, then I'll try ldc.

Feb 26 '17 00:02 schveiguy

with dmd on your test file, iopipe is much much slower, even with dmd. I can work on this and figure out why that's happening.

Feb 26 '17 00:02 schveiguy

@jondegenhardt I have updated iopipe. Made some really dumb errors, you can see with the last commit how they are fixed. Basically, I messed up an optimization when the buffer is an array, so was using the fallback implementation. I also improved the fallback implementation to use a local vs. looking up data from the struct.

On my simple test (basically counts lines in a file) running on your test case, I am now very close to matching the performance of Phobos, but not beating it (7 seconds for iopipe, 6.4 for Phobos). This is a bit disappointing since I was thinking my code was 2x faster. My original test case has now returned to 2x faster, but it's so small that could be due to some sort of set up that the Phobos one is doing (the times are miniscule).

Testing tsv-select I see similar results (14.4s for iopipe vs. 13.7s for Phobos). This is still with DMD, so possibly iopipe version will be better or worse compared to Phobos on ldc/gdc with its superior inlining capability.

Feb 26 '17 01:02 schveiguy

Okay, I'll pull your updates and see what I get. (Can't do it right this moment, though).

Don't know if you saw the benchmarks I published a few days ago, but one curious bit is that tsv-select is quite a bit faster than GNU cut on large files, but GNU cut is faster on small files. Somewhat similar.

Feb 26 '17 01:02 jondegenhardt

Pulled the latest and ran with LDC 1.1. Definitely faster, but still not quite matching Phobos performance. phobos: 4.20; iopipe: 4.91.

I also tried add a couple newer LDC compile flags: lto=full and -enable-cross-module-inlining. No change.

Feb 26 '17 03:02 jondegenhardt

There has to be something I'm missing. readlnimpl in Phobos copies every single line possibly twice, even with getline available. getline must be so optimized that it makes up for all those copies. Will look some more

Feb 26 '17 12:02 schveiguy

More data: Using just getline and not phobos results in an insignificant improvement. In other words, the copying costs almost nothing. Definitely not what I expected.

Feb 26 '17 15:02 schveiguy

~~OK, I now have it beating Phobos and getline: Changing the read size to 1MB instead of 8KB definitely helps. But my API doesn't allow this as a parameter to bufferedInputSource. I will update momentarily.~~

Feb 26 '17 15:02 schveiguy

I take it back. I did something really stupid -- I forgot to make sure nothing else was consuming CPU. I disabled some other programs, and now what was once 7s is now down to ~2 seconds, and phobos and getline are both still beating iopipe.

Feb 26 '17 16:02 schveiguy

And now, the copying is definitely having an effect -- 1.8s for straight getline, 2.1 for phobos, so a 15% effect. iopipe is still slower at 2.9s. I will figure this out.

Feb 26 '17 16:02 schveiguy

Got it 😃

I cheated a bit and looked at the source for getdelim. Discovered that it uses memchr to find the delimeter. peeked at that source, and found that it searches a word at a time instead of a byte at a time. This is why my loop is much slower.

I have updated iopipe to use memchr when possible (e.g. when code unit type is char, and delimeter is one code unit). I suspect we could improve things in other ways (e.g. using simd; or similarly optimized searching for multi-byte delimeters).

New times for my test: phobos = 2.1s, getline = 1.8s, iopipe 1.1s. That's a pretty significant drop, I would expect LDC-compiled iopipe to now outperform phobos on tsv-select. On my system with dmd, it's 9.5s (iopipe) vs 10.9s (phobos).

Feb 26 '17 16:02 schveiguy

One further note: I still need to make range adapter to remove newline.

Feb 26 '17 16:02 schveiguy

Very cool! I'll give it a try.

Feb 26 '17 18:02 jondegenhardt

Yes!! Phobos: 4.13; IOPipe: 3.22. That's with LDC 1.1.

Nice job. Looks like you've got a real win here. At least in the context of my programs, this is very impressive, as they are already beating similar tools by wide margins. The range adapter will have to operate without slowing it down of course.

A couple notes:

I built these by putting all the iopipe files directly on the ldc command line, rather than using the dub build approach, which builds the library separately. I did this in case there was a cross module optimization ldc could apply. I'll go back and try the dub build approach see if there's a difference.
You'll notice there are small deltas in the timing numbers I report, usually < 20ms. My machine is not completely quiet either, but I do make multiple runs every time, so I have good confidence that metrics reported together are done in equivalent circumstances.

Update: I get the same numbers with the dub build.

Feb 26 '17 19:02 jondegenhardt

I have a theory with the performance difference between "short" files and "long" ones. One property of your test file is that the lines are pretty long, over 700 characters for each line. An optimized search for newlines going 1 word at a time vs. one char is going to beat a straight-up byte-wise search, but doesn't negate the fact that you have to make an opaque call. With shorter lines, a for-loop beats a call to memchr. With longer lines, the memchr call's overhead is dwarfed by the speedup in the search. So my theory is that even with a large file, if it has shorter lines, then the for-loop might win.

The best of both worlds may be an inline-able implementation for searching that can do word-sized comparisons. In other words, reimplement what memchr does, and let the compiler find the most optimized path.

Feb 26 '17 20:02 schveiguy

Hadn't thought of that, makes sense. I've been aware that different file shapes may affect performance. I've done some spot checks, but nothing comprehensive. Mainly, it took enough time to assemble and run a single set of benchmarks across all the competing tools I tested. But yes, the hepmass file contains doubles at written in high precision, so it has long lines even though it's only 29 columns. Another set of files I use are the google ngram files. These are large in GB size, huge number of rows/lines, but each line is short (4 columns, 3 are small integers, 1 short text field).

Having the compiler inline it make sense. Made me wonder if the compiler might be able to do this (esp LDC/LLVM). I don't know the answer, but a google search turned up a blog post that might be relevant: https://gms.tf/stdfind-and-memchr-optimizations.html

Feb 26 '17 20:02 jondegenhardt

Ah - Just realized, there's a simple data point that can be used to help test your hypothesis: Run tsv-select against a smaller hepmass file. eg. The first 100,000 lines, or whatever number results in an approx 200-300MB file. This would preserve the per-line size characteristics. Because yes, the smaller file I tested against did have a different shape. Not as radically different as the google ngram files, but still different. I'll run this later - Have to take off for a while.

Feb 26 '17 20:02 jondegenhardt

Update: added mechanism to omit delimiters for byline. Updated my branch of tsv-utils to use it. Note: files that end in newline show an extra element. I'm not sure what the correct way to do this is, because traditionally, files that don't end in newline are flagged as errors by my editor. However, a traditional "split" of data based on a delimiter should give a blank element on such a case. Indeed in a csv row, "a,b,c," would be a 4-field row.

In any case, tsv-select complains about it.

Note that performance is still good with the new range adapter.

Feb 26 '17 21:02 schveiguy

Thinking about it more, I need to eliminate that extra empty "row", as it worked fine before trimming the delimiter. I'll think about how I can refactor the code to do so.

Feb 26 '17 22:02 schveiguy

Pushed a new version that fixes the last line issue.

Feb 26 '17 22:02 schveiguy

iopipe iopipe copied to clipboard

Possible sample apps for iopipe

iopipe
iopipe copied to clipboard