
Faster C++ word_count benchmark

Open stephenberry opened this issue 2 years ago • 2 comments

A little cleanup makes this benchmark about 3X faster in C++. It could be made significantly faster still with SIMD, but changes like these would typically live in a simple C++ library.

stephenberry avatar Dec 12 '22 11:12 stephenberry

Thank you for your pull request. We require contributors to agree to our Contributor License Agreement (https://exaloop.io/cla.txt), and we don't have @stephenberry on file. In order for us to review and merge your code, please email [email protected] to get yourself added.

cla-bot[bot] avatar Dec 12 '22 11:12 cla-bot[bot]

I sent an email to request being added as a contributor.

stephenberry avatar Dec 12 '22 11:12 stephenberry

This outputs different results on my test dataset (original gives 2633615 whereas this version gives 1367284). Haven't looked too deeply at the changes yet but maybe words are not being tokenized properly.

arshajii avatar Dec 16 '22 15:12 arshajii

Interesting. I didn't make it support carriage returns. Does your input file have them? My test file reads in the dictionary multiple times and seems to work fine. If you could provide a sample of the input file it should be an easy fix.

stephenberry avatar Dec 16 '22 15:12 stephenberry

It actually gives different results on a file containing a b c (with newline at the end). Original gives 3 as expected, but the new version gives 2.

arshajii avatar Dec 16 '22 15:12 arshajii

Oh, I was assuming words were split by new lines and not spaces. Are all the words split by spaces?

stephenberry avatar Dec 16 '22 16:12 stephenberry

That's correct: the words are separated by 1 or more whitespace characters (e.g. ' ', '\t' etc.), so they're tokenized with Python's .split() or the istringstream used in the original C++ implementation.

arshajii avatar Dec 16 '22 16:12 arshajii
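
For reference, a minimal sketch of that behavior: istringstream extraction skips runs of any whitespace, mirroring Python's .split(). This is illustrative only, not the benchmark's exact code.

```cpp
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // operator>> skips runs of any whitespace (' ', '\t', '\n', ...),
    // so "a  b\tc\n" yields exactly three tokens, matching Python's .split()
    std::istringstream iss("a  b\tc\n");
    std::string word;
    int count = 0;
    while (iss >> word)
        ++count;
    std::cout << count << '\n';  // prints 3
}
```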

This fix should now work; it splits on one or more spaces or tabs.

stephenberry avatar Dec 16 '22 16:12 stephenberry

Thanks for the change -- although the current versions actually split on runs of any whitespace character, so it's more than just spaces and tabs.

In general, though, I would rather not implement a custom parser/tokenizer in the C++ version while the Python version uses the off-the-shelf .split(). Also I'd like to keep both of them as "read file line by line" instead of having the C++ version read the entire file immediately (could do the same in Python with file.read()). I do think the current implementation is how this type of program would typically be written. Going to close this one for now.

arshajii avatar Dec 17 '22 14:12 arshajii

There are literally dozens of libraries that do this for you. I just didn't want to bring in a library. You're not comparing compiler performance if you don't use efficient code; you're just comparing different implementations. It makes your benchmark irrelevant. If you were writing libraries it would be fine to show benchmarks like this, but for a compiler this really isn't acceptable.

stephenberry avatar Dec 17 '22 15:12 stephenberry

Stephen, the changes you’re suggesting 1) change the algorithm fundamentally by reading the entire file instead of reading it line by line, and 2) implement a new tokenizer instead of using the standard library. I don’t see how this is a fair comparison to an implementation that does neither of those things.

In a comparison like this, the inherent efficiency of the implementation is not as important as making sure the different implementations are actually doing the same thing, insofar as the languages will allow. Otherwise, we would just implement the C version in inline assembly, and we could do the same in Codon using inline LLVM.

The point of these benchmarks is not to measure “peak C++ performance” vs. “peak Codon performance”, but instead to compare a “reasonable C++ implementation” against a “reasonable Python/Codon implementation”. As I said in the issue you opened, the Codon versions can themselves be made substantially faster in various ways, but we haven’t done that for the reasons listed above.

More than happy to continue this discussion or answer any follow up questions in Slack.

arshajii avatar Dec 17 '22 15:12 arshajii

Normally a benchmark should be optimized for the task at hand, but if you want to require streaming line by line, that can be optimized as well. So, I can submit a pull request for that version. I only spent a few minutes solving the problem in a more performant manner.

The current code could easily use std::isspace to handle all whitespace as well, so it isn't really a custom tokenizer. Thanks for the Slack link.

stephenberry avatar Dec 17 '22 16:12 stephenberry
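
A sketch of what such an isspace-based scan might look like; this is hypothetical code for illustration, not the PR's actual implementation:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical manual tokenizer: std::isspace (default C locale) matches
// ' ', '\t', '\n', '\v', '\f', and '\r', so this splits on runs of any
// whitespace rather than just spaces and tabs.
std::unordered_map<std::string, int> count_words(const std::string& buf) {
    std::unordered_map<std::string, int> counts;
    std::size_t i = 0, n = buf.size();
    while (i < n) {
        while (i < n && std::isspace(static_cast<unsigned char>(buf[i])))
            ++i;                        // skip a whitespace run
        std::size_t start = i;
        while (i < n && !std::isspace(static_cast<unsigned char>(buf[i])))
            ++i;                        // consume one word
        if (i > start)
            ++counts[buf.substr(start, i - start)];
    }
    return counts;
}
```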

I think you're missing the point of my comment -- manually optimizing / adding your own way of parsing the file becomes an apples-to-oranges comparison. The current implementations are apples-to-apples: read the file line by line, partition into words using standard library, count words using a dictionary.

For example, the changes you made to the binary_trees benchmark were reasonable in this regard and still apples-to-apples: they didn't change anything algorithmically, didn't involve some new/optimized code to accomplish the task, and generally that's how that code could be written in practice.

I would really appreciate it if we could discuss before you spend time to create/submit any more PRs on this as I want to make sure we're on the same page.

arshajii avatar Dec 17 '22 16:12 arshajii
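
For concreteness, the apples-to-apples structure described above (read line by line, tokenize with the standard library, count with a dictionary) looks roughly like this in C++. This is a sketch, not the benchmark's exact source; the input path arriving via argv[1] is an assumption.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main(int argc, char** argv) {
    if (argc < 2)
        return 1;
    std::ifstream in(argv[1]);          // read the file line by line
    std::unordered_map<std::string, int> counts;
    std::string line, word;
    while (std::getline(in, line)) {
        std::istringstream iss(line);   // standard-library tokenization
        while (iss >> word)
            ++counts[word];             // dictionary-based counting
    }
    std::cout << counts.size() << '\n'; // number of distinct words
}
```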

Note that many of these optimizations should be handled by the compiler itself. One of the goals of the Codon project is to let users write simple code that the compiler can optimize. Codon code can also be manually optimized, but that's not the point here. We know that C(++) code can always match or beat Codon, but the question is at what cost: would you rather work with 5-line Codon code or 300-line C++ code?

The fact that the C++ standard library is not sufficiently optimized is honestly not our problem. Even then, an argument can be made for using unordered_map / istringstream: the intent is obvious, and you won't run into buffer overflow issues.

inumanag avatar Dec 17 '22 17:12 inumanag