etl-language-comparison

Standardize algorithms

Open fervic opened this issue 9 years ago • 1 comments

I see that contributions have taken different approaches to solving the same problem, so in the end the benchmark is not comparing the languages themselves.

My suggestion would be to set a guideline for contributing which explains the standard approach, like:

  • Should read its input from files
  • Should take the number of workers/threads to use as a parameter
  • May buffer writes, but the buffer size should be capped at a set limit
  • Should use regular expressions, or include both versions: with and without regexps
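To make the guideline concrete, here is a minimal Python sketch of what a standard entry point could look like. Everything named here is an assumption for illustration only: the `--workers` and `--no-regex` flags, the 64 KiB cap, and the tab-separated output format are not taken from the repository.

```python
import argparse

# Hypothetical cap on the write buffer, in bytes (the guideline does not fix a value).
MAX_WRITE_BUFFER = 64 * 1024


def parse_args(argv=None):
    """Parse the options a standard submission would be expected to accept."""
    parser = argparse.ArgumentParser(description="Standard-approach ETL entry")
    parser.add_argument("input_dir", help="directory containing the input files")
    parser.add_argument("--workers", type=int, default=4,
                        help="number of workers/threads to use")
    parser.add_argument("--no-regex", action="store_true",
                        help="run the variant that avoids regular expressions")
    return parser.parse_args(argv)


def write_counts(path, counts, buffer_size=MAX_WRITE_BUFFER):
    """Write key/count pairs through a buffer no larger than the agreed cap."""
    if buffer_size > MAX_WRITE_BUFFER:
        raise ValueError("write buffer exceeds the agreed limit")
    # buffering=N (N > 1) sets the size of the underlying binary buffer.
    with open(path, "w", buffering=buffer_size, encoding="utf-8") as out:
        for key, n in sorted(counts.items()):
            out.write(f"{key}\t{n}\n")
```

Each language would expose the same knobs in its own idiom; the point is only that the worker count and the buffer limit are explicit and comparable across submissions.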

Maybe also allow submitting a non-standard approach that takes advantage of specific language features but keep that one marked as the special one.

So in the end there would be two sets of solutions: (1) the standard one that follows the rules and (2) the optimized or non-standard one.

fervic avatar Nov 02 '15 14:11 fervic

That's a fantastic suggestion @fervic. I had similar thoughts that I was going to bring up in my next blog post. Here's what I was going to suggest.

Rules of Reference Implementation

  1. Stream input from files.
  2. Use Regular Expressions to check for the presence of knicks.
  3. Have multiple mappers, but one reducer.
  4. Each individual worker holds its results in a hash and sends that final hash back for reduction.
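As a rough illustration of the four rules, here is a minimal Python sketch. It is not one of the repository's implementations. The input layout (tab-separated lines with the grouping key first and the text last), the case-insensitive match, and the function names are all assumptions. Threads stand in for workers to keep it short; a process pool would be the usual choice for CPU-bound work in CPython.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Rule 2: a regular expression checks for the presence of "knicks".
# Case-insensitive matching is an assumption.
KNICKS = re.compile(r"knicks", re.IGNORECASE)


def map_file(path):
    """Mapper: stream one file and keep its results in a local hash."""
    local = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:  # Rule 1: iterate line by line instead of reading it all
            fields = line.rstrip("\n").split("\t")
            # Assumed layout: first field is the grouping key, last is the text.
            if KNICKS.search(fields[-1]):
                local[fields[0]] += 1
    return local  # Rule 4: the final hash is sent back for reduction


def reduce_counts(partials):
    """Reducer: merge every mapper's hash into a single total."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def run(paths, workers=4):
    """Rule 3: multiple mappers, one reducer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce_counts(pool.map(map_file, paths))
```

Calling `run(list_of_paths, workers=8)` returns one `Counter` keyed by the first field of each matching line.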

One suggestion you made that I don't have was to limit the # of workers/threads, but that's not always simple depending on the language and framework. Any other suggestions?

dimroc avatar Nov 15 '15 00:11 dimroc