etl-language-comparison
Standardize algorithms
I see that contributions have taken different approaches to solving the same problem, so in the end the benchmark is not comparing the languages themselves.
My suggestion would be to add a contributing guideline that defines the standard approach, for example:
- Should read its input from files.
- Should take the number of workers/threads to use as a parameter.
- May buffer writes, but the buffer size should be capped at a fixed limit.
- Should use regular expressions, or should include both versions: with and without regexps.
Maybe also allow submitting a non-standard approach that takes advantage of specific language features, but keep that one marked as the special one.
In the end there would be two sets of solutions: (1) the standard one that follows the rules and (2) the optimized or non-standard one.
That's a fantastic suggestion @fervic. I had similar thoughts that I was going to bring up in my next blog post. Here's what I was going to suggest.
Rules of Reference Implementation
- Stream input from files.
- Use Regular Expressions to check for the presence of knicks.
- Have multiple mappers, but one reducer.
- Each individual worker holds its results in a hash and sends that final hash back for reduction.
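To make the rules above concrete, here is a rough Python sketch of what a reference implementation could look like. It is only an illustration, not the repo's actual code: the tab-separated layout (grouping key in the first column, message text in the last) and the names `map_file`, `reduce_counts`, and `run` are assumptions made for the example. It also takes the worker count as a parameter, as suggested earlier in the thread.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Regular expression used to check for the presence of "knicks".
KNICKS = re.compile(r"knicks", re.IGNORECASE)


def map_file(path):
    """Mapper: stream one file and keep its results in a local hash."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:  # iterating the file object streams it line by line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip malformed lines
            # Assumed layout: key in the first column, text in the last.
            key, text = fields[0], fields[-1]
            if KNICKS.search(text):
                counts[key] += 1
    return counts  # the final hash is sent back for reduction


def reduce_counts(partials):
    """Single reducer: merge the per-worker hashes into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def run(paths, workers=4):
    """Run `workers` mappers over the input files, then reduce once."""
    with Pool(processes=workers) as pool:
        partials = pool.map(map_file, paths)
    return reduce_counts(partials)
```

On platforms where `multiprocessing` spawns fresh processes, `run` should be called from under an `if __name__ == "__main__":` guard.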
One suggestion of yours that I hadn't included is limiting the number of workers/threads, but that's not always simple, depending on the language and framework. Any other suggestions?