parquet-java
parquet-java copied to clipboard
PARQUET-315: Add PARQUET_1_0 and non-repeated data performance tests …
Here's a list of changes I did on the parquet-benchmarks module:
- Add variable to test PARQUET_1_0 format
- Add variable to generate random data for the tests
- Move part of the DataGenerator.generate() code to constructors so that we measure time when when writing data only.
- Add annotation to display AverageTime. The Throughput mode was always displaying less than 1 op/s. So I thought time was a better mode here.
@nezihyigitbasi @danielcweeks You did a good job starting the parquet-benchmarks module. Could you please review these changes?
@spena thanks for taking the time to create this PR. Can you also update the README of the benchmark module?
@spena Added a few comments. Once you update the README and address the comments, I will be happy to give it a try. Thanks again!
Thanks @nezihyigitbasi for the feedback. I fixed the issues you mentioned. You will be able to generate data this time.
Thanks @nezihyigitbasi I added the changes based on your feedback. I liked your explanation about random data, so I replaced the old paragraph with yours.
I tried to figured out why the benchmark cannot be executed when compiling with hadoop-2, but I did not know why. So, I just replace the README to use the command with out the profile.
One thing. The read-benchmark.sh tests will fail when using parquetVersion=v2 and randomData=true due to the issue in PARQUET-152. If the patch is applied, then all tests run fine.
@spena To keep the fixes to the existing benchmark module separate I created a new PR: https://github.com/apache/parquet-mr/pull/226
Yep I verified that your change makes the read-benchmark go through.
@spena I only added a few minor comments to the README. Other than that LGTM. @rdblue I have provided feedback to @spena and verified his changes. When he addresses my final comments this patch LGTM. It's all yours now.
Thanks @nezihyigitbasi. I added the minor changes you mentioned.
Once #226 is committed, I will merge the code to my branch, and check it is applied correctly.
@nezihyigitbasi @rdblue I updated the branch with the latest changes of master (includes #226).
No more work to do here.
Thanks @spena, I'll take a look at this today.
@rdblue did you want to merge this? @spena do you need to merge master