Validate parquet row group size and HDFS block size
The OutputFormat should verify that parquet.block.size < dfs.blocksize, because a row group larger than an HDFS block necessarily spans blocks, which hurts read performance. In addition, we could check that (dfs.blocksize % parquet.block.size) < 1MB to ensure that a whole number of row groups is approximately the size of an HDFS block.
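A minimal sketch of the two checks described above, written as plain arithmetic on byte counts. The class name `BlockSizeCheck`, its method names, and the constant are hypothetical and are not part of the Parquet API; a real implementation would read both sizes from the job configuration and decide whether to warn or fail.

```java
public final class BlockSizeCheck {
    // Assumed tolerance: the 1MB figure proposed in the description.
    public static final long MAX_REMAINDER_BYTES = 1024L * 1024L;

    private BlockSizeCheck() {}

    /** True when the row group size is positive and strictly smaller than the HDFS block size. */
    public static boolean fitsInBlock(long rowGroupSize, long dfsBlockSize) {
        return rowGroupSize > 0 && rowGroupSize < dfsBlockSize;
    }

    /** True when whole row groups fill an HDFS block with less than 1MB left over. */
    public static boolean isAligned(long rowGroupSize, long dfsBlockSize) {
        return fitsInBlock(rowGroupSize, dfsBlockSize)
            && (dfsBlockSize % rowGroupSize) < MAX_REMAINDER_BYTES;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 128MB block, 64MB row groups: remainder is 0, so the sizes align.
        System.out.println(isAligned(64 * mb, 128 * mb));
        // 128MB block, 100MB row groups: remainder is 28MB, so they do not.
        System.out.println(isAligned(100 * mb, 128 * mb));
    }
}
```

With a 128MB block, 64MB row groups pass both checks, while 100MB row groups pass the first check but leave a 28MB remainder and fail the second.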
Reporter: Ryan Blue / @rdblue
Related issues:
- Improve alignment between row groups and HDFS blocks (depends upon)
Note: This issue was originally created as PARQUET-166. Please see the migration documentation for further details.
Ryan Blue / @rdblue: The first part of this, ensuring that the row group size is less than the block size, was added in PARQUET-306 along with row group padding. We should determine whether the second part is worth doing. People will ignore warnings and would not appreciate errors.