presto-hive-apache
presto-hive-apache copied to clipboard
Upgrade to parquet 1.8.1
@dain @electrum here is my draft. Need to update presto parquet code namespaces when a new presto-hive-apache version is released
This looks good, but I like to have the Presto changes ready first instead so that we know if anything is missing. I made most of the changes here in this branch so that it compiles:
https://github.com/electrum/presto/tree/parquet
With the exception of TestParquetTimestampUtils
which uses NanoTimeUtils
from Hive which references the old Parquet classes. I'm wondering if there are other classes in Hive that we use which will need the old classes (and would fail with ClassNotFoundException
or similar).
There are other parts of the code that imports hive's parquet classes so this change should be tested properly to see whether we hit old/new parquet package/class conflicts.
nyigitbasi@lgml-nyigitbasi /tmp/presto (master) $ grep -r org.apache.hadoop.hive.ql.io.parquet --include \*.java *
presto-hive/src/main/java/com/facebook/presto/hive/HiveStorageFormat.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
presto-hive/src/main/java/com/facebook/presto/hive/HiveStorageFormat.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
presto-hive/src/main/java/com/facebook/presto/hive/HiveStorageFormat.java:import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java:import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java: .add("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetRecordCursorProvider.java: .add("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTimestampUtils.java: * This class is equivalent of @see org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTime,
presto-hive/src/test/java/com/facebook/presto/hive/parquet/ParquetTester.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
presto-hive/src/test/java/com/facebook/presto/hive/parquet/ParquetTester.java:import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
presto-hive/src/test/java/com/facebook/presto/hive/parquet/TestParquetTimestampUtils.java:import org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils;
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveFileFormats.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveFileFormats.java:import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
@electrum @zhenxiao I am currently looking at upgrading presto to parquet 1.8.1 and hit some issues during runtime. Presto's hive-apache package depends on hive 1.2 and that version still uses the old parquet packages (parquet.*
), which causes these runtime issues.
For example, in my tests when querying a parquet table presto creates an instance of org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
and that class tries loading parquet.hadoop.ParquetInputFormat
, which it cannot find. How do we want to handle this? Do we want to upgrade to hive 2.0, which uses the new parquet packages, or do we want to handle it some other way?
@nezihyigitbasi MapredParquetInputFormat
is a simple class. Might be easiest to fork it? Or bypass it entirely? Given we have a native Parquet reader, it might not be needed anymore.
+1
@nezihyigitbasi -- what's the status of this PR?