parquet4s
Can I trouble you for a code review of integrating a newer version of parquet4s into Delta DSR/DSW?
I am not sure how much effort it is; I am just asking if you would be willing and available to look over a PR. It's currently stuck at 1.2.1, which is just too far back. That causes trouble with dependencies, and I really don't like shading if I have to use it as a workaround.
@MironAtHome Sure thing!
Most of the changes were to wrap file paths expressed as strings into the Path imported from the latest version of parquet4s. However, two kinds of errors turned out a bit heavier to mend: 1. ParquetReader.read in CloseableParquetDataIterator.scala:154; 2. ValueCodec, with the root of the trouble in RowParquetRecordImpl.scala:308 and a few lines related to ValueCodec. I am not certain what to replace those with directly. If such a substitute is readily available, please suggest it; otherwise I will put together a few lines to wrap the new classes and expose the same interface.
I think that this can be helpful: https://mjakubowski84.github.io/parquet4s/docs/migration/
Regarding RowParquetRecordImpl:
- ValueCodec's implementation is now split into ValueEncoder + ValueDecoder, so here you probably want to use ValueDecoder.intDecoder
- private def customSeqCodec[T](elementCodec: ValueCodec[T])(implicit seems redundant to me. The comment above the function does not sound true to me.
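To illustrate the split described above, here is a minimal sketch in plain Scala. Every name in it (ToyValue, ToyCodec, ToyEncoder, ToyDecoder) is a hypothetical stand-in invented for this example and is not the parquet4s API; the real signatures are in the migration guide linked above. It only shows the shape of the change: one two-way type class becomes two one-way type classes, and a thin adapter can re-expose the old combined interface if existing code still expects it.

```scala
// Toy model only: simplified stand-ins, NOT the parquet4s API.
sealed trait ToyValue
final case class ToyInt(value: Int) extends ToyValue
case object ToyNull extends ToyValue

// Old style: a single type class that both writes and reads.
trait ToyCodec[T] {
  def encode(data: T): ToyValue
  def decode(value: ToyValue): T
}

// New style: each direction is its own type class.
trait ToyEncoder[T] { def encode(data: T): ToyValue }
trait ToyDecoder[T] { def decode(value: ToyValue): T }

object ToyEncoder {
  val intEncoder: ToyEncoder[Int] = new ToyEncoder[Int] {
    def encode(data: Int): ToyValue = ToyInt(data)
  }
}

object ToyDecoder {
  // A read-only code path needs only this instance.
  val intDecoder: ToyDecoder[Int] = new ToyDecoder[Int] {
    def decode(value: ToyValue): Int = value match {
      case ToyInt(i) => i
      case ToyNull   => throw new IllegalArgumentException("null is not an Int")
    }
  }
}

object ToyCodec {
  // Adapter: rebuilds the old combined interface from the two halves.
  def from[T](enc: ToyEncoder[T], dec: ToyDecoder[T]): ToyCodec[T] =
    new ToyCodec[T] {
      def encode(data: T): ToyValue = enc.encode(data)
      def decode(value: ToyValue): T = dec.decode(value)
    }
}
```

A reader-only component such as a row record implementation would then depend on the decoder alone, and the adapter is needed only where both directions are still passed around together.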
Great, thank you for your guidance and help. I started from this PR to ensure the build works: https://github.com/mjakubowski84/parquet4s/pull/257 The replacement of ValueCodec worked; I am down to the last 5 errors, all related to the type passed as a generic parameter to reading. If I don't get enough time to finish it today, I will finish over Sat/Sun. And thank you for tending to the PR above.
Hey Marcin, the PR is ready: https://github.com/delta-io/connectors/pull/303 Please provide feedback.
Only two comments from me. One is a minor code change, but the other worries me. I see that Delta relies on some ancient version of parquet-hadoop. I wonder if that can be upgraded without any issue.
Hey Marcin, sorry for the long turnaround; I had to get over a case of COVID. I have run into an issue with boolean type decoding in the connector with version 2.3.0 (the latest available on Maven) of parquet4s. Could you please take a look? I have created a private repo for this unit test: https://github.com/MironAtHome/connectors-private.git (branch miron/integrate-parquet4s-23). Here are the lines where I am getting the assertion failure:
val b: Boolean = (i % 2 == 0)
val rowB: Boolean = row.getBoolean("as_boolean")
println(s"Evaluating row value ${rowB} to ${b} for (${i} % 2 == 0)} as ${(i % 2 == 0)}")
assert(row.getBoolean("as_boolean") == (i % 2 == 0))
Here is the assertion text, with a few trace rows printed before it:
---
[info] DeltaDataReaderSuite:
Evaluating row value true to true for (4 % 2 == 0)} as true
Evaluating row value false to false for (5 % 2 == 0)} as false
Evaluating row value true to true for (6 % 2 == 0)} as true
Evaluating row value false to false for (7 % 2 == 0)} as false
Evaluating row value true to true for (8 % 2 == 0)} as true
Evaluating row value false to false for (9 % 2 == 0)} as false
Evaluating row value false to true for (0 % 2 == 0)} as true
[info] - read - primitives *** FAILED ***
[info] false did not equal true (DeltaDataReaderSuite.scala:87)
---
Here are the rows from the test table, as read by Spark:
scala> spark.read.format("delta").load("./golden-tables/src/test/resources/golden/data-reader-primitives").show()
+------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+
|as_int|as_long|as_byte|as_short|as_boolean|as_float|as_double|as_string|as_binary|as_big_decimal|
+------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+
| 4| 4| 4| 4| true| 4.0| 4.0| 4| [04 04]| 4|
| 5| 5| 5| 5| false| 5.0| 5.0| 5| [05 05]| 5|
| 6| 6| 6| 6| true| 6.0| 6.0| 6| [06 06]| 6|
| 7| 7| 7| 7| false| 7.0| 7.0| 7| [07 07]| 7|
| 8| 8| 8| 8| true| 8.0| 8.0| 8| [08 08]| 8|
| 9| 9| 9| 9| false| 9.0| 9.0| 9| [09 09]| 9|
| null| null| null| null| null| null| null| null| null| null|
| 0| 0| 0| 0| true| 0.0| 0.0| 0| [00 00]| 0|
| 1| 1| 1| 1| false| 1.0| 1.0| 1| [01 01]| 1|
| 2| 2| 2| 2| true| 2.0| 2.0| 2| [02 02]| 2|
| 3| 3| 3| 3| false| 3.0| 3.0| 3| [03 03]| 3|
+------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+
As shown above, Spark 3.2.1 with Hadoop 3.3.1 reads the values correctly. However, the unit test fails its assertion because the variable rowB contains "false" when the expected value is "true", for the row where the field "as_int" equals 0.
This affects DeltaDataReaderSuite.scala.
I would much appreciate your help verifying this issue.
Hi @MironAtHome.
I cannot see the code, as your repo is private. I recommend having a look at the boolean decoder and debugging what happens there. I do not recall whether the business logic there has changed since version 1.0 - probably not. And BTW - latest version of Parquet4s is now 1.4.1 :)
Marcin, it is so nice to have your comments. I apologize; I knew my repo's security setup wasn't right for your access, but I thought that inviting you through the issue link might open it up for you. Give me till EOD (it's 9:47 AM in Seattle right now) to get to it, or to ask you for further assistance. I will open access right now.
It took me some time to debug those tests. Thousands of dependencies and buggy resource loading of golden tables.
So, the issue seems to be with the test data or with the test itself.
Tests fail on NullValue. as_boolean is set to be nullable. In effect, here we cast null to Boolean - https://github.com/MironAtHome/connectors-private/blob/miron/integrate-parquet4s-23/standalone/src/main/scala-2.12/io/delta/standalone/internal/data/RowParquetRecordImpl.scala#L167. And null is resolved as false. So if the test expects true, then it must fail.
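The null-to-false behaviour is easy to reproduce in plain Scala without any Parquet code. The sketch below is not the connector's implementation; NullBooleanDemo and its two helpers are hypothetical names used only to show the effect of the cast and one possible way to make the null case explicit.

```scala
object NullBooleanDemo {
  // Unsafe: casting a null reference to Boolean unboxes it to the
  // default value, so a missing (null) column silently becomes false.
  def unsafeGetBoolean(raw: Any): Boolean = raw.asInstanceOf[Boolean]

  // Safer: surface the null case to the caller instead of hiding it.
  def safeGetBoolean(raw: Any): Option[Boolean] = raw match {
    case b: Boolean => Some(b)
    case null       => None
    case other      => throw new ClassCastException(s"Not a Boolean: $other")
  }

  def main(args: Array[String]): Unit = {
    println(unsafeGetBoolean(null)) // false, although no value was stored
    println(safeGetBoolean(null))   // None
    println(safeGetBoolean(true))   // Some(true)
  }
}
```

With this in mind, a test that iterates over a table containing an all-null row either has to skip that row or assert on nullness first, before comparing the boolean value.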
Closing due to inactivity.
Ok. Let me revisit the tests and code. Much time has passed, but it's worth revisiting. I just tried running the _gym project. Between now and then I had to rebuild my machine (computer), and I was a bit surprised to find the debugger failing because it could not find the HADOOP_HOME environment variable. After all this time I guess it's a bit late to ask, but does parquet4s depend on Hadoop being present and configured on the machine?
Hi!
Great to hear that.
No, there's no need to have Hadoop on your machine (I don't). And I don't have HADOOP_HOME set, either.
Here is my stack dump, along with a screenshot of the debugger stepped into the method that caused the exception:
Stack trace:
"C:\Program Files\Microsoft\jdk-11.0.12.7-hotspot\bin\java.exe" -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:65487,suspend=y,server=n -javaagent:C:\Users\user\AppData\Local\JetBrains\IntelliJIdea2022.2\captureAgent\debugger-agent.jar -Dfile.encoding=UTF-8 -classpath "F:\HL\dev\git\delta-standlone\parquet4s-gym\target\scala-2.13\classes;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\aopalliance\aopalliance\1.0\aopalliance-1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\ch\qos\logback\logback-classic\1.3.0-alpha14\logback-classic-1.3.0-alpha14.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\ch\qos\logback\logback-core\1.3.0-alpha14\logback-core-1.3.0-alpha14.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\beachape\enumeratum-macros_2.13\1.6.1\enumeratum-macros_2.13-1.6.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\beachape\enumeratum_2.13\1.7.0\enumeratum_2.13-1.7.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\chuusai\shapeless_2.13\2.3.7\shapeless_2.13-2.3.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-annotations\2.13.0\jackson-annotations-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-core\2.13.0\jackson-core-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-databind\2.13.0\jackson-databind-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\jaxrs\jackson-jaxrs-base\2.13.0\jackson-jaxrs-base-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\jaxrs\jackson-jaxrs-json-provider\2.13.0\jackson-jaxrs-json-provider-2.13.0.jar;C:\Users\user\
AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\module\jackson-module-jaxb-annotations\2.13.0\jackson-module-jaxb-annotations-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\woodstox\woodstox-core\5.3.0\woodstox-core-5.3.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\luben\zstd-jni\1.4.9-1\zstd-jni-1.4.9-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\mjakubowski84\parquet4s-akka_2.13\2.2.0\parquet4s-akka_2.13-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\mjakubowski84\parquet4s-core_2.13\2.2.0\parquet4s-core_2.13-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\stephenc\jcip\jcip-annotations\1.0-1\jcip-annotations-1.0-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\code\findbugs\jsr305\3.0.2\jsr305-3.0.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\code\gson\gson\2.8.9\gson-2.8.9.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\errorprone\error_prone_annotations\2.2.0\error_prone_annotations-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\failureaccess\1.0\failureaccess-1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\guava\27.0-jre\guava-27.0-jre.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\listenablefuture\9999.0-empty-to-avoid-conflict-with-guava\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\inject\guice\4.0\guice-4.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\j2objc
\j2objc-annotations\1.1\j2objc-annotations-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\protobuf\protobuf-java\2.5.0\protobuf-java-2.5.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\re2j\re2j\1.1\re2j-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\nimbusds\nimbus-jose-jwt\9.8.1\nimbus-jose-jwt-9.8.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\squareup\okhttp\okhttp\2.7.5\okhttp-2.7.5.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\squareup\okio\okio\1.6.0\okio-1.6.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\contribs\jersey-guice\1.19\jersey-guice-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-client\1.19\jersey-client-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-core\1.19\jersey-core-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-json\1.19\jersey-json-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-server\1.19\jersey-server-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-servlet\1.19\jersey-servlet-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\xml\bind\jaxb-impl\2.2.3-1\jaxb-impl-2.2.3-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\thoughtworks\paranamer\paranamer\2.3\paranamer-2.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-actor_2.13\2.6.18\akka-actor_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-protobuf-v3_2.13\
2.6.18\akka-protobuf-v3_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-stream_2.13\2.6.18\akka-stream_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\scala-logging\scala-logging_2.13\3.9.4\scala-logging_2.13-3.9.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\config\1.4.0\config-1.4.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\ssl-config-core_2.13\0.4.2\ssl-config-core_2.13-0.4.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-beanutils\commons-beanutils\1.9.4\commons-beanutils-1.9.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-cli\commons-cli\1.2\commons-cli-1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-codec\commons-codec\1.11\commons-codec-1.11.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-collections\commons-collections\3.2.2\commons-collections-3.2.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-io\commons-io\2.8.0\commons-io-2.8.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-logging\commons-logging\1.2\commons-logging-1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-net\commons-net\3.6\commons-net-3.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-pool\commons-pool\1.6\commons-pool-1.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\dnsjava\dnsjava\2.1.7\dnsjava-2.1.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\jakarta\activation\jakarta.activation-api\1.2.2\jakarta.activation-api-1.2.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.or
g\maven2\jakarta\xml\bind\jakarta.xml.bind-api\2.3.3\jakarta.xml.bind-api-2.3.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\annotation\javax.annotation-api\1.3.2\javax.annotation-api-1.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\inject\javax.inject\1\javax.inject-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\servlet\jsp\jsp-api\2.1\jsp-api-2.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\servlet\javax.servlet-api\3.1.0\javax.servlet-api-3.1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\ws\rs\javax.ws.rs-api\2.1.1\javax.ws.rs-api-2.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\ws\rs\jsr311-api\1.1.1\jsr311-api-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\xml\bind\jaxb-api\2.2.11\jaxb-api-2.2.11.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\log4j\log4j\1.2.17\log4j-1.2.17.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\net\minidev\accessors-smart\2.4.7\accessors-smart-2.4.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\net\minidev\json-smart\2.4.7\json-smart-2.4.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\avro\avro\1.7.7\avro-1.7.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-compress\1.21\commons-compress-1.21.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-configuration2\2.1.1\commons-configuration2-2.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-lang3\3.12.0\commons-lang3-3.12.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\ma
ven2\org\apache\commons\commons-math3\3.1.1\commons-math3-3.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-text\1.4\commons-text-1.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-client\4.2.0\curator-client-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-framework\4.2.0\curator-framework-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-recipes\4.2.0\curator-recipes-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\thirdparty\hadoop-shaded-guava\1.1.1\hadoop-shaded-guava-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\thirdparty\hadoop-shaded-protobuf_3_7\1.1.1\hadoop-shaded-protobuf_3_7-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-annotations\3.3.2\hadoop-annotations-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-auth\3.3.2\hadoop-auth-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-client\3.3.2\hadoop-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-common\3.3.2\hadoop-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-hdfs-client\3.3.2\hadoop-hdfs-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-common\3.3.2\hadoop-mapreduce-client-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-core\3.3.2\hadoop-mapreduce-client-core-3.3.2.jar
;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-jobclient\3.3.2\hadoop-mapreduce-client-jobclient-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-api\3.3.2\hadoop-yarn-api-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-client\3.3.2\hadoop-yarn-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-common\3.3.2\hadoop-yarn-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\httpcomponents\httpclient\4.5.13\httpclient-4.5.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\httpcomponents\httpcore\4.4.13\httpcore-4.4.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-admin\1.0.1\kerb-admin-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-client\1.0.1\kerb-client-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-common\1.0.1\kerb-common-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-core\1.0.1\kerb-core-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-crypto\1.0.1\kerb-crypto-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-identity\1.0.1\kerb-identity-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-server\1.0.1\kerb-server-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-simplekdc\1.0.1\kerb-simplekdc-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\htt
ps\repo1.maven.org\maven2\org\apache\kerby\kerb-util\1.0.1\kerb-util-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-asn1\1.0.1\kerby-asn1-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-config\1.0.1\kerby-config-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-pkix\1.0.1\kerby-pkix-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-util\1.0.1\kerby-util-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-xdr\1.0.1\kerby-xdr-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\token-provider\1.0.1\token-provider-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-column\1.12.2\parquet-column-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-common\1.12.2\parquet-common-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-encoding\1.12.2\parquet-encoding-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-format-structures\1.12.2\parquet-format-structures-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-hadoop\1.12.2\parquet-hadoop-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-jackson\1.12.2\parquet-jackson-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\yetus\audience-annotations\0.12.0\audience-annotations-0.12.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\zookeeper\zoo
keeper-jute\3.5.6\zookeeper-jute-3.5.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\zookeeper\zookeeper\3.5.6\zookeeper-3.5.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\checkerframework\checker-qual\2.5.2\checker-qual-2.5.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-core-asl\1.9.13\jackson-core-asl-1.9.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-jaxrs\1.9.2\jackson-jaxrs-1.9.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-mapper-asl\1.9.13\jackson-mapper-asl-1.9.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-xc\1.9.2\jackson-xc-1.9.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jettison\jettison\1.1\jettison-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\mojo\animal-sniffer-annotations\1.17\animal-sniffer-annotations-1.17.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\woodstox\stax2-api\4.2.1\stax2-api-4.2.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-api\9.4.43.v20210629\websocket-api-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-client\9.4.43.v20210629\websocket-client-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-common\9.4.43.v20210629\websocket-common-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-client\9.4.43.v20210629\jetty-client-9.4.43.v20210629.jar;C:\Users\user\Ap
pData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-http\9.4.43.v20210629\jetty-http-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-io\9.4.43.v20210629\jetty-io-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-security\9.4.43.v20210629\jetty-security-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-servlet\9.4.43.v20210629\jetty-servlet-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-util-ajax\9.4.43.v20210629\jetty-util-ajax-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-util\9.4.43.v20210629\jetty-util-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-webapp\9.4.43.v20210629\jetty-webapp-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-xml\9.4.43.v20210629\jetty-xml-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\jline\jline\3.9.0\jline-3.9.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\ow2\asm\asm\9.1\asm-9.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\reactivestreams\reactive-streams\1.0.3\reactive-streams-1.0.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\modules\scala-collection-compat_2.13\2.6.0\scala-collection-compat_2.13-2.6.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\modules\scala-java8-compat_2.13\1.0.0\scala-java8-compat_2.13-1.0.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\mav
en2\org\scala-lang\modules\scala-parser-combinators_2.13\1.1.2\scala-parser-combinators_2.13-1.1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\scala-library\2.13.8\scala-library-2.13.8.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\scala-reflect\2.13.8\scala-reflect-2.13.8.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\slf4j\slf4j-api\2.0.0-alpha5\slf4j-api-2.0.0-alpha5.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\xerial\snappy\snappy-java\1.1.8.2\snappy-java-1.1.8.2.jar;C:\Program Files\JetBrains\IntelliJ IDEA 2022.2.1\lib\idea_rt.jar" Main
Connected to the target VM, address: '127.0.0.1:65487', transport: 'socket'
06:06:08.412 [main] DEBUG [Main$ Main.scala:89] - Writing... demo.0.parquet
06:19:49.240 [main] WARN [o.a.h.u.Shell Shell.java:692] - Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547)
at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568)
at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591)
at org.apache.hadoop.util.Shell.
This line, LOL, makes a lot of sense. Still, it would be nice to trace and fix, agreed? https://wiki.apache.org/hadoop/WindowsProblems Unless, of course, this is impossible. Admittedly, this is likely a Hadoop issue (in the client, that is). So, thank you very much, Marcin, for providing this great tool to troubleshoot and ferret out kinks like this one.
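For anyone hitting the same message on Windows, a small pre-flight check can confirm what the JVM actually sees before any Parquet code runs. This is a sketch under assumptions: HadoopHomeCheck and its helpers are hypothetical names, and the lookup order (the hadoop.home.dir system property first, then the HADOOP_HOME environment variable, then bin\winutils.exe under that directory) is inferred from the error text above rather than taken from the Hadoop source.

```scala
import java.io.File

object HadoopHomeCheck {
  // Assumed lookup order: JVM property hadoop.home.dir, then env HADOOP_HOME.
  // Blank values are treated as unset.
  def resolveHadoopHome(
      props: Map[String, String],
      env: Map[String, String]
  ): Option[String] = {
    def nonBlank(value: Option[String]): Option[String] =
      value.map(_.trim).filter(_.nonEmpty)
    nonBlank(props.get("hadoop.home.dir")).orElse(nonBlank(env.get("HADOOP_HOME")))
  }

  // Location where winutils.exe is expected under the Hadoop home directory.
  def winutilsPath(hadoopHome: String): File =
    new File(new File(hadoopHome, "bin"), "winutils.exe")

  def main(args: Array[String]): Unit =
    resolveHadoopHome(sys.props.toMap, sys.env) match {
      case None =>
        println("HADOOP_HOME and hadoop.home.dir are unset")
      case Some(dir) =>
        val exe = winutilsPath(dir)
        println(s"Hadoop home: $dir, winutils.exe present: ${exe.exists()}")
    }
}
```

Running it from the same IDE run configuration as the failing program shows whether the variable is missing for that process specifically, which can differ from what a terminal session reports.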
Well, a quick look at \parquet4s\core\src\main\scala\com\github\mjakubowski84\parquet4s\ParquetWriter.scala nets this finding:
In the end, we do need to have Hadoop installed locally. Which is ok.
Unless I miss something really glaring.
Let's see if we can find anything to change this. If my findings stand correct, I find this to be an advantage.
TBH, I haven't been using Windows for many, many years, so it is the first time I have seen such an error :) For sure, you do not need local Hadoop when using a Hadoop client on Mac and Linux.
Thanks for spotting it!