Can I trouble you for a code review to integrate a newer version of parquet4s into delta DSR/DSW?

Open MironAtHome opened this issue 2 years ago • 19 comments

I am not sure how much effort it is, just asking if you would be willing and available to look over a PR. It's currently stuck at 1.2.1, which is just too far back. It causes trouble with deps, and I really don't like shading if I have to use it as a workaround.

MironAtHome avatar Mar 23 '22 01:03 MironAtHome

@MironAtHome Sure thing!

mjakubowski84 avatar Mar 23 '22 07:03 mjakubowski84

Most of the changes were to wrap file paths expressed as strings into the Path imported from the latest version of parquet4s. However, two kinds of errors turned out a bit heavier to mend:

  1. ParquetReader.read in CloseableParquetDataIterator.scala:154
  2. ValueCodec, with the root of the trouble in RowParquetRecordImpl.scala:308 and a few other lines related to ValueCodec

I am not certain what to replace those with directly. If such a substitute is readily available, please suggest it; otherwise, I will put together a few lines to wrap the new classes and expose the same interface.

MironAtHome avatar Mar 23 '22 20:03 MironAtHome

I think that this can be helpful: https://mjakubowski84.github.io/parquet4s/docs/migration/

mjakubowski84 avatar Mar 24 '22 08:03 mjakubowski84

Regarding RowParquetRecordImpl:

  • ValueCodec's implementation is now split into ValueEncoder + ValueDecoder so here probably you want to use ValueDecoder.intDecoder
  • private def customSeqCodec[T](elementCodec: ValueCodec[T])(implicit seems redundant to me. The comment above the function does not sound true to me.
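
To illustrate the encoder/decoder split and the "wrap new classes and expose same interface" idea mentioned earlier in the thread, here is a minimal, self-contained sketch. The traits `Encoder`, `Decoder`, and `Codec` are hypothetical stand-ins invented for this example; they are not the parquet4s `ValueEncoder`/`ValueDecoder` types and do not share their signatures.

```scala
// Hypothetical stand-ins for illustration only; NOT the parquet4s types.
trait Encoder[T] { def encode(value: T): String }
trait Decoder[T] { def decode(raw: String): T }

// A combined codec rebuilt from the two halves, so that call sites written
// against a single codec object can keep the same shape.
trait Codec[T] extends Encoder[T] with Decoder[T]

object Codec {
  // Delegates each direction to the matching half.
  def from[T](enc: Encoder[T], dec: Decoder[T]): Codec[T] = new Codec[T] {
    def encode(value: T): String = enc.encode(value)
    def decode(raw: String): T = dec.decode(raw)
  }
}

object CodecSketch {
  val intEncoder: Encoder[Int] = new Encoder[Int] {
    def encode(value: Int): String = value.toString
  }
  val intDecoder: Decoder[Int] = new Decoder[Int] {
    def decode(raw: String): Int = raw.toInt
  }
  // The adapter exposes both directions through one value again.
  val intCodec: Codec[Int] = Codec.from(intEncoder, intDecoder)
}
```

Whether such an adapter is worth having depends on how many call sites need both directions; where only reading is needed, using the decoder alone is probably simpler.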

mjakubowski84 avatar Mar 24 '22 08:03 mjakubowski84

Great, thank you for your guidance and help. I started from this PR to ensure the build: https://github.com/mjakubowski84/parquet4s/pull/257 The replacement of ValueCodec worked, and I am down to the last 5 errors, all related to the type passed as a generic to reading. If I don't get enough time to finish it today, I will finish over Sat/Sun. And thank you for tending to the PR above.

MironAtHome avatar Mar 24 '22 17:03 MironAtHome

Hey Marcin, the PR is ready: https://github.com/delta-io/connectors/pull/303 Please provide feedback.

MironAtHome avatar Mar 25 '22 19:03 MironAtHome

Two comments from me only. One is a minor code change, but the other worries me. I see that Delta relies on some ancient version of parquet-hadoop. I wonder if that can be upgraded without any issue.

mjakubowski84 avatar Mar 26 '22 11:03 mjakubowski84

Hey Marcin, sorry for the long turnaround; I had to get over a case of COVID. I have run into an issue with boolean type decoding in the connector with version 2.3.0 (the latest available on Maven) of parquet4s. Could you please take a look? I have created a private repo for this unit test: https://github.com/MironAtHome/connectors-private.git, branch miron/integrate-parquet4s-23. Here are the lines where I am getting the assertion failure:

      val b: Boolean = (i % 2 == 0)
      val rowB: Boolean = row.getBoolean("as_boolean")
      println(s"Evaluating row value ${rowB} to ${b} for (${i} % 2 == 0)} as ${(i % 2 == 0)}")
      assert(row.getBoolean("as_boolean") == (i % 2 == 0))

Here is the assertion text, with a few trace rows printed prior:

      [info] DeltaDataReaderSuite:
      Evaluating row value true to true for (4 % 2 == 0)} as true
      Evaluating row value false to false for (5 % 2 == 0)} as false
      Evaluating row value true to true for (6 % 2 == 0)} as true
      Evaluating row value false to false for (7 % 2 == 0)} as false
      Evaluating row value true to true for (8 % 2 == 0)} as true
      Evaluating row value false to false for (9 % 2 == 0)} as false
      Evaluating row value false to true for (0 % 2 == 0)} as true
      [info] - read - primitives *** FAILED ***
      [info] false did not equal true (DeltaDataReaderSuite.scala:87)

Here are the rows from the test table, as read by Spark:

      scala> spark.read.format("delta").load("./golden-tables/src/test/resources/golden/data-reader-primitives").show()
      +------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+
      |as_int|as_long|as_byte|as_short|as_boolean|as_float|as_double|as_string|as_binary|as_big_decimal|
      +------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+
      |     4|      4|      4|       4|      true|     4.0|      4.0|        4|  [04 04]|             4|
      |     5|      5|      5|       5|     false|     5.0|      5.0|        5|  [05 05]|             5|
      |     6|      6|      6|       6|      true|     6.0|      6.0|        6|  [06 06]|             6|
      |     7|      7|      7|       7|     false|     7.0|      7.0|        7|  [07 07]|             7|
      |     8|      8|      8|       8|      true|     8.0|      8.0|        8|  [08 08]|             8|
      |     9|      9|      9|       9|     false|     9.0|      9.0|        9|  [09 09]|             9|
      |  null|   null|   null|    null|      null|    null|     null|     null|     null|          null|
      |     0|      0|      0|       0|      true|     0.0|      0.0|        0|  [00 00]|             0|
      |     1|      1|      1|       1|     false|     1.0|      1.0|        1|  [01 01]|             1|
      |     2|      2|      2|       2|      true|     2.0|      2.0|        2|  [02 02]|             2|
      |     3|      3|      3|       3|     false|     3.0|      3.0|        3|  [03 03]|             3|
      +------+-------+-------+--------+----------+--------+---------+---------+---------+--------------+

As shown above, Spark 3.2.1 with Hadoop 3.3.1 reads the values correctly. However, the unit test fails its assertion because the variable rowB contains "false" when the expected value is "true" for the row where the field "as_int" equals 0.

This is affecting DeltaDataReaderSuite.scala

I would much appreciate your help to verify this issue.

MironAtHome avatar Apr 18 '22 21:04 MironAtHome

Hi @MironAtHome.

I cannot see the code as your repo is private. I recommend having a look at the boolean decoder and debugging what happens there. I do not recall whether the logic there has changed since version 1.0; probably not. And BTW, the latest version of Parquet4s is now 1.4.1 :)

mjakubowski84 avatar Apr 20 '22 14:04 mjakubowski84

Marcin, it is so nice to have your comments. I apologize; I knew my repo's security setup wasn't right for your access, but I thought that inviting you through the issue link might open it up for you. Give me until EOD (it's 9:47 AM in Seattle right now) to get to it, or to ask you for further assistance. I will open access right now.

MironAtHome avatar Apr 20 '22 16:04 MironAtHome

It took me some time to debug those tests: thousands of dependencies and buggy resource loading of golden tables.

So, the issue seems to be with the test data or with the test itself.

Tests fail on NullValue. as_boolean is set to be nullable. In effect here we cast null to Boolean - https://github.com/MironAtHome/connectors-private/blob/miron/integrate-parquet4s-23/standalone/src/main/scala-2.12/io/delta/standalone/internal/data/RowParquetRecordImpl.scala#L167. And null is resolved as false. So if the test expects true then it must fail.
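
The "null is resolved as false" behaviour can be reproduced in plain Scala without any Parquet code. The following is a minimal sketch; the value `raw` is a hypothetical stand-in for a nullable column value and is not taken from the connector code. It also shows one possible way (an assumption, not necessarily what the connector should do) to keep a missing value distinct from `false` by going through `Option`.

```scala
object NullBooleanSketch {
  // Hypothetical stand-in for a nullable column value that turned out to be null.
  val raw: Any = null

  // Casting null to a primitive type yields that type's default value,
  // so a missing boolean silently becomes false.
  val unsafe: Boolean = raw.asInstanceOf[Boolean]

  // Wrapping the raw value in Option first keeps "missing" distinct from "false".
  val safe: Option[Boolean] = Option(raw).map(_.asInstanceOf[Boolean])
}
```

With this in mind, a test that asserts `true` for a row whose as_boolean is null can only pass if the null case is handled (or skipped) explicitly before the comparison.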

mjakubowski84 avatar Apr 21 '22 14:04 mjakubowski84

Closing due to inactivity.

mjakubowski84 avatar Jul 12 '22 15:07 mjakubowski84

OK. Let me revisit the tests and code. Much time has passed, but it's worth revisiting. I just tried running the _gym project. Between now and then I had to rebuild my machine (computer), and I was a bit surprised to find the debugger failing because it could not find the HADOOP_HOME environment variable. After all this time I guess it's a bit late to ask, but does parquet4s depend on Hadoop being present and configured on the machine?

MironAtHome avatar Nov 10 '22 15:11 MironAtHome

Hi! Great to hear that. No, there's no need to have Hadoop on your machine (I don't). And I don't have HADOOP_HOME set, too.

mjakubowski84 avatar Nov 10 '22 15:11 mjakubowski84

Here is my stack dump, first with a screenshot of the debugger stepped into the method that caused the exception: image

MironAtHome avatar Nov 11 '22 14:11 MironAtHome

Stack trace: "C:\Program Files\Microsoft\jdk-11.0.12.7-hotspot\bin\java.exe" -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:65487,suspend=y,server=n -javaagent:C:\Users\user\AppData\Local\JetBrains\IntelliJIdea2022.2\captureAgent\debugger-agent.jar -Dfile.encoding=UTF-8 -classpath "F:\HL\dev\git\delta-standlone\parquet4s-gym\target\scala-2.13\classes;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\aopalliance\aopalliance\1.0\aopalliance-1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\ch\qos\logback\logback-classic\1.3.0-alpha14\logback-classic-1.3.0-alpha14.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\ch\qos\logback\logback-core\1.3.0-alpha14\logback-core-1.3.0-alpha14.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\beachape\enumeratum-macros_2.13\1.6.1\enumeratum-macros_2.13-1.6.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\beachape\enumeratum_2.13\1.7.0\enumeratum_2.13-1.7.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\chuusai\shapeless_2.13\2.3.7\shapeless_2.13-2.3.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-annotations\2.13.0\jackson-annotations-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-core\2.13.0\jackson-core-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-databind\2.13.0\jackson-databind-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\jaxrs\jackson-jaxrs-base\2.13.0\jackson-jaxrs-base-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\jaxrs\jackson-jaxrs-json-provider\2.13.0\jackson-jaxrs-json-provider-2.13.0.jar;C
:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\module\jackson-module-jaxb-annotations\2.13.0\jackson-module-jaxb-annotations-2.13.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\woodstox\woodstox-core\5.3.0\woodstox-core-5.3.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\luben\zstd-jni\1.4.9-1\zstd-jni-1.4.9-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\mjakubowski84\parquet4s-akka_2.13\2.2.0\parquet4s-akka_2.13-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\mjakubowski84\parquet4s-core_2.13\2.2.0\parquet4s-core_2.13-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\github\stephenc\jcip\jcip-annotations\1.0-1\jcip-annotations-1.0-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\code\findbugs\jsr305\3.0.2\jsr305-3.0.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\code\gson\gson\2.8.9\gson-2.8.9.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\errorprone\error_prone_annotations\2.2.0\error_prone_annotations-2.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\failureaccess\1.0\failureaccess-1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\guava\27.0-jre\guava-27.0-jre.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\guava\listenablefuture\9999.0-empty-to-avoid-conflict-with-guava\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\inject\guice\4.0\guice-4.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\
google\j2objc\j2objc-annotations\1.1\j2objc-annotations-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\protobuf\protobuf-java\2.5.0\protobuf-java-2.5.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\google\re2j\re2j\1.1\re2j-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\nimbusds\nimbus-jose-jwt\9.8.1\nimbus-jose-jwt-9.8.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\squareup\okhttp\okhttp\2.7.5\okhttp-2.7.5.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\squareup\okio\okio\1.6.0\okio-1.6.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\contribs\jersey-guice\1.19\jersey-guice-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-client\1.19\jersey-client-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-core\1.19\jersey-core-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-json\1.19\jersey-json-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-server\1.19\jersey-server-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\jersey\jersey-servlet\1.19\jersey-servlet-1.19.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\sun\xml\bind\jaxb-impl\2.2.3-1\jaxb-impl-2.2.3-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\thoughtworks\paranamer\paranamer\2.3\paranamer-2.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-actor_2.13\2.6.18\akka-actor_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-prot
obuf-v3_2.13\2.6.18\akka-protobuf-v3_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\akka\akka-stream_2.13\2.6.18\akka-stream_2.13-2.6.18.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\scala-logging\scala-logging_2.13\3.9.4\scala-logging_2.13-3.9.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\config\1.4.0\config-1.4.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\typesafe\ssl-config-core_2.13\0.4.2\ssl-config-core_2.13-0.4.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-beanutils\commons-beanutils\1.9.4\commons-beanutils-1.9.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-cli\commons-cli\1.2\commons-cli-1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-codec\commons-codec\1.11\commons-codec-1.11.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-collections\commons-collections\3.2.2\commons-collections-3.2.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-io\commons-io\2.8.0\commons-io-2.8.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-logging\commons-logging\1.2\commons-logging-1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-net\commons-net\3.6\commons-net-3.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\commons-pool\commons-pool\1.6\commons-pool-1.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\dnsjava\dnsjava\2.1.7\dnsjava-2.1.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\jakarta\activation\jakarta.activation-api\1.2.2\jakarta.activation-api-1.2.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\r
epo1.maven.org\maven2\jakarta\xml\bind\jakarta.xml.bind-api\2.3.3\jakarta.xml.bind-api-2.3.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\annotation\javax.annotation-api\1.3.2\javax.annotation-api-1.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\inject\javax.inject\1\javax.inject-1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\servlet\jsp\jsp-api\2.1\jsp-api-2.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\servlet\javax.servlet-api\3.1.0\javax.servlet-api-3.1.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\ws\rs\javax.ws.rs-api\2.1.1\javax.ws.rs-api-2.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\ws\rs\jsr311-api\1.1.1\jsr311-api-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\javax\xml\bind\jaxb-api\2.2.11\jaxb-api-2.2.11.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\log4j\log4j\1.2.17\log4j-1.2.17.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\net\minidev\accessors-smart\2.4.7\accessors-smart-2.4.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\net\minidev\json-smart\2.4.7\json-smart-2.4.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\avro\avro\1.7.7\avro-1.7.7.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-compress\1.21\commons-compress-1.21.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-configuration2\2.1.1\commons-configuration2-2.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-lang3\3.12.0\commons-lang3-3.12.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1
.maven.org\maven2\org\apache\commons\commons-math3\3.1.1\commons-math3-3.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\commons\commons-text\1.4\commons-text-1.4.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-client\4.2.0\curator-client-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-framework\4.2.0\curator-framework-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\curator\curator-recipes\4.2.0\curator-recipes-4.2.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\thirdparty\hadoop-shaded-guava\1.1.1\hadoop-shaded-guava-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\thirdparty\hadoop-shaded-protobuf_3_7\1.1.1\hadoop-shaded-protobuf_3_7-1.1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-annotations\3.3.2\hadoop-annotations-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-auth\3.3.2\hadoop-auth-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-client\3.3.2\hadoop-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-common\3.3.2\hadoop-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-hdfs-client\3.3.2\hadoop-hdfs-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-common\3.3.2\hadoop-mapreduce-client-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-core\3.3.2\hadoop-mapreduce-client-c
ore-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-mapreduce-client-jobclient\3.3.2\hadoop-mapreduce-client-jobclient-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-api\3.3.2\hadoop-yarn-api-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-client\3.3.2\hadoop-yarn-client-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\hadoop\hadoop-yarn-common\3.3.2\hadoop-yarn-common-3.3.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\httpcomponents\httpclient\4.5.13\httpclient-4.5.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\httpcomponents\httpcore\4.4.13\httpcore-4.4.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-admin\1.0.1\kerb-admin-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-client\1.0.1\kerb-client-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-common\1.0.1\kerb-common-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-core\1.0.1\kerb-core-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-crypto\1.0.1\kerb-crypto-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-identity\1.0.1\kerb-identity-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-server\1.0.1\kerb-server-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-simplekdc\1.0.1\kerb-simplekdc-1.0.1.jar;C:\Users\user\AppData\Local\Coursier
\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerb-util\1.0.1\kerb-util-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-asn1\1.0.1\kerby-asn1-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-config\1.0.1\kerby-config-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-pkix\1.0.1\kerby-pkix-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-util\1.0.1\kerby-util-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\kerby-xdr\1.0.1\kerby-xdr-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\kerby\token-provider\1.0.1\token-provider-1.0.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-column\1.12.2\parquet-column-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-common\1.12.2\parquet-common-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-encoding\1.12.2\parquet-encoding-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-format-structures\1.12.2\parquet-format-structures-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-hadoop\1.12.2\parquet-hadoop-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\parquet\parquet-jackson\1.12.2\parquet-jackson-1.12.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\yetus\audience-annotations\0.12.0\audience-annotations-0.12.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\
zookeeper\zookeeper-jute\3.5.6\zookeeper-jute-3.5.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\apache\zookeeper\zookeeper\3.5.6\zookeeper-3.5.6.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\checkerframework\checker-qual\2.5.2\checker-qual-2.5.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-core-asl\1.9.13\jackson-core-asl-1.9.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-jaxrs\1.9.2\jackson-jaxrs-1.9.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-mapper-asl\1.9.13\jackson-mapper-asl-1.9.13.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jackson\jackson-xc\1.9.2\jackson-xc-1.9.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\jettison\jettison\1.1\jettison-1.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\mojo\animal-sniffer-annotations\1.17\animal-sniffer-annotations-1.17.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\codehaus\woodstox\stax2-api\4.2.1\stax2-api-4.2.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-api\9.4.43.v20210629\websocket-api-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-client\9.4.43.v20210629\websocket-client-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\websocket\websocket-common\9.4.43.v20210629\websocket-common-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-client\9.4.43.v20210629\jetty-client-9.4.43.v20210629.jar;C:\
Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-http\9.4.43.v20210629\jetty-http-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-io\9.4.43.v20210629\jetty-io-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-security\9.4.43.v20210629\jetty-security-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-servlet\9.4.43.v20210629\jetty-servlet-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-util-ajax\9.4.43.v20210629\jetty-util-ajax-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-util\9.4.43.v20210629\jetty-util-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-webapp\9.4.43.v20210629\jetty-webapp-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\eclipse\jetty\jetty-xml\9.4.43.v20210629\jetty-xml-9.4.43.v20210629.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\jline\jline\3.9.0\jline-3.9.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\ow2\asm\asm\9.1\asm-9.1.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\reactivestreams\reactive-streams\1.0.3\reactive-streams-1.0.3.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\modules\scala-collection-compat_2.13\2.6.0\scala-collection-compat_2.13-2.6.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\modules\scala-java8-compat_2.13\1.0.0\scala-java8-compat_2.13-1.0.0.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.
maven.org\maven2\org\scala-lang\modules\scala-parser-combinators_2.13\1.1.2\scala-parser-combinators_2.13-1.1.2.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\scala-library\2.13.8\scala-library-2.13.8.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\scala-lang\scala-reflect\2.13.8\scala-reflect-2.13.8.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\slf4j\slf4j-api\2.0.0-alpha5\slf4j-api-2.0.0-alpha5.jar;C:\Users\user\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\org\xerial\snappy\snappy-java\1.1.8.2\snappy-java-1.1.8.2.jar;C:\Program Files\JetBrains\IntelliJ IDEA 2022.2.1\lib\idea_rt.jar" Main
Connected to the target VM, address: '127.0.0.1:65487', transport: 'socket'
06:06:08.412 [main] DEBUG [Main$ Main.scala:89] - Writing... demo.0.parquet
06:19:49.240 [main] WARN [o.a.h.u.Shell Shell.java:692] - Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547)
    at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568)
    at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591)
    at org.apache.hadoop.util.Shell.(Shell.java:688)
    at org.apache.hadoop.util.StringUtils.(StringUtils.java:79)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:3741)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:3736)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3520)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:288)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:524)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.parquet.hadoop.util.HadoopOutputFile.fromPath(HadoopOutputFile.java:58)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:655)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$.internalWriter(ParquetWriter.scala:129)
    at com.github.mjakubowski84.parquet4s.SingleFileParquetSink$.com$github$mjakubowski84$parquet4s$SingleFileParquetSink$$apply(SingleFileParquetSink.scala:67)
    at com.github.mjakubowski84.parquet4s.SingleFileParquetSink$BuilderImpl.write(SingleFileParquetSink.scala:57)
    at Main$.write(Main.scala:95)
    at Main$.main(Main.scala:67)
    at Main.main(Main.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
    at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:467)
    at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:438)
    at org.apache.hadoop.util.Shell.(Shell.java:515)
    ... 16 common frames omitted
Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:735)
    at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:270)
    at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:286)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:978)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:324)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.(RawLocalFileSystem.java:294)
    at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:439)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:428)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:459)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:521)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:500)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)
    at org.apache.parquet.hadoop.util.HadoopOutputFile.createOrOverwrite(HadoopOutputFile.java:81)
    at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:327)
    at org.apache.parquet.hadoop.ParquetWriter.(ParquetWriter.java:292)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:658)
    at com.github.mjakubowski84.parquet4s.ParquetWriter$.internalWriter(ParquetWriter.scala:129)
    at com.github.mjakubowski84.parquet4s.SingleFileParquetSink$.com$github$mjakubowski84$parquet4s$SingleFileParquetSink$$apply(SingleFileParquetSink.scala:67)
    at com.github.mjakubowski84.parquet4s.SingleFileParquetSink$BuilderImpl.write(SingleFileParquetSink.scala:57)
    at Main$.write(Main.scala:95)
    at Main$.main(Main.scala:67)
    at Main.main(Main.scala)
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:547)
    at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:568)
    at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:591)
    at org.apache.hadoop.util.Shell.(Shell.java:688)
    at org.apache.hadoop.util.StringUtils.(StringUtils.java:79)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:3741)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:3736)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3520)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:288)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:524)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.parquet.hadoop.util.HadoopOutputFile.fromPath(HadoopOutputFile.java:58)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:655)
    ... 6 more
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
    at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:467)
    at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:438)
    at org.apache.hadoop.util.Shell.(Shell.java:515)
    ... 16 more

MironAtHome avatar Nov 11 '22 14:11 MironAtHome

This line LOL makes a lot of sense. Still, it would be nice to trace and fix, agreed? https://wiki.apache.org/hadoop/WindowsProblems Unless, of course, this is impossible. Admittedly, this is likely a Hadoop issue (in the client, that is). So, a big thank you, Marcin, for providing this great tool to troubleshoot and ferret out kinks like this one.

MironAtHome avatar Nov 11 '22 14:11 MironAtHome

Well, a quick look at \parquet4s\core\src\main\scala\com\github\mjakubowski84\parquet4s\ParquetWriter.scala nets this finding: image In the end we do need to have Hadoop locally, which is OK, unless I am missing something really glaring. Let's see if we can find anything to change this. If my findings stand correct, I find this to be an advantage.
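
A small pre-flight check could surface the missing setting before the writer is built, instead of failing deep inside the Hadoop client. The sketch below is an assumption-laden illustration: the object name `HadoopHomeCheck` and its methods are hypothetical, and the only grounded details are the two names taken from the error message above (`HADOOP_HOME` and `hadoop.home.dir`). The lookup order (system property first, then environment variable) is an assumption about the client's behaviour.

```scala
object HadoopHomeCheck {
  // Returns the configured Hadoop home, if any. The lookup sources are
  // parameters so the logic can be exercised without touching the real
  // environment; by default it reads JVM properties and the process env.
  def hadoopHome(
      props: collection.Map[String, String] = sys.props,
      env: collection.Map[String, String] = sys.env
  ): Option[String] =
    props.get("hadoop.home.dir")
      .orElse(env.get("HADOOP_HOME"))
      .filter(_.nonEmpty)

  // Rough OS check based on the standard "os.name" system property.
  def isWindows(osName: String = sys.props.getOrElse("os.name", "")): Boolean =
    osName.toLowerCase.startsWith("windows")

  // Human-readable warning to log before attempting a write on Windows.
  def warning(osName: String, home: Option[String]): Option[String] =
    if (isWindows(osName) && home.isEmpty)
      Some("HADOOP_HOME and hadoop.home.dir are unset; local writes may fail on Windows")
    else None
}
```

A caller might log `HadoopHomeCheck.warning(sys.props.getOrElse("os.name", ""), HadoopHomeCheck.hadoopHome())` at startup; this only reports the condition and does not change how the Hadoop client resolves winutils.exe.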

MironAtHome avatar Nov 15 '22 17:11 MironAtHome

TBH, I haven't been using Windows for many, many years, so this is the first time I have seen such an error :) For sure, you do not need a local Hadoop when using a Hadoop client on Mac and Linux.

Thanks for spotting it!

mjakubowski84 avatar Nov 16 '22 18:11 mjakubowski84