spark-bigquery-connector: Unnecessary dependency on spark-mllib

Open irajhedayati opened this issue 2 years ago • 1 comment

To support the Spark ML "vector" and "matrix" types, the SupportedCustomDataType enum was added, and it references the spark-mllib library. For code that uses only Spark core and SQL, as in my case, I don't think it should be necessary to add the spark-mllib library.

We could avoid this by adding a helper for the enum, so that the check for whether a field's type is "vector" or "matrix" is done outside the class and the enum is only loaded when it is actually needed.
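A minimal sketch of what such a guard might look like. This is not the connector's actual code: `OptionalMlSupport`, `isClassPresent`, and `customTypeFor` are hypothetical names used only for illustration, and the sketch assumes the ML types report the type names "vector" and "matrix". The idea is to check for the mllib class without initializing anything, and to compare type names as plain strings so that no class referencing spark-mllib is touched when the library is missing.

```java
import java.util.Optional;

// Hypothetical helper, for illustration only (not part of the connector).
final class OptionalMlSupport {
    // The class the stack trace reports as missing when spark-mllib is absent.
    static final String ML_TYPES_CLASS = "org.apache.spark.ml.linalg.SQLDataTypes";

    private OptionalMlSupport() {}

    // Returns true if the named class can be located on the classpath.
    // The class is looked up without being initialized, so no static
    // initializer runs as a side effect of the check.
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, OptionalMlSupport.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Decides by type name alone whether a field needs custom ML handling.
    // Assumption: the ML types use the names "vector" and "matrix".
    // When ML support is unavailable, no field is treated as a custom type.
    static Optional<String> customTypeFor(String sparkTypeName, boolean mlAvailable) {
        if (!mlAvailable) {
            return Optional.empty();
        }
        switch (sparkTypeName) {
            case "vector":
            case "matrix":
                return Optional.of(sparkTypeName);
            default:
                return Optional.empty();
        }
    }
}
```

A caller would compute `OptionalMlSupport.isClassPresent(OptionalMlSupport.ML_TYPES_CLASS)` once and pass the result to `customTypeFor` for each field, only going on to use the enum when the result is non-empty.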

Here is the exception I get when running the save:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/linalg/SQLDataTypes
	at com.google.cloud.spark.bigquery.SupportedCustomDataType.<clinit>(SupportedCustomDataType.java:25)
	at com.google.cloud.spark.bigquery.BigQueryWriteHelper.$anonfun$updateMetadataIfNeeded$1(BigQueryWriteHelper.scala:96)
	at com.google.cloud.spark.bigquery.BigQueryWriteHelper.$anonfun$updateMetadataIfNeeded$1$adapted(BigQueryWriteHelper.scala:95)
	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)

Apr 21 '22 18:04 irajhedayati

Created a PR https://github.com/GoogleCloudDataproc/spark-bigquery-connector/pull/601

Apr 22 '22 04:04 irajhedayati
