
Unnecessary dependency on spark-mllib

Open • irajhedayati opened this issue 2 years ago • 1 comment

To support the Spark ML "vector" and "matrix" types, the SupportedCustomDataType enum was added, and it references the spark-mllib library. My code uses only Spark core and SQL, so I don't think it should be necessary to add spark-mllib to the classpath.

We could avoid this by adding a helper to the enum so that the check for whether a field type is "vector" or "matrix" happens outside the class, before anything that references spark-mllib is loaded.
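A minimal sketch of what such a helper could look like. This is not the connector's code: the class name `OptionalMlSupport` and all of its methods are hypothetical, and it assumes the Spark ML types report the type names "vector" and "matrix" as described above. The idea is to decide by name first, and to touch `SupportedCustomDataType` only when spark-mllib is actually on the classpath.

```java
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the connector. */
final class OptionalMlSupport {
    // Class named in the stack trace; present only when spark-mllib is on the classpath.
    static final String ML_TYPES_CLASS = "org.apache.spark.ml.linalg.SQLDataTypes";

    // Assumption: the Spark ML types are identified by these type names.
    private static final Set<String> ML_TYPE_NAMES = Set.of("vector", "matrix");

    private OptionalMlSupport() {}

    /** Looks the class up without running its static initializer. */
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, OptionalMlSupport.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** True when spark-mllib appears to be available at runtime. */
    static boolean isMlLibAvailable() {
        return isClassPresent(ML_TYPES_CLASS);
    }

    /** Pure name check; does not load any spark-mllib class. */
    static boolean isMlTypeName(String typeName) {
        return typeName != null && ML_TYPE_NAMES.contains(typeName);
    }

    /** Only fields that pass this check would need the mllib-backed enum. */
    static boolean needsMlHandling(String typeName) {
        return isMlTypeName(typeName) && isMlLibAvailable();
    }
}
```

With a guard like this, the write path could skip the enum entirely for schemas that contain only core and SQL types, so the static initializer that references spark-mllib would never run for them.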

Here is the exception I get when running the save:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/linalg/SQLDataTypes
	at com.google.cloud.spark.bigquery.SupportedCustomDataType.<clinit>(SupportedCustomDataType.java:25)
	at com.google.cloud.spark.bigquery.BigQueryWriteHelper.$anonfun$updateMetadataIfNeeded$1(BigQueryWriteHelper.scala:96)
	at com.google.cloud.spark.bigquery.BigQueryWriteHelper.$anonfun$updateMetadataIfNeeded$1$adapted(BigQueryWriteHelper.scala:95)
	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)

irajhedayati, Apr 21 '22 18:04

I created a PR: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/pull/601

irajhedayati, Apr 22 '22 04:04