dist-keras
dist-keras copied to clipboard
Addressing AnalysisException error in workflow example
Without this change, currently getting the error
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-38-3d60eef83573> in <module>()
1 # Only select the columns we need (less data shuffling) while training.
----> 2 dataset = dataset.select("features_normalized", "label_index", "label")
3 print(dataset)
/opt/mapr/spark/spark-2.1.0/python/pyspark/sql/dataframe.py in select(self, *cols)
990 [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
991 """
--> 992 jdf = self._jdf.select(self._jcols(*cols))
993 return DataFrame(jdf, self.sql_ctx)
994
/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/mapr/spark/spark-2.1.0/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u"Reference 'label_out' is ambiguous, could be: label#320, label#324.;"
when running the section
# For example:
# 1. Assume we have a label index: 3
# 2. Output dimensionality: 5
# With these parameters, we obtain the following vector in the DataFrame column: [0,0,0,1,0]
transformer = OneHotTransformer(output_dim=nb_classes, input_col="label_index", output_col="label")
dataset = transformer.transform(dataset)
# Only select the columns we need (less data shuffling) while training.
dataset = dataset.select("features_normalized", "label_index", "label")
Seems to be due to the fact that he raw_dataset
already contains a field called Label
and there seems to be a conflict in differentiating between this field and the label
field that gets created later (as if the pyspark.sql SparkSession looks at both in lowercase form).