pyspark-example-project
Wrong variables in example
https://github.com/AlexIoannides/pyspark-example-project/blob/13d6fb2f5fb45135499dbd1bc3f1bdac5b8451db/tests/test_etl_job.py#L64
You should use data_transformed, not expected_data, for the actual transformation output.
Exactly. self.assertEqual(expected_cols, cols) should compare the length of expected_data.columns with the length of data_transformed.columns, but the current code compares the length of expected_data.columns with itself, as can be seen on lines 53, 64, and 73:
line 53 expected_cols = len(expected_data.columns)
line 64 cols = len(expected_data.columns)
line 73 self.assertEqual(expected_cols, cols)
That makes 3 typos in total; rows and avg_steps also need to be updated. Replace the variable expected_data with data_transformed on lines 64, 65, and 67, as follows:
cols = len(data_transformed.columns)
rows = data_transformed.count()
avg_steps = (
data_transformed
.agg(mean('steps_to_desk').alias('avg_steps_to_desk'))
.collect()[0]
['avg_steps_to_desk'])
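For reference, here is a minimal, self-contained sketch of the corrected comparison pattern. This is not the repository's actual test (which reads parquet fixtures and uses unittest's self.assertEqual); the local SparkSession, the inline sample rows, and the plain assert calls are stand-ins used only to show that the "expected" metrics come from expected_data while the "actual" metrics come from data_transformed.

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean

spark = (
    SparkSession.builder
    .master('local[1]')
    .appName('test_sketch')
    .getOrCreate())

# stand-ins for the fixtures loaded in the real test
expected_data = spark.createDataFrame(
    [('alice', 95.0), ('bob', 105.0)],
    ['name', 'steps_to_desk'])
data_transformed = spark.createDataFrame(
    [('alice', 95.0), ('bob', 105.0)],
    ['name', 'steps_to_desk'])

# metrics computed from the expected output
expected_cols = len(expected_data.columns)
expected_rows = expected_data.count()
expected_avg_steps = (
    expected_data
    .agg(mean('steps_to_desk').alias('avg_steps_to_desk'))
    .collect()[0]
    ['avg_steps_to_desk'])

# metrics computed from the actual transformation output
# (note data_transformed here, not expected_data)
cols = len(data_transformed.columns)
rows = data_transformed.count()
avg_steps = (
    data_transformed
    .agg(mean('steps_to_desk').alias('avg_steps_to_desk'))
    .collect()[0]
    ['avg_steps_to_desk'])

# the comparisons now test the transformation instead of
# comparing expected_data against itself
assert expected_cols == cols
assert expected_rows == rows
assert expected_avg_steps == avg_steps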