pyspark-example-project
Wrong variables in example
https://github.com/AlexIoannides/pyspark-example-project/blob/13d6fb2f5fb45135499dbd1bc3f1bdac5b8451db/tests/test_etl_job.py#L64
You should use data_transformed, not expected_data, for the actual transformation output.
Exactly. self.assertEqual(expected_cols, cols) should compare the length of expected_data.columns with the length of data_transformed.columns, but the current code compares the length of expected_data.columns with itself, as can be seen on lines 53, 64, and 73:
line 53 expected_cols = len(expected_data.columns)
line 64 cols = len(expected_data.columns)
line 73 self.assertEqual(expected_cols, cols)
That makes 3 typos in total; rows and avg_steps also need to be updated. Replace the variable expected_data with data_transformed on lines 64, 65, and 67, as follows:
cols = len(data_transformed.columns)
rows = data_transformed.count()
avg_steps = (
data_transformed
.agg(mean('steps_to_desk').alias('avg_steps_to_desk'))
.collect()[0]
['avg_steps_to_desk'])
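For reference, here is a minimal, self-contained sketch of the corrected comparison pattern. This is not the repository's actual test (which reads parquet fixtures and uses unittest's self.assertEqual); the local SparkSession, the inline sample rows, and the plain assert calls are stand-ins used only to show that the "expected" metrics come from expected_data while the "actual" metrics come from data_transformed.

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean

spark = (
    SparkSession.builder
    .master('local[1]')
    .appName('test_sketch')
    .getOrCreate())

# stand-ins for the fixtures loaded in the real test
expected_data = spark.createDataFrame(
    [('alice', 95.0), ('bob', 105.0)],
    ['name', 'steps_to_desk'])
data_transformed = spark.createDataFrame(
    [('alice', 95.0), ('bob', 105.0)],
    ['name', 'steps_to_desk'])

# metrics computed from the expected output
expected_cols = len(expected_data.columns)
expected_rows = expected_data.count()
expected_avg_steps = (
    expected_data
    .agg(mean('steps_to_desk').alias('avg_steps_to_desk'))
    .collect()[0]
    ['avg_steps_to_desk'])

# metrics computed from the actual transformation output
# (note data_transformed here, not expected_data)
cols = len(data_transformed.columns)
rows = data_transformed.count()
avg_steps = (
    data_transformed
    .agg(mean('steps_to_desk').alias('avg_steps_to_desk'))
    .collect()[0]
    ['avg_steps_to_desk'])

# the comparisons now test the transformation instead of
# comparing expected_data against itself
assert expected_cols == cols
assert expected_rows == rows
assert expected_avg_steps == avg_steps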