
Mismatch between Summary Report and Actual Results


The summary report shows records in the base and compare datasets that did not have corresponding matches:

```
****** Row Summary ******
Number of rows in common: 2971140
Number of rows in base but not compare: 41536
Number of rows in compare but not base: 41536
Number of duplicate rows found in base: 3721
Number of duplicate rows found in compare: 3721
```

but when I try to retrieve the records in base and/or compare that do not have matches, it returns 0:

[screenshot: compare.rows_only_compare returning 0 rows]

I am using the latest Spark version to do the comparison.

Any thoughts/suggestions on what might be the issue? I was expecting to see 41536 records from compare.rows_only_compare.

guptaat Aug 08 '22 15:08

Can you provide some more details on your initialization of the SparkCompare, and also the full commands you are running in the notebook? The image is cut off at cells 18 and 19.

The code being run here is this. It would be nice to know your join_columns and to see a sample of the data, to get a sense of why this might be happening.

fdosani Aug 15 '22 15:08

Thank you for the response and pointer.

Environment variables and initialization for Spark:

```
%env PYSPARK_PYTHON=python
%env PYSPARK_DRIVER_PYTHON_OPTS=notebook
%env PYSPARK_DRIVER_PYTHON=jupyter
%env JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-11.0.14.9-hotspot
```

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.window import Window
from pyspark.sql.functions import when, expr, row_number, rank, col, size, split, concat_ws, upper, to_timestamp

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('datacompy') \
    .config("spark.driver.memory", "40g") \
    .getOrCreate()
```

Regarding the image being cut off in cells 18 and 19: it is nothing except the option to print vertically with truncation turned off:

```python
compare.rows_only_compare.show(n=1, vertical=True, truncate=False)
```

The datasets being compared are joined on six different elements/columns, using join_columns=['fld1', 'fld2', 'fld3', 'fld4', 'fld5', 'fld6'].
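For reference, a minimal sketch of how such a comparison is typically wired up with the legacy datacompy.SparkCompare API (base_df and compare_df are stand-in names; the actual datasets are not shown in this thread):

```python
import datacompy

# base_df and compare_df are assumed to be the two Spark DataFrames under test.
compare = datacompy.SparkCompare(
    spark,
    base_df,
    compare_df,
    join_columns=["fld1", "fld2", "fld3", "fld4", "fld5", "fld6"],
)
compare.report()  # prints the Row Summary quoted at the top of the issue
compare.rows_only_compare.show(n=1, vertical=True, truncate=False)
```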

Datacompy was throwing errors while comparing. Those errors went away after I substituted a default value for any nulls in fld4 and fld6. fld2 through fld5 are strings, and fld6 is a datetime.

But there were discrepancies in the report, as outlined in the original post. Going through the code and manually joining and verifying the counts against datacompy, I noticed fld5 also had nulls.

So technically there were no records without a corresponding entry between the two datasets; only the output/summary report was off.
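A condensed sketch of that manual verification (join columns as above, base_df and compare_df assumed): a plain equality anti-join never matches rows whose keys contain a NULL, which is exactly what inflates the unmatched counts, and a per-column null count is how fld5's nulls surface.

```python
from pyspark.sql.functions import col, count, when

join_cols = ["fld1", "fld2", "fld3", "fld4", "fld5", "fld6"]

# Rows in base with no equality match in compare; NULL join keys never
# satisfy `=`, so this roughly reproduces the inflated 41536 figure.
base_df.join(compare_df, on=join_cols, how="left_anti").count()

# Nulls per join column: this is the check that surfaces fld5's nulls.
base_df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in join_cols]
).show()
```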

After filling the nulls in fld5 with a default value and rerunning datacompy, I get the correct Row Summary:

[screenshot: corrected Row Summary]

SUMMARY

Ensure there are no nulls in the data elements used for joins, as datacompy may either throw errors or report wrong summary counts.
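A hedged sketch of that workaround, assuming the column types described above (strings for fld2 through fld5, a timestamp for fld6); the sentinel values below are arbitrary and must not collide with real data. Note that DataFrame.fillna with string values only touches string columns, so the timestamp column needs coalesce instead:

```python
from pyspark.sql.functions import coalesce, lit, to_timestamp

def fill_join_nulls(df):
    # String join columns get a sentinel token.
    df = df.fillna({"fld4": "<NULL>", "fld5": "<NULL>"})
    # The timestamp column gets a sentinel date via coalesce, since
    # fillna with a string value skips non-string columns.
    return df.withColumn(
        "fld6", coalesce("fld6", to_timestamp(lit("1900-01-01 00:00:00")))
    )

base_df = fill_join_nulls(base_df)
compare_df = fill_join_nulls(compare_df)
```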

Unfortunately I cannot share the data I am working with, but as time permits I will try to put together a sample dataset that reproduces the issue, and will look at how to handle nulls within the join to alleviate it. In the meantime I figured I would share what I found, in the hope that you can get to a fix faster than I can.

Thank you

guptaat Aug 16 '22 18:08

Maybe this is related to https://github.com/capitalone/datacompy/issues/147 and the null-safe compares. It might be worth testing the branch I have here to see if it solves the issue or not: spark-compare-refactor
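For anyone unfamiliar with null-safe equality, here is a tiny standalone demo (made-up values) of the difference that matters in this issue:

```python
from pyspark.sql import Row

df = spark.createDataFrame(
    [Row(a=1, b=1), Row(a=1, b=None), Row(a=None, b=None)]
)
df.select(
    (df.a == df.b).alias("plain_eq"),          # NULL when either side is NULL
    df.a.eqNullSafe(df.b).alias("null_safe"),  # always True or False
).show()
# plain_eq:  True, NULL,  NULL  (NULL keys never satisfy a `=` join)
# null_safe: True, False, True  (NULL == NULL counts as a match)
```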

fdosani Aug 16 '22 18:08

@guptaat just wanted to follow up on this if you had a chance to test out my suggestion.

fdosani Aug 29 '22 12:08

@fdosani Apologies, I got sidetracked by other projects. Hopefully I will pick this back up this week and report back ASAP. Thank you

guptaat Aug 29 '22 15:08

@fdosani I installed the spark-compare-refactor branch using:

```
pip install --upgrade git+https://github.com/capitalone/datacompy@spark-compare-refactor
```

It did not seem to solve the problem, as I got the results below:

[screenshot: same mismatched results as before]

I looked at the code for sparkcompare.py again and didn't notice any difference from the version in the develop branch,

especially the join condition, where the code still seems to be using '=':
[screenshot: join condition in sparkcompare.py using '=']

Have the changes been committed, or am I missing something basic?

guptaat Aug 29 '22 19:08

@guptaat Sorry for the delay. In that branch there are two Spark compares: one is the old way and the other is the refactor. Can you double-check that you are using from datacompy.spark_core import SparkCompare to import and use the new logic?

The new logic uses eqNullSafe, shown here: https://github.com/capitalone/datacompy/blob/59c83daed096145839835cbce906f986dfb9d7a1/datacompy/spark_core.py#L846
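The gist of that pattern, as a condensed sketch rather than the library's exact code (base_df, compare_df, and join_columns assumed to be defined as earlier in the thread):

```python
from functools import reduce
import operator

# One composite join condition in which NULL == NULL counts as a match.
join_condition = reduce(
    operator.and_,
    [base_df[c].eqNullSafe(compare_df[c]) for c in join_columns],
)

# Joining on this condition keeps null-keyed rows paired up instead of
# pushing them into the "rows only in base/compare" buckets.
joined = base_df.join(compare_df, on=join_condition, how="full_outer")
```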

fdosani Sep 01 '22 13:09

@fdosani

Yes, spark_core.py does have eqNullSafe (I was previously looking at sparkcompare.py).

[screenshot: eqNullSafe usage in spark_core.py]

I explicitly added the import statement as you suggested:

[screenshot: from datacompy.spark_core import SparkCompare]

I got the following results: [screenshot: results still showing the discrepancy]

but if I specifically replace the nulls in the datasets, I get the following results, which are what I was expecting:

[screenshot: expected results after replacing nulls]

So the issue still remains. I will continue to tinker and keep you posted on any progress.

Thanks

guptaat Sep 13 '22 18:09

Feel free to reopen if this is still an issue. Closing for now.

fdosani Jun 22 '23 14:06