Mismatch between Summary Report and Actual
The summary report shows there are records in the base and compare datasets that did not have corresponding matches:

****** Row Summary ******
Number of rows in common: 2971140
Number of rows in base but not compare: 41536
Number of rows in compare but not base: 41536
Number of duplicate rows found in base: 3721
Number of duplicate rows found in compare: 3721
But when I try to retrieve the records in base and/or compare that do not have matches, it returns 0.
I am using the latest Spark version to do the comparison.
Any thoughts/suggestions on what the issue might be? I was expecting to see 41536 records for compare.rows_only_compare.
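For reference, a minimal sketch of the sanity check being attempted, assuming compare is the legacy datacompy SparkCompare object from the run above; the counts in the comments reflect what the summary claims versus what actually comes back:

# `compare` is assumed to be the SparkCompare object from the run above.
# The summary report claims 41536 unmatched rows on each side, but the
# accessors come back empty.
print(compare.rows_only_base.count())     # expected 41536, observed 0
print(compare.rows_only_compare.count())  # expected 41536, observed 0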
Can you provide some more details on your initialization of the SparkCompare, and also the full commands you are running in the notebook? The image is cut off in cells 18 and 19.
The code being run here is this. It would be nice to know your join_columns, and to see a sample of the data, to get a sense of why this might be happening.
Thank you for the response and pointer.
Environment variables and initialization for Spark:
%env PYSPARK_PYTHON=python
%env PYSPARK_DRIVER_PYTHON_OPTS=notebook
%env PYSPARK_DRIVER_PYTHON=jupyter
%env JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-11.0.14.9-hotspot
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.window import Window
from pyspark.sql.functions import when, expr,row_number,rank,col, size, split, concat_ws, upper, to_timestamp
spark = SparkSession.builder \
.master('local[*]') \
.appName('datacompy') \
.config("spark.driver.memory", "40g") \
.getOrCreate()
Regarding the image being cut off in cells 18 and 19: it's nothing except the option to print vertically with truncation turned off:
compare.rows_only_compare.show(n=1,vertical=True,truncate=False)
The datasets being compared are joined on six different elements/columns, using:
join_columns=['fld1','fld2','fld3','fld4','fld5','fld6']
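For reference, a minimal sketch of how the compare is presumably wired up with those join columns, assuming the legacy datacompy.SparkCompare API; base_df and compare_df are placeholders for the real datasets:

import datacompy

compare = datacompy.SparkCompare(
    spark,
    base_df,
    compare_df,
    join_columns=['fld1', 'fld2', 'fld3', 'fld4', 'fld5', 'fld6'],
)
compare.report()  # prints the Row Summary shown in the original post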
Datacompy was throwing errors while comparing. Those errors went away after I substituted any null values with a default value for fld4 and fld6. fld2 through fld5 are strings and fld6 is a datetime.
But there were discrepancies in the report, as outlined in the original post. Going through the code and manually joining and verifying the counts against datacompy, I noticed fld5 also had nulls.
So technically there were no records lacking a corresponding entry between the two datasets; only the output/summary report was off.
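To make the failure mode concrete, here is a standalone PySpark sketch (not datacompy code) showing that nulls in a join key never match under plain equality, while a null-safe comparison matches them:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

base = spark.createDataFrame([('a', 1), (None, 2)], ['fld5', 'val'])
comp = spark.createDataFrame([('a', 1), (None, 2)], ['fld5', 'val'])

# Plain equality: NULL = NULL evaluates to NULL, so the null-keyed row
# falls out of the inner join.
print(base.join(comp, base.fld5 == comp.fld5, 'inner').count())           # 1

# Null-safe equality: NULL <=> NULL is true, so both rows match.
print(base.join(comp, base.fld5.eqNullSafe(comp.fld5), 'inner').count())  # 2

Every row dropped this way would be counted once as "in base but not compare" and once as "in compare but not base", which would explain the identical 41536 counts on both sides.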
After filling the nulls in fld5 with a default value and rerunning datacompy, I get the correct Row Summary.
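A minimal sketch of that workaround, reusing the f alias from the setup above; the sentinel values here are made up and must not collide with real data:

defaults = {'fld4': 'MISSING', 'fld5': 'MISSING'}
base_df = base_df.fillna(defaults)
compare_df = compare_df.fillna(defaults)

# fillna does not accept timestamp values, so backfill the datetime key
# (fld6) with an explicit literal instead:
sentinel_ts = f.to_timestamp(f.lit('1900-01-01'))
base_df = base_df.withColumn('fld6', f.coalesce('fld6', sentinel_ts))
compare_df = compare_df.withColumn('fld6', f.coalesce('fld6', sentinel_ts))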
SUMMARY
Ensure there are no nulls in the data elements used for joins, as datacompy may either throw errors or report incorrect summary counts.
Unfortunately I cannot share the data I am working with, but as time permits I will try to put together a sample dataset that reproduces the issue, and also look at how to handle nulls within the join to alleviate it. In the meantime, I figured I would share what I found, and hopefully you can get a fix in faster than I can.
Thank you
Maybe this is related to https://github.com/capitalone/datacompy/issues/147 and the null-safe compares. It might be worth testing the branch I have here to see if it solves the issue or not: spark-compare-refactor
@guptaat Just wanted to follow up to see if you had a chance to test out my suggestion.
@fdosani Apologies, I got sidetracked by other projects. I will hopefully pick this back up this week and report back ASAP. Thank you.
@fdosani I installed the spark-compare-refactor branch using:
pip install --upgrade git+https://github.com/capitalone/datacompy@spark-compare-refactor
It did not seem to solve the problem; I got the results below.
I looked at the code again for sparkcompare.py and didn't notice it being different from the version in the develop branch, especially the join condition, where the code still seems to be using '='.
Have the changes been committed, or am I missing something basic?
@guptaat Sorry for the delay. In that branch there are 2 spark compares: one is the old way and the other is the refactor. Can you double check you are using from datacompy.spark_core import SparkCompare to import and use the new logic?
The new logic uses eqNullSafe, shown here: https://github.com/capitalone/datacompy/blob/59c83daed096145839835cbce906f986dfb9d7a1/datacompy/spark_core.py#L846
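In isolation, the pattern looks roughly like this (a sketch, not the exact code in spark_core.py; base_df and compare_df are placeholders):

from functools import reduce

join_columns = ['fld1', 'fld2', 'fld3', 'fld4', 'fld5', 'fld6']

# AND together a null-safe equality per join column instead of plain ==,
# so rows with null keys still pair up across the two frames.
cond = reduce(
    lambda acc, c: acc & base_df[c].eqNullSafe(compare_df[c]),
    join_columns[1:],
    base_df[join_columns[0]].eqNullSafe(compare_df[join_columns[0]]),
)
joined = base_df.join(compare_df, cond, 'full_outer')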
@fdosani Yes, spark_core.py does have eqNullSafe (I was previously looking at sparkcompare.py).
I explicitly added the import statement as you suggested and got the following results.
But if I specifically replace the nulls in the datasets first, I get the following results, which is what I was expecting.
So the issue still remains. I will continue to tinker and keep you posted on any progress.
Thanks
Feel free to reopen if this is still an issue. Closing for now.