[Feature Request] Add real-time console report output option to compare.matches() API
What do you think about the following feature? I can assist with the design and lead the implementation, but I would need some guidance, especially with the Snowflake/Fugue backends.
Description
In datacompy's Compare interface, the .matches() method invokes an action to compare two dataframes. Currently, users call compare.matches() to obtain the Boolean match status and then separately invoke compare.report() to access the generated, human-readable summary. It may be beneficial to allow the report information to be streamed. For example, we could introduce a compare.matches(verbose=True) parameter that prints relevant parts of the report directly to the console as soon as the information becomes available.
My review so far covers mainly SparkSQLCompare. The addition seems feasible there, since the Spark execution graph is already broken down.
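To make the idea concrete, here is a minimal pure-Python sketch of the proposed behavior. It is not datacompy code: the class `VerboseCompareSketch`, the `emit` parameter, and the list-of-dicts input are all hypothetical stand-ins, and only the `verbose=True` flag and the first two report labels come from the proposal above. Each report line is emitted as soon as its step finishes, rather than after the whole comparison.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class VerboseCompareSketch:
    """Hypothetical stand-in for a Compare class; NOT the datacompy API."""

    def __init__(self, df1: List[Row], df2: List[Row], join_column: str) -> None:
        self.df1 = df1
        self.df2 = df2
        self.join_column = join_column

    def matches(self, verbose: bool = False,
                emit: Callable[[str], None] = print) -> bool:
        def report(line: str) -> None:
            # Emit a report line immediately, but only in verbose mode.
            if verbose:
                emit(line)

        keys1 = [row[self.join_column] for row in self.df1]
        keys2 = [row[self.join_column] for row in self.df2]
        set1, set2 = set(keys1), set(keys2)

        # Step 1: duplicate check on the join key (cheap, so reported first).
        has_dupes = len(set1) < len(keys1) or len(set2) < len(keys2)
        report(f"Any duplicates on match values: {'Yes' if has_dupes else 'No'}")

        # Step 2: rows present on one side only.
        only_1 = sum(1 for k in keys1 if k not in set2)
        only_2 = sum(1 for k in keys2 if k not in set1)
        report(f"Number of rows in df1 but not in df2: {only_1}")
        report(f"Number of rows in df2 but not in df1: {only_2}")

        # Step 3: value comparison on shared keys (the expensive part in Spark).
        # With duplicate keys, the last row per key wins in this sketch.
        rows1 = {row[self.join_column]: row for row in self.df1}
        rows2 = {row[self.join_column]: row for row in self.df2}
        unequal = sum(1 for k in set1 & set2 if rows1[k] != rows2[k])
        report(f"Number of rows with some compared columns unequal: {unequal}")

        return only_1 == 0 and only_2 == 0 and unequal == 0


df1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 2, "v": "b"}]
df2 = [{"id": 1, "v": "a"}, {"id": 3, "v": "c"}]
compare = VerboseCompareSketch(df1, df2, join_column="id")
compare.matches(verbose=True)  # prints each line as its step completes
```

In a Spark-backed class, each `report` call would sit right after the action that computes the corresponding figure, so a user watching the console could cancel the job before the later, more expensive steps run. Taking `emit` as a parameter is just one way to keep the output testable; routing through `logging` would be another option.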
Motivation
- This feature would improve the user experience, especially in interactive or debugging scenarios where instant feedback in the terminal is valuable. For example, an engineer could begin investigating issues as soon as they are reported. In Spark, comparing large dataframes might take hours and can consume significant DBUs (cost). Early report messages often provide actionable information: for example, "Any duplicates on match values: Yes" motivates a duplicates investigation, and "Number of rows in df1 but not in df2" highlights missing data.
- Cost savings: if a critical data issue is detected early, the job can be terminated immediately.
- This approach aligns with expectations from similar Python tools that allow direct output toggling via a parameter.
It also enhances code interactivity and readability for quick checks and educational contexts (e.g., in Jupyter Notebooks and scripts).
@WiktorHawrylik I'm open to the idea. Just so I can understand a bit more: would you happen to have a sense of what the implementation would look like here?
@fdosani cool - let me draft a PR