Task: Refactor data diffing to run on input data for each service

Open cfreedman opened this issue 8 months ago • 0 comments

Currently the data-diffing works by reading the two most recent tables of a dataset from Postgres (by default the complete output table from the pipeline, all_properties_end) and doing a granular check of all the changes on a column-by-column basis.

We should change it so that it is incorporated at each step in the pipeline: the source data for a single service is passed into the data-diffing and compared (if possible, given availability in the cache) with the corresponding data from the previous run. This should be done off of the pipeline-integration branch, which reads from the latest cached geoparquet file rather than from Postgres.
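A rough sketch of what the per-service hook could look like. Everything here is hypothetical: `diff_service_input`, the `compare` callback, and the plain dict standing in for the cache are illustrative names, not existing code on the pipeline-integration branch, and the real cache would be the geoparquet files rather than an in-memory dict.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for one service's source data: column name -> values.
Columns = Dict[str, List[float]]


def diff_service_input(
    service: str,
    current: Columns,
    cache: Dict[str, Columns],
    compare: Callable[[Columns, Columns], dict],
) -> Optional[dict]:
    """Compare a service's input data with the previous run's data, if cached.

    Returns None when no previous run is available, otherwise whatever
    the supplied compare function reports.
    """
    previous = cache.get(service)
    # Store the current input so the next run has something to compare against.
    cache[service] = current
    if previous is None:
        return None
    return compare(previous, current)
```

Each service step in the pipeline would call this with its own source data, so a missing cache entry just skips the diff for that service instead of failing the run.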

Additionally, we should move to checking distributional changes, i.e. summary statistics (mean, median, standard deviation, range) on a column-by-column basis, rather than a full diff of all the changes.
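A minimal sketch of the distributional check, using only the standard library so it runs on its own. The function names, the 5% relative tolerance, and the column names in the usage example are assumptions for illustration; in the pipeline the columns would presumably come from the cached geoparquet data rather than plain lists.

```python
import statistics
from typing import Dict, List, Tuple


def summarize(values: List[float]) -> Dict[str, float]:
    """Summary statistics for one numeric column (expects at least one value)."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Sample standard deviation needs two points; report 0.0 otherwise.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "range": max(values) - min(values),
    }


def diff_summaries(
    previous: Dict[str, List[float]],
    current: Dict[str, List[float]],
    rel_tol: float = 0.05,  # assumed threshold, not a project setting
) -> Dict[str, Dict[str, Tuple[float, float]]]:
    """Return {column: {stat: (old, new)}} for stats that moved more than rel_tol."""
    flagged = {}
    # Only columns present in both runs can be compared.
    for column in previous.keys() & current.keys():
        old, new = summarize(previous[column]), summarize(current[column])
        changed = {}
        for stat, old_value in old.items():
            # Relative change; fall back to absolute change when the old value is 0.
            denom = abs(old_value) or 1.0
            if abs(new[stat] - old_value) / denom > rel_tol:
                changed[stat] = (old_value, new[stat])
        if changed:
            flagged[column] = changed
    return flagged
```

For example, with hypothetical columns `market_value` and `lot_size`:

```python
previous = {"market_value": [100, 200, 300], "lot_size": [1, 2, 3]}
current = {"market_value": [100, 200, 900], "lot_size": [1, 2, 3]}
report = diff_summaries(previous, current)
# report flags market_value (mean, std, range moved) and leaves lot_size out.
```

This reports only which statistics shifted per column, which is much smaller than a row-level diff and should be easier to review per service.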

May 14 '25 11:05 cfreedman