Fuse the functionality used in both `_merge_histogram` and the newly created `_assimilate_histogram`

Open ksneab7 opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe. The goal is a clear paradigm with one easy-to-understand path for each of the following profiling tasks: Updating, Getting, and Merging. This issue focuses on the merging path: defining how to merge a profile (or parts of a profile) through a single function path.

The problem this issue addresses is that both _merge_histogram and the newly created _assimilate_histogram are in use, along with other merging processes within DataProfiler that duplicate functionality or have overlapping goals for input and output.

An example of a fix for achieving this paradigm is as follows: with the creation of _assimilate_histogram, we have a much better way to combine the information from two histograms, and we should be able to use that function throughout the code while still providing the functionality previously expected of _merge_histogram. The old way of doing this can be seen in numerical_column_stats.py on line 1286. It recreates the histogram data, which is more memory intensive than the approach taken in _assimilate_histogram.
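To make the memory argument concrete, here is a minimal sketch of combining two built histograms by redistributing bin counts, without regenerating any underlying data points. It is not the DataProfiler implementation: the function names `redistribute_counts` and `merge_built_histograms`, the proportional-overlap rule, and the choice of evenly spaced output bins are all assumptions made for illustration.

```python
import numpy as np


def redistribute_counts(src_counts, src_edges, dest_edges):
    """Spread each source bin's count over the destination bins it overlaps.

    Hypothetical helper: counts are split in proportion to the overlap width.
    Any part of a source bin lying outside dest_edges is dropped, so callers
    should pass destination edges that cover the source range.
    """
    src_counts = np.asarray(src_counts, dtype=float)
    src_edges = np.asarray(src_edges, dtype=float)
    dest_edges = np.asarray(dest_edges, dtype=float)
    dest_counts = np.zeros(len(dest_edges) - 1)

    for i, count in enumerate(src_counts):
        lo, hi = src_edges[i], src_edges[i + 1]
        width = hi - lo
        if count == 0:
            continue
        if width == 0:
            # Zero-width bin: assign the whole count to the containing bin.
            j = np.searchsorted(dest_edges, lo, side="right") - 1
            j = min(max(j, 0), len(dest_counts) - 1)
            dest_counts[j] += count
            continue
        # Overlap length between [lo, hi] and every destination bin.
        overlap = np.minimum(hi, dest_edges[1:]) - np.maximum(lo, dest_edges[:-1])
        overlap = np.clip(overlap, 0.0, None)
        dest_counts += count * overlap / width

    return dest_counts


def merge_built_histograms(counts_a, edges_a, counts_b, edges_b, n_bins):
    """Merge two built histograms onto shared, evenly spaced bin edges."""
    lo = min(edges_a[0], edges_b[0])
    hi = max(edges_a[-1], edges_b[-1])
    new_edges = np.linspace(lo, hi, n_bins + 1)
    new_counts = (
        redistribute_counts(counts_a, edges_a, new_edges)
        + redistribute_counts(counts_b, edges_b, new_edges)
    )
    return new_counts, new_edges


merged_counts, merged_edges = merge_built_histograms(
    [2, 2], [0, 1, 2], [4, 4], [2, 3, 4], n_bins=2
)
```

Only the bin counts and edges are touched, so the memory cost scales with the number of bins rather than with the number of original samples.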

Describe the outcome you'd like: I would like a single path for merging profiles and their information that covers every use case currently served by the existing functions.

Additional context: For details on _assimilate_histogram, see the PR that implements the more memory-optimized solution: https://github.com/capitalone/DataProfiler/pull/815

ksneab7 avatar May 23 '23 17:05 ksneab7

Summary of the new paradigm for histograms.

merge: (built histogram) + (built histogram)

update: (new data -> get histogram on new data -> built histogram) + (existing built histogram)

JGSweets avatar May 23 '23 17:05 JGSweets

All calculations should have a get, update, and merge.

Where:

get -> calculates the result from raw data.

merge -> takes two existing calculations and merges them.

update -> takes in new data to add; equivalent to get + merge.

JGSweets avatar May 23 '23 17:05 JGSweets
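The get/update/merge contract described in the comments above can be sketched on a deliberately simple statistic. This is an illustration only: `MeanStat` and its fields are hypothetical names, not part of DataProfiler, and a running mean stands in for the histogram so the example stays short.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MeanStat:
    """Hypothetical statistic that keeps just enough state to be mergeable."""

    count: int = 0
    total: float = 0.0

    @classmethod
    def get(cls, data: Iterable[float]) -> "MeanStat":
        # get: calculate the statistic from raw data only.
        values = list(data)
        return cls(count=len(values), total=float(sum(values)))

    def merge(self, other: "MeanStat") -> "MeanStat":
        # merge: combine two existing calculations; no raw data involved.
        return MeanStat(self.count + other.count, self.total + other.total)

    def update(self, data: Iterable[float]) -> "MeanStat":
        # update: get on the new data, then merge with the existing state.
        return self.merge(MeanStat.get(data))

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


existing = MeanStat.get([1, 2, 3])
updated = existing.update([4, 5])
```

Because update is defined purely as get followed by merge, there is only one code path that combines state, which is the single merging path the issue asks for.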