Add argument to Profiler for samples ratio

Open: carlsonp opened this issue 1 year ago • 1 comment

Today, there seem to be two settings for adjusting the sample size: samples_per_update and min_true_samples. If I want to profile the whole file, I can load it via Pandas and use the number of rows. For example:

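import pandas as pd
from dataprofiler import Data, Profiler
pandas_df = pd.read_parquet("myfile.parquet")  # loaded only to get the row count
profile = Profiler(Data("myfile.parquet"), samples_per_update=pandas_df.shape[0])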

I was thinking it would be nice to add a flag like samples_ratio: a value between 0 and 1 giving the fraction of the data you want to sample. That way you wouldn't have to load the data twice. You could just say you want a fraction X of the rows loaded as samples, and the profiler would go from there.
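To make the request concrete, here is a minimal sketch of the conversion such a flag would need to do internally. Both the `samples_ratio` name and the `samples_from_ratio` helper are hypothetical; neither exists in DataProfiler today, and the rounding rule is just one reasonable choice.

```python
import math


def samples_from_ratio(n_rows: int, samples_ratio: float) -> int:
    """Convert a ratio in (0, 1] into a row count (hypothetical helper)."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if not 0 < samples_ratio <= 1:
        raise ValueError("samples_ratio must be in the interval (0, 1]")
    # Round up so a small ratio on a non-empty dataset still yields
    # at least one row, and never exceed the number of rows available.
    return min(n_rows, math.ceil(n_rows * samples_ratio))
```

The result could be passed as samples_per_update, but the row count still has to come from somewhere, which is exactly the second load I'd like to avoid.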

carlsonp, Feb 12 '24 22:02

Hey @carlsonp! Thanks for opening the issue and the idea presented.

This makes a ton of sense and I think fits perfectly as a feature into the DataReaders class (as documented here).

There are two features, although not percentages, that exist for CSV and Parquet:

I think percentage sampling would be a nice addition to the readers: read in a sample of the desired size, then pass the pre-sampled data to the profiler.
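As a rough illustration of what reader-side sampling could look like, the sketch below keeps each row with a fixed probability in a single pass, so the total row count is never needed up front. The `sample_rows` helper is hypothetical and not part of DataProfiler; a real reader would apply the same idea to file rows or row groups.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def sample_rows(
    rows: Iterable[T], ratio: float, seed: Optional[int] = None
) -> Iterator[T]:
    """Keep each row with probability `ratio` in one pass (hypothetical helper)."""
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so ratio=1 keeps every row and ratio=0 keeps none.
    return (row for row in rows if rng.random() < ratio)


# Example: stream rows once and keep roughly 10% of them.
kept = list(sample_rows(range(1000), 0.1, seed=0))
```

Note that this yields approximately, not exactly, the requested fraction. An exact sample size would need either the row count in advance or a reservoir-style approach.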

taylorfturner, Feb 13 '24 11:02