DataProfiler
Add argument to Profiler for samples ratio
Today, there seem to be two settings for adjusting the sample size: samples_per_update and min_true_samples. If I want to profile the whole file, I can load it via Pandas and get the number of rows. For example:
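import pandas as pd
from dataprofiler import Data, Profiler
pandas_df = pd.read_parquet("myfile.parquet")  # loaded only to count the rows
profile = Profiler(Data("myfile.parquet"), samples_per_update=pandas_df.shape[0])  # assumes `data` was the same file read via Data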
I was just thinking it would be nice to add an additional flag like samples_ratio, which would be a value between 0 and 1 denoting the fraction of the data that you want to sample. This would mean you wouldn't have to essentially load the data in twice: you could just say "I want X percent loaded in as samples" and it would go from there.
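To make the request concrete, here is a minimal sketch of how a ratio could be turned into a row count. The helper name samples_from_ratio and the samples_ratio parameter are hypothetical and not part of DataProfiler today. Restricting the ratio to (0, 1] and rounding up are assumptions made for this example.

```python
import math


def samples_from_ratio(n_rows: int, samples_ratio: float) -> int:
    """Convert a hypothetical samples_ratio in (0, 1] into a number of rows."""
    if not 0 < samples_ratio <= 1:
        raise ValueError("samples_ratio must be in the interval (0, 1]")
    if n_rows <= 0:
        return 0
    # Round up so a small ratio on a small dataset still yields at least one row,
    # and cap at n_rows so the result never exceeds the data size.
    return min(n_rows, max(1, math.ceil(n_rows * samples_ratio)))
```

With something like this, the result could be passed as samples_per_update, although the row count still has to come from somewhere, which is the part a built-in option would save.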
Hey @carlsonp! Thanks for opening the issue and the idea presented.
This makes a ton of sense and I think fits perfectly as a feature into the DataReaders class (as documented here).
There are two features, although not percentages, that exist for CSV and Parquet:
I think something like percentage sampling would be a nice addition to the readers: read in the data sampled as desired, then pass the pre-sampled data to the profiler.
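As a rough illustration of reader-side sampling, the sketch below keeps each CSV row with probability samples_ratio in a single pass, so the file is never read twice and the row count is never needed up front. It uses only the Python standard library. The function read_csv_sampled and its parameters are hypothetical and are not the DataReaders API. The sample size is only approximately samples_ratio times the row count, since each row is kept independently.

```python
import csv
import io
import random


def read_csv_sampled(file_obj, samples_ratio, seed=None):
    """Read a CSV stream once, keeping each data row with probability samples_ratio."""
    if not 0 < samples_ratio <= 1:
        raise ValueError("samples_ratio must be in the interval (0, 1]")
    rng = random.Random(seed)  # seeded generator makes the sample reproducible
    reader = csv.reader(file_obj)
    header = next(reader, None)
    if header is None:
        return [], []  # empty input: no header, no rows
    # rng.random() is in [0, 1), so a ratio of 1.0 keeps every row.
    rows = [row for row in reader if rng.random() < samples_ratio]
    return header, rows


# Example with in-memory data standing in for a file on disk.
text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(100))
header, rows = read_csv_sampled(io.StringIO(text), samples_ratio=0.3, seed=0)
```

The sampled rows could then be wrapped in a DataFrame and handed to the profiler, which matches the "pre-sampled data" flow described above.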