Column profiled as int but should be text/string

Open carlsonp opened this issue 4 months ago • 1 comment

General Information:

  • OS: Ubuntu 22.04
  • Python version: 3.10.12
  • Library version: 0.10.9

Describe the bug:

I have a parquet file whose column org_number should be treated as text, but it is being profiled as an int.

Pandas info() reports its dtype as object:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26943 entries, 0 to 26942
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
<snip>
 2   org_number         26943 non-null  object
<snip>

When I call Pandas describe() on it, no numeric statistics (min, max, std, etc.) are shown, which is correct.
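A minimal sketch of that check, assuming a small stand-in DataFrame built from the sample values shown in the profiler output below (the real file is not available, so the frame here is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the real column, using the sample values from the report.
df = pd.DataFrame({"org_number": ["01321", "07618", "08257", "02321", "09123"]})

# pandas does not consider the column numeric ...
is_numeric = pd.api.types.is_numeric_dtype(df["org_number"])
print(is_numeric)

# ... so describe() reports count/unique/top/freq rather than min/max/std.
summary = df["org_number"].describe()
print(list(summary.index))
```

This only illustrates how pandas treats such values; it says nothing about how the profiler infers types.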

The output from the profiler:

{
            "column_name": "org_number",
            "data_type": "int",
            "categorical": false,
            "order": "random",
            "samples": "['01321', '07618', '08257', '02321', '09123']",
            "statistics": {
                "min": 1.0,
                "max": 105121.0,
                "mode": "[6781.24]",
                "median": 6573.749,
                "sum": 220034705.0,
                "mean": 8166.6743,
                "variance": 150256092.2856,
                "stddev": 12257.8992,
                "skewness": 5.3242,
                "kurtosis": 30.6063,
                "histogram": {
                    "bin_edges": "[  1.        , 363.48275862, ... , 104758.51724138, 105121.        ]",
                    "bin_counts": "[ 259.,  539.,  126., 1006., 2057., ... , 0., 0., 0., 0., 7.]"
                },
                "quantiles": {
                    "0": 3350.0226,
                    "1": 6573.749,
                    "2": 8726.115
                },
                "median_abs_deviation": 2195.6598,
                "num_zeros": 0,
                "num_negatives": 0,
                "times": {
                    "min": 0.0001,
                    "max": 0.0001,
                    "sum": 0.0001,
                    "variance": 0.0002,
                    "skewness": 0.0046,
                    "kurtosis": 0.0046,
                    "histogram_and_quantiles": 0.0042,
                    "num_zeros": 0.0002,
                    "num_negatives": 0.0001
                },
                "unique_count": 1367,
                "unique_ratio": 0.0507,
                "sample_size": 26943,
                "null_count": 0,
                "null_types": [],
                "null_types_index": {},
                "data_type_representation": {
                    "datetime": 0.0,
                    "int": 1.0,
                    "float": 1.0,
                    "string": 1.0
                }
            }
        },

To Reproduce:

The code I'm using:

import json

import dataprofiler as dp
import pandas as pd

# `filename` is the path to the parquet file (not included in this report).
data = dp.Data(filename)
profile_options = dp.ProfilerOptions()

# Show how pandas itself types the columns; info() prints directly.
df = pd.read_parquet(filename)
df.info()

profile_options.set({
    "structured_options.data_labeler.is_enabled": False,
    "unstructured_options.data_labeler.is_enabled": False,
    "structured_options.correlation.is_enabled": False,
    "structured_options.multiprocess.is_enabled": True,
    "structured_options.chi2_homogeneity.is_enabled": False,
    "structured_options.category.max_sample_size_to_check_stop_condition": 1,
    "structured_options.category.stop_condition_unique_value_ratio": 0.001,
    "structured_options.sampling_ratio": 1.0,
    "structured_options.null_replication_metrics.is_enabled": False
})

profile = dp.Profiler(data, options=profile_options)
human_readable_report = profile.report(report_options={"output_format": "pretty"})

# Write the report as indented JSON.
with open("reportfile.json", "w") as outfile:
    json.dump(human_readable_report, outfile, indent=4)

I can't provide the raw data, but I can test things. The data is interesting in that almost every entry looks like an integer, but many of the entries have leading zeros, as you can see in the samples.
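To illustrate why these values are ambiguous, here is a small sketch in plain Python using the sample values from the report. The helper names are hypothetical and this is not DataProfiler's detection logic; it only shows that the strings parse as integers yet do not survive a round trip through int, because the leading zeros are dropped.

```python
def looks_like_int(values):
    """Return True if every string parses as a base-10 integer."""
    try:
        for v in values:
            int(v)
    except ValueError:
        return False
    return True


def survives_int_round_trip(values):
    """Return True only if str(int(v)) reproduces every original string."""
    return all(str(int(v)) == v for v in values)


# Sample values copied from the profiler report above.
samples = ["01321", "07618", "08257", "02321", "09123"]

print(looks_like_int(samples))           # whether every value parses as an integer
print(survives_int_round_trip(samples))  # whether the conversion preserves the text
```

A check along these lines (parses as int, but the round trip changes the text) is one possible signal that a column holds identifiers rather than numbers.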

Expected behavior:

I would expect the type to be string/text.

Screenshots:

Additional context:

carlsonp, Apr 03 '24 20:04