Floats and integers treated as text?
When profiling an integer column from a pandas DataFrame, the column appears to go through the text profiler. See, for instance, the sample values in the result below (they are strings), or the fact that the column goes through label detection to determine that these are indeed integers (which seems unnecessary and a bit wasteful, especially as it implies casting the values to strings).
Is there a way to apply only numerical profiling to numerical columns?
Here is an extract from the result:
"data_stats": [
    {
        "column_name": "nb_orders",
        "data_type": "int",
        "data_label": "INTEGER",
        "categorical": true,
        "samples": ["17", "9", "1", "4", "2"],
        "statistics": {
            "data_label_representation": {
                "INTEGER": 0.998,
                "FLOAT": 0.0,
                "QUANTITY": 0.0,
                "ORDINAL": 0.002
            }
        }
    }
]
In a sense, this issue mirrors the one I just submitted (https://github.com/capitalone/DataProfiler/issues/409): there I had strings treated as numerical values, here I have numerical values treated as strings.
@ian-contiamo label detection is intended to be applied to numerical columns as well. We do this because some labels consist only of digits, e.g. an SSN can be written without the hyphens that break it apart, as in XXXXXXXXX.
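To illustrate why an integer column can still carry a label, here is a minimal sketch using only the standard library. The nine-digit regex and the helper name `looks_like_unhyphenated_ssn` are assumptions made for illustration; this is not how DataProfiler's labeler detects labels, only a way to show that the check needs the string form of the value.

```python
import re

# Hypothetical stand-in for an SSN-style pattern: exactly nine digits, no hyphens.
NINE_DIGITS = re.compile(r"\d{9}")


def looks_like_unhyphenated_ssn(value: int) -> bool:
    """Return True if the integer, cast to a string, is exactly nine digits."""
    return NINE_DIGITS.fullmatch(str(value)) is not None


# Two small counts from the issue's samples plus one nine-digit integer.
values = [17, 9, 123456789]
flags = [looks_like_unhyphenated_ssn(v) for v in values]
```

Note that the integer has to be cast to a string before the pattern can be applied, which is the cast the original question pointed out.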
You are correct in your assumption that the data is sent to all profilers simultaneously. We aren't using pandas preset values for evaluation, but that is something we could consider. Instead, we use multiprocessing to send the data to all profilers simultaneously, as we assume no knowledge of the data at input.
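If "pandas preset values" is read as the column dtypes, the idea under consideration could look roughly like the sketch below. It uses only pandas' own dtype helpers and is not part of DataProfiler; the function name `split_columns_by_dtype` and the `customer` column are hypothetical.

```python
import pandas as pd
from pandas.api import types as pdt


def split_columns_by_dtype(df: pd.DataFrame):
    """Split column names into numeric and non-numeric, based on pandas dtypes."""
    numeric, other = [], []
    for name in df.columns:
        column = df[name]
        # is_numeric_dtype also accepts booleans, so exclude them explicitly.
        if pdt.is_numeric_dtype(column) and not pdt.is_bool_dtype(column):
            numeric.append(name)
        else:
            other.append(name)
    return numeric, other


df = pd.DataFrame(
    {
        "nb_orders": [17, 9, 1, 4, 2],           # integer column from the issue
        "customer": ["a", "b", "c", "d", "e"],   # hypothetical text column
    }
)
numeric_cols, other_cols = split_columns_by_dtype(df)
```

A caller could then send only `numeric_cols` to numerical profiling, at the cost of trusting the dtype that pandas inferred at load time.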