
Add a transformer that uses heuristics to parse string columns that contain mixed strings like 10$/100 kg

Open rcap107 opened this issue 5 months ago • 16 comments

Problem Description

Sometimes dataframes contain numerical data that includes a string to denote the unit of measurement, like a column about money that has "$ 10", or a column that contains percentages like "29%".

As a result, it is necessary to write a custom function that parses the string, then either trims out the unit, or splits the column into two.

Feature Description

It would be nice to have a transformer that implements a few heuristics to do this automatically.

For example, it could use a regex to detect number<space>string or string<space>number patterns.

We might want to allow the user to provide the regex themselves.

I think this would be especially convenient with ApplyOnCols.

It could also be an optional step in the Cleaner, e.g., by "extending" the ToFloat transformer to also handle common formats. If we go in this direction, it probably should not take custom formats and just use whatever heuristics we come up with.
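As a rough illustration of the regex heuristic described above, here is a minimal sketch; all names (e.g., `split_value_unit`) are hypothetical and not part of skrub:

```python
import re

# Two hypothetical patterns: "<number><space?><unit>" and "<unit><space?><number>".
NUM = r"(?P<value>\d+(?:\.\d+)?)"
UNIT = r"(?P<unit>[^\d\s]+)"
PATTERNS = [
    re.compile(rf"^\s*{NUM}\s*{UNIT}\s*$"),  # e.g. "29%", "100 kg"
    re.compile(rf"^\s*{UNIT}\s*{NUM}\s*$"),  # e.g. "$ 10"
]

def split_value_unit(s):
    """Return (float value, unit string), or None when nothing matches."""
    for pattern in PATTERNS:
        match = pattern.match(s)
        if match:
            return float(match["value"]), match["unit"]
    return None

print(split_value_unit("$ 10"))    # (10.0, '$')
print(split_value_unit("29%"))     # (29.0, '%')
print(split_value_unit("100 kg"))  # (100.0, 'kg')
```

Letting the user pass their own regex would then amount to letting them replace or extend the `PATTERNS` list.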

Alternative Solutions

No response

Additional Context

No response

rcap107 avatar Jul 23 '25 19:07 rcap107

rcap107 created an issue (skrub-data/skrub#1542)

Problem Description Sometimes dataframes contain numerical data that includes a string to denote the unit of measurement, like a column about money that has "$ 10". As a result, it is necessary to write a custom function that parses the string, then either trims out the unit, or splits the column into two. Feature Description It would be nice to have a transformer that implements a few heuristics to do this automatically. For example, it could use a regex to detect numbers string , or stringnumbers . We might want to allow the user to provide the regex themselves. I think this would be especially convenient with ApplyOnCols. It could also be an optional step in the Cleaner, e.g., by "extending" the ToFloat transformer to also handle common formats. If we go in this direction, it probably should not take custom formats and just use whatever heuristics we come up with. Alternative Solutions No response Additional Context No response


GaelVaroquaux avatar Jul 23 '25 20:07 GaelVaroquaux

@GaelVaroquaux your reply is broken

rcap107 avatar Jul 23 '25 20:07 rcap107

Yes, sorry, I'm in transit and messing things up.

I wanted to say: for now, this is a good use case for DataOps, isn't it?

GaelVaroquaux avatar Jul 24 '25 08:07 GaelVaroquaux

The reason I say this is that I wonder whether we should strive to implement every dataframe operation as a Transformer. Those that don't have a state that is "fit" don't need separate fit and transform steps.

GaelVaroquaux avatar Jul 24 '25 08:07 GaelVaroquaux

I'm not sure I understand the point here: is it that the feature should not be implemented as a "transformer" object (meaning the feature is useful, just not as a transformer), or that the feature itself is not useful because it can be implemented with DataOps?

I don't think we should use "it can be done with DataOps" as the criterion to decide whether to add encoders or not, because DataOps can do anything (for better and for worse).

My idea for this feature is to save users the effort of writing up the heuristic, and I think that would fit how a lot of the current transformers are already used in the library. It would be another "convenience transformer" to add to the list of transformers that are already available.

rcap107 avatar Jul 24 '25 12:07 rcap107

We discussed this object a bit with @Vincent-Maladiere

We both agree that the feature should be available somewhere, at least with some simple heuristics (e.g., to parse "$10" or "40 kg"), but the specifics of the implementation are not obvious.

One idea is to implement something similar to transformers like ToFloat, and to add it to the list of transformations applied by the Cleaner and TableVectorizer, so that columns are parsed by default when the data is prepared. Something to consider with this approach is that very often all values in a column share the same unit of measurement, so if we keep both the numerical part and the "label" part of the string, we might be adding a constant column. If this transformation is applied before DropUninformative, the constant column can be removed, but that may not be obvious to the user.

Another alternative is to have a specific transformer that can be accessed by the user and applied with ApplyToCols, or with DataOps on the specified columns. This has the advantage of letting the user select the columns they want to run the heuristic on (useful when they know which columns contain currencies, for example), as well as pass their own heuristics.

Another idea is to implement the logic as a simple function rather than a transformer. In this case there is no state to save, so there should be no practical difference between a function and a transformer.
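Since the operation is stateless, the two options can even coincide: a plain parsing function can be wrapped into a transformer when needed. A minimal sketch using scikit-learn's FunctionTransformer (the `strip_unit` helper is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stateless parser: keep only the leading number of each
# string column (the unit is simply dropped in this sketch).
def strip_unit(df):
    return df.apply(
        lambda col: col.str.extract(r"(\d+(?:\.\d+)?)")[0].astype(float)
    )

df = pd.DataFrame({"weight": ["4 kg", "2.3 kg"]})

# Usable directly as a function...
print(strip_unit(df))

# ...or wrapped as a (stateless) transformer for use in a pipeline.
transformer = FunctionTransformer(strip_unit)
print(transformer.fit_transform(df))
```

Here fit is a no-op, which is exactly the "no state to save" situation described above.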

I still think we should have this feature as it would be another "we coded it for you" feature to add to skrub's collection, but I am not sure how to add it.

rcap107 avatar Jul 28 '25 13:07 rcap107

I'm going to start working on this during the skrub x WiMLDS sprint

gabrielapgomezji avatar Oct 29 '25 08:10 gabrielapgomezji

awesome @gabrielapgomezji! I think a first step could be looking for example datasets with such columns. I suspect that in many cases the values won't be quite as clean as "40 kg" but will look more like "less than 5", "1 y 1/2", "45 x 25 x 10", etc., so looking at examples and seeing how many can be addressed by a simple regex would be useful.

jeromedockes avatar Oct 29 '25 10:10 jeromedockes

> it won't be quite as clean as "40 kg" but look more like "less than 5", "1 y 1/2", "45 x 25 x 10",

But do we want to address these cases for now? I fear that we would have to deploy a lot of effort and come up with fragile solutions, while addressing the "40 kg" problem is probably already useful.

That said, I agree with Jerome that having a good set of examples is really useful to specify what we want to do.

GaelVaroquaux avatar Oct 29 '25 10:10 GaelVaroquaux

> > it won't be quite as clean as "40 kg" but look more like "less than 5", "1 y 1/2", "45 x 25 x 10",
>
> But do we want to address these cases for now? I fear that we are going to have to deploy a lot of efforts, and come up with fragile solutions, while addressing the "40 kg" problem is probably already useful.
>
> That said, I agree with Jerome that having a good set of examples is really useful to specify what we want to do.

I agree with @GaelVaroquaux that we should start with simple heuristics where we check a limited set of possible symbols (currencies, %, possibly cases like "30 kg" where a number and a single unit string are separated by a space).

I also don't think this should be on by default: it should be the user who decides to apply this transformer to specific columns where the syntax isn't too complicated (e.g., with ApplyToCols). That should also limit the complexity of the cases we need to consider.
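For illustration, the restricted heuristic could look something like this sketch (hypothetical names, not a skrub API; only a few currency symbols, "%", and a single trailing unit word are recognized):

```python
import re

# Hypothetical pattern: optional currency prefix, a number, and an
# optional "%" or unit-word suffix.
_PATTERN = re.compile(
    r"^\s*(?P<cur>[$€£]?)\s*(?P<num>\d+(?:\.\d+)?)\s*(?P<suffix>%|[a-zA-Z]+)?\s*$"
)

def parse_restricted(s):
    """Return (value, unit) for simple cases like "$10", "29%", "30 kg"."""
    m = _PATTERN.match(s)
    if m is None:
        return None
    value = float(m["num"])
    unit = m["cur"] or m["suffix"] or ""
    if unit == "%":
        value /= 100  # report percentages as fractions
    return value, unit

print(parse_restricted("$10"))    # (10.0, '$')
print(parse_restricted("29%"))    # (0.29, '%')
print(parse_restricted("30 kg"))  # (30.0, 'kg')
```

Anything outside this fixed vocabulary would simply be left untouched, which keeps the behavior predictable.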

I'll look for some examples; I think they should be fairly common on Kaggle.

rcap107 avatar Oct 29 '25 10:10 rcap107

> But do we want to address these cases for now?

I don't think we ever want to address these cases (they are too varied to come up with a good solution). My point is that we should have an idea of the ratio of "40 kg" cases vs. messier ones "in the wild", because if the case we can address only happens 5% of the time, it might not be worth adding a transformer for it. When a value is as simple as number + unit, quite frequently the column is already numeric, with the unit in the column name (or not written anywhere and left for consumers of the dataset to guess 😅). So what causes a column to be text with units is often exactly the fact that it is more complex than number + unit, which leads the creators of the dataset to resort to semi-natural language to describe some of the values.

jeromedockes avatar Oct 29 '25 10:10 jeromedockes

Also if the simple case is very easily handled with pandas or polars directly, for example with something like

>>> df
    weight
0     4 kg
1  2.3 kg 
>>> df['weight'].str.extract(r'(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\w*)\s*').astype({'value': float})
   value unit
0    4.0   kg
1    2.3   kg

we should make sure that the skrub function is not too "shallow", i.e., that it doesn't merely provide the same functionality as pandas .str without a significantly higher-level abstraction or a significantly simpler interface. Otherwise it may be easier for users to just write the pandas one-liner than to figure out how to use the skrub function, especially if the skrub one fails to cover a good proportion of their use cases.

When looking at examples, it would be useful to write out the full processing that would be done for each dataset. For example, one might want to convert all values to a common unit to get a meaningful numerical column; in that case the parsing might be only a small part of the needed custom processing. That can be either a problem (the transformer is less useful if it only does a small fraction of the work) or an opportunity for skrub to provide deeper functionality / more value (e.g., offering to convert measurements that are in a recognizable unit, say all SI units).
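For instance, the "deeper functionality" idea could combine the pandas extraction above with a conversion table; a sketch under the assumption of a made-up weight column and a hypothetical unit table:

```python
import pandas as pd

# Hypothetical conversion table to a common unit (kilograms).
TO_KG = {"g": 0.001, "kg": 1.0, "t": 1000.0}

df = pd.DataFrame({"weight": ["500 g", "2.3 kg", "0.1 t"]})
parts = df["weight"].str.extract(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\w+)")
df["weight_kg"] = parts["value"].astype(float) * parts["unit"].map(TO_KG)
print(df)
```

The parsing itself is the easy part; the added value would come from shipping (and maintaining) the conversion tables.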

jeromedockes avatar Oct 29 '25 11:10 jeromedockes

> I don't think we ever want to address these cases (they are too varied to come up with a good solution). My point is we should have an idea of what is the ratio of "40kg" vs more messy ones "in the wild",

I'm with you :).

In my hospital data, I run into a lot of unit problems.

GaelVaroquaux avatar Oct 29 '25 13:10 GaelVaroquaux

A few examples from kaggle:

  • https://www.kaggle.com/code/khotijahs1/cars-price-prediction
  • https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data
  • https://www.kaggle.com/datasets/abdulmalik1518/cars-datasets-2025
  • https://www.kaggle.com/datasets/ayushparwal2026/cars-dataset
  • https://www.kaggle.com/datasets/abdelrahmanemad594/laptop-prices
  • https://www.kaggle.com/datasets/mohamedasak/imdb-top-250-movies

All these contain at least a few columns that include units, some would require more complex preprocessing than others

rcap107 avatar Oct 29 '25 13:10 rcap107

I found some more datasets in this repo that contain a number and a string in the same column. Some are simple:

  • A simple location label (values in the column are "aisle 4", "aisle 5") https://github.com/eyowhite/Messy-dataset/blob/main/warehouse_messy_data.csv | https://github.com/eyowhite/Messy-dataset/blob/main/cleaned_warehouse_messy_data.csv
  • Income (the only remark is that some values come with a $ and others without) https://github.com/eyowhite/Messy-dataset/blob/main/messy_IMDB_dataset.csv

But other datasets are more complex examples where there is not only a value and its unit but a range, which might not be what we're interested in.

  • Years ("5-7 years") https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/edit?resourcekey=&gid=1625408792#gid=1625408792 (https://oscarbaruffa.com/messy/)
  • Size range (values in the size column are like "1001 to 5000 employees"; values in the revenue column are like "10 billions to 100 billions $") https://github.com/eyowhite/Messy-dataset/blob/main/Cleaned_DS_jobs2.csv

gabrielapgomezji avatar Oct 29 '25 13:10 gabrielapgomezji

The idea for a first simple version would be to create a transformer class to which we pass a dictionary of units and conversion factors:

unit_dictionary = {
    "km": 1000,
    "m": 1,
    "mm": 0.001,
}

and to follow these steps for each value:

lower_row = row.lower()
value, unit = parse_string(lower_row)
result = value * unit_dictionary[unit]

column_name = f"{row_name}_{unit}"
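Put together, the steps above could be sketched as a small scikit-learn-style transformer; all names here are hypothetical and the actual design is still open:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class UnitConverter(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: parse "<value> <unit>" strings and convert
    them to a base unit using a user-provided conversion dictionary."""

    def __init__(self, unit_dictionary):
        self.unit_dictionary = unit_dictionary

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        out = {}
        for name in X.columns:
            # lower-case, then split each value into number and unit
            parts = X[name].str.lower().str.extract(
                r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\w+)"
            )
            factors = parts["unit"].map(self.unit_dictionary)
            out[name] = parts["value"].astype(float) * factors
        return pd.DataFrame(out, index=X.index)

unit_dictionary = {"km": 1000, "m": 1, "mm": 0.001}
df = pd.DataFrame({"distance": ["3 km", "250 M", "40 mm"]})
print(UnitConverter(unit_dictionary).fit_transform(df))
```

Unrecognized units would come out as NaN with this sketch (because of `map`); whether to raise, warn, or pass them through unchanged is one of the design questions left open above.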

gabrielapgomezji avatar Oct 29 '25 15:10 gabrielapgomezji