woodwork
Add a logical type Currency
Currently woodwork will detect data of the form
$1.234
$5.678
...
as "Natural Language". It would be helpful if we created a currency type so that this sort of data gets picked up as numeric by default.
- This issue needs to handle removing the currency symbol from the column.
- We need to design how to parse, infer, and clean up this Logical Type.
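As a starting point for discussion, here is a minimal sketch of what US-only detection and cleanup could look like. It uses only the standard library; `USD_PATTERN`, `parse_usd`, `looks_like_usd`, and the 0.95 threshold are hypothetical names and values for illustration, not existing woodwork API.

```python
import re

# Hypothetical pattern (US conventions only): optional sign, "$", digits with
# optional comma thousands groups, and an optional fractional part.
USD_PATTERN = re.compile(r"^\s*(-?)\$\s*(\d{1,3}(?:,\d{3})*|\d+)(\.\d+)?\s*$")


def parse_usd(text):
    """Return the numeric value of a US-style dollar string, or None."""
    match = USD_PATTERN.match(text)
    if match is None:
        return None
    sign, whole, fraction = match.groups()
    # Drop the symbol and thousands separators, keep sign and fraction.
    return float(sign + whole.replace(",", "") + (fraction or ""))


def looks_like_usd(values, threshold=0.95):
    """Infer 'currency' when most non-null values match the pattern."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    hits = sum(parse_usd(v) is not None for v in present)
    return hits / len(present) >= threshold
```

With this sketch, `looks_like_usd(["$1.234", "$5.678", None])` is true and `parse_usd("$1.234")` gives `1.234`, so the column from the description would come out numeric once the symbol is stripped.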
Which currencies? Currency math typically operates using decimals. Would these types use fixed point storage?
Building on the questions raised by @BoopBoopBeepBoop, we would also need to think about the different currency symbols used around the world, as well as the different conventions for commas and decimal points, unless we wanted this to be a US-only inference.
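To make the separator problem concrete, here is a small sketch that parses the same string under two explicit conventions and keeps the result as a `Decimal`. The `parse_amount` helper and its parameters are hypothetical; it assumes the currency symbol has already been removed and that the caller knows which convention applies.

```python
from decimal import Decimal


def parse_amount(text, thousands_sep=",", decimal_sep="."):
    """Parse a bare amount string into a Decimal under explicit separator rules."""
    # Remove thousands separators first, then normalise the decimal mark to ".".
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


# The same text means different amounts depending on the convention.
us_style = parse_amount("1.234")                                    # Decimal("1.234")
de_style = parse_amount("1.234", thousands_sep=".", decimal_sep=",")  # Decimal("1234")
```

The second call shows why inference cannot rely on the text alone: `1.234` is a little over one unit under US conventions and over a thousand under German ones. Using `Decimal` here also avoids binary rounding, since `Decimal("0.10") + Decimal("0.20") == Decimal("0.30")` holds while `0.1 + 0.2 == 0.3` does not for floats.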
@BoopBoopBeepBoop yeah good questions! I think the physical type for logical type "Currency" should be a float.
@thehomebrewnerd could something like this library help with parsing and with internationalization?
I do think there's value to be had in starting with one localization (e.g. USD), to get the feature working, and to then add support for other localizations.
It occurs to me, it would be cool if woodwork tagged currency columns with the name of the currency they use!
If a user passes a column with datetime values [12/12/12, 1/1/1, 2/2/2], we will infer/change these values to be a datetime and add Hour, Minute, Second.
So passing a column with currency symbols, changing to a Float, and storing the currency information elsewhere is not a radically different behavior.
In the finance industry it's rare to have a numerical value and currency stored in a single feature.
Instead, you'd normally have something like Dividend_currency (ISO 4217 three-letter code) and Dividend_amount (numeric).
Dozens of countries use "dollar" as their unit so "$300" is very ambiguous about which currency the user means.
There would be value in
- creating a logical woodwork feature to store and detect currency codes (ISO 4217 codes)
- making a feature primitive in Featuretools that detects column pairs like the above and automatically converts them to USD using a recent exchange rate. That would permit rows in different currencies to be compared more fairly based on value. For example, 14000 IDR is around 1 USD, so unless we convert to USD the IDR value looks like something "high value".
I am happy to work on (1), the currency-code logical type. I'll open a new feature for this.
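A rough sketch of both ideas, using plain Python lists rather than woodwork or Featuretools objects. Everything here is an assumption for illustration: `KNOWN_CODES` is a tiny subset of ISO 4217, `RATES_PER_USD` holds made-up static rates (the IDR figure is just the rough value mentioned above), and the function names are hypothetical.

```python
# Small illustrative subset; a real check would use the full ISO 4217 list.
KNOWN_CODES = {"USD", "EUR", "GBP", "JPY", "IDR"}

# Illustrative rates (units of currency per 1 USD), not real market data.
RATES_PER_USD = {"USD": 1.0, "IDR": 14000.0}


def is_currency_code_column(values):
    """True if there is at least one value and every non-null value is a known code."""
    present = [v for v in values if v is not None]
    return bool(present) and all(v in KNOWN_CODES for v in present)


def to_usd(amounts, codes, rates=RATES_PER_USD):
    """Convert paired amounts and codes to USD; codes without a rate give None."""
    return [
        amount / rates[code] if code in rates else None
        for amount, code in zip(amounts, codes)
    ]
```

For example, `to_usd([14000.0, 5.0], ["IDR", "USD"])` yields `[1.0, 5.0]`, which puts the two rows on a comparable scale. Where the rates come from and how stale they are allowed to be is left open.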
@willsmithorg You would have to tag the numeric column, perhaps with a currency tag, to specify that it holds currency values. This semantic tag (currency) could work with Double OR Integer. Once you have this, the Featuretools primitive could be defined like this:
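from featuretools.primitives import TransformPrimitive
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import CurrencyCode  # assumes the proposed CurrencyCode logical type exists

class CurrencyToUSD(TransformPrimitive):
    name = "currency_to_usd"
    input_types = [ColumnSchema(semantic_tags={'currency', 'numeric'}), ColumnSchema(logical_type=CurrencyCode)]
    return_type = ColumnSchema(semantic_tags={'numeric'})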
Another way to tackle this problem is to turn currency into a tuple of values, similar to LatLong. Each value could then have a corresponding currency symbol/code.
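One way to picture the tuple idea is a plain named tuple pairing an amount with its code. This is only a sketch of the data shape, not a proposed woodwork type; the `Money` class and its field names are hypothetical.

```python
from decimal import Decimal
from typing import NamedTuple


class Money(NamedTuple):
    """An amount paired with the currency code it is denominated in."""
    amount: Decimal
    currency: str


# Two "$300" values that are no longer ambiguous once the code travels with them.
column = [Money(Decimal("300"), "USD"), Money(Decimal("300"), "CAD")]

# The tuple can still be split back into a numeric column and a code column.
amounts = [m.amount for m in column]
currencies = [m.currency for m in column]
```

The trade-off compared with two separate columns is that each value stays self-describing, at the cost of needing an unpacking step before any numeric work.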
A related article that might be useful when working on a solution for this: https://cs-syd.eu/posts/2022-08-22-how-to-deal-with-money-in-software