Add a logical type Currency

Open dsherry opened this issue 3 years ago • 8 comments

Currently woodwork will detect data of the form

$1.234
$5.678
...

as "Natural Language". It would be helpful if we created a currency type so that this sort of data gets picked up as numeric by default.

  • This issue needs to handle removing the currency symbol from the column.
  • We need to design how to parse, infer, and clean up this Logical Type.
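As a rough illustration of the symbol-removal step, here is a minimal sketch that assumes US-style formatting only (a leading `$`, `,` as the thousands separator, `.` as the decimal point). The name `parse_usd` is hypothetical and not part of woodwork.

```python
import re
from typing import Optional

# Assumption: US-style strings only, e.g. "$1.234" or "$1,234.56".
_USD_PATTERN = re.compile(r"^\$\s?(\d{1,3}(?:,\d{3})*|\d+)(\.\d+)?$")


def parse_usd(text: str) -> Optional[float]:
    """Return the numeric amount, or None if the text is not a USD string."""
    match = _USD_PATTERN.match(text.strip())
    if match is None:
        return None
    whole = match.group(1).replace(",", "")  # drop thousands separators
    fraction = match.group(2) or ""          # keep the decimal part, if any
    return float(whole + fraction)
```

Returning `None` for non-matching strings would let an inference routine count how many values in a column parse cleanly before deciding on a type.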

dsherry avatar Apr 02 '21 15:04 dsherry

Which currencies? Currency math typically operates using decimals. Would these types use fixed point storage?
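To make the storage question concrete, here is a small sketch using only the Python standard library. It compares binary floats with `decimal.Decimal` and with integer minor units (cents); it does not describe anything woodwork currently does.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
float_is_exact = float_total == 0.3

# Decimal keeps base-10 digits, so the same sum is exact.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_is_exact = decimal_total == Decimal("0.30")

# A fixed-point alternative: store integer minor units (here, cents).
cents_total = 10 + 20
```

Here `float_is_exact` is `False` while `decimal_is_exact` is `True`, which is why currency math usually avoids floats.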

BoopBoopBeepBoop avatar Apr 02 '21 15:04 BoopBoopBeepBoop

Building on the questions raised by @BoopBoopBeepBoop, we would also need to think about the different currency symbols used globally, as well as the different conventions for commas and decimal points around the world, unless we wanted this to be a US-only inference.
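The separator problem can be sketched as follows. The helper `parse_amount` and its parameters are hypothetical; it simply takes the decimal and grouping separators as explicit arguments instead of guessing them.

```python
from decimal import Decimal


def parse_amount(text: str, decimal_sep: str = ".", group_sep: str = ",") -> Decimal:
    """Parse a digit string using explicit decimal and grouping separators."""
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


us_value = parse_amount("1,234.56")
de_value = parse_amount("1.234,56", decimal_sep=",", group_sep=".")

# The same string means different things under different conventions.
ambiguous_as_us = parse_amount("1.234")
ambiguous_as_de = parse_amount("1.234", decimal_sep=",", group_sep=".")
```

Note that the `1.234` from the issue description parses as 1.234 under the US convention and as 1234 under the other, so the convention cannot be inferred from a single value.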

thehomebrewnerd avatar Apr 02 '21 15:04 thehomebrewnerd

@BoopBoopBeepBoop yeah good questions! I think the physical type for logical type "Currency" should be a float.

@thehomebrewnerd could something like this library help with parsing and with internationalization?

I do think there's value to be had in starting with one localization (e.g. USD), to get the feature working, and to then add support for other localizations.

It occurs to me, it would be cool if woodwork tagged currency columns with the name of the currency they use!

dsherry avatar Apr 02 '21 16:04 dsherry

If a user passes a column with datetime values [12/12/12, 1/1/1, 2/2/2], we will infer/change these values to be a datetime and add Hour, Minute, Second. So passing a column with currency symbols, changing it to a Float, and storing the currency information elsewhere is not a radically different behavior.

gsheni avatar Apr 28 '21 16:04 gsheni

In the finance industry it's rare to have a numerical value and currency stored in a single feature.

Instead, normally you'd have something like Dividend_currency (ISO 4217 three-letter code) and Dividend_amount (numeric).

Dozens of countries use "dollar" as their unit so "$300" is very ambiguous about which currency the user means.

There would be value in

  1. creating a woodwork logical type to store and detect currency codes (ISO 4217 codes)
  2. making a feature primitive in Featuretools that detects column pairs like the above and automatically converts them to USD using a recent exchange rate. That would permit rows in different currencies to be compared more fairly based on value. For example, 14000 IDR is around 1 USD, so unless we convert to USD the IDR value looks like something "high value".
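The conversion in (2) can be sketched in plain Python. The rate table below is illustrative only: the IDR figure is the rough value quoted above, and a real primitive would need an actual exchange-rate source. The names `UNITS_PER_USD` and `to_usd` are hypothetical.

```python
from decimal import Decimal

# Illustrative rates only (units of currency per 1 USD). The IDR figure is the
# rough value quoted in the comment; this is not a real rate feed.
UNITS_PER_USD = {"USD": Decimal("1"), "IDR": Decimal("14000")}


def to_usd(amount: Decimal, currency_code: str) -> Decimal:
    """Convert an amount to USD; raises KeyError for unknown codes."""
    return amount / UNITS_PER_USD[currency_code.upper()]


# Each row pairs an amount column with an ISO 4217 currency-code column.
rows = [(Decimal("14000"), "IDR"), (Decimal("300"), "USD")]
usd_values = [to_usd(amount, code) for amount, code in rows]
```

After conversion the 14000 IDR row is worth about 1 USD, so it no longer looks larger than the 300 USD row.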

I am happy to work on (1). I'll open a new feature for this.

willsmithorg avatar Dec 28 '21 21:12 willsmithorg

@willsmithorg You would have to tag the numeric column, perhaps with a currency tag, to specify that it holds currency values. This semantic tag (currency) could work with Double or Integer. Once you have this, the Featuretools primitive could be defined like this:

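from featuretools.primitives import TransformPrimitive
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import CurrencyCode  # assumes the CurrencyCode logical type proposed above exists

class CurrencyToUSD(TransformPrimitive):
    name = "currency_to_usd"
    input_types = [ColumnSchema(semantic_tags={'currency', 'numeric'}), ColumnSchema(logical_type=CurrencyCode)]
    return_type = ColumnSchema(semantic_tags={'numeric'})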

gsheni avatar Dec 29 '21 16:12 gsheni

Another way to tackle this problem is to turn currency into a tuple of values, similar to LatLong. Each value could then have a corresponding currency symbol/code.
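A minimal sketch of the tuple idea, using a standard-library `NamedTuple`. The `Money` name and its fields are hypothetical and only illustrate pairing each amount with its code, in the way a LatLong value pairs two coordinates.

```python
from decimal import Decimal
from typing import NamedTuple


class Money(NamedTuple):
    """Hypothetical (amount, currency code) pair, analogous to a LatLong tuple."""

    amount: Decimal
    currency_code: str  # ISO 4217 code, e.g. "USD"


column = [Money(Decimal("300"), "USD"), Money(Decimal("14000"), "IDR")]

# Because every value carries its own code, mixed-currency columns stay unambiguous.
codes = {value.currency_code for value in column}
```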

gsheni avatar Dec 29 '21 16:12 gsheni

A related article that might be useful when working on a solution for this: https://cs-syd.eu/posts/2022-08-22-how-to-deal-with-money-in-software

thehomebrewnerd avatar Aug 25 '22 15:08 thehomebrewnerd