lifetimes
issue about the meanings of monetary value?
Hello,
It seems that I found an inconsistency in the meaning of the values returned by the calibration_and_holdout_data function.
My transaction data has 3 fields: customer_id, finish_time, and monetary_value.
Then I call the calibration_and_holdout_data function:
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(
    transaction_data,
    'passenger_id',
    'finish_time',
    calibration_period_end='2019-07-01',
    observation_period_end='2019-08-01',
    monetary_value_col='monetary_value'
)
summary_cal_holdout.head()
and I get:
I found that monetary_value_cal and monetary_value_holdout mean different things. The former is the average value per day (freq='D') and the latter is the average value per order.
The details are shown below:
As we can see, monetary_value_cal for id 1 is 29.375 (calculated as sum(money) / count(distinct days)), while monetary_value_holdout for id 2 is 7.7125 (calculated as sum(money) / count(orders)).
Why are they different? I am really confused about it.
Your post is a bit confusing. From what I understand, you are referring to the difference between monetary_value_cal and monetary_value_holdout?
If that's the case, they should be different, because one calculates the monetary value with respect to the calibration (training) period and the other with respect to the holdout (testing) period.
If you're pointing to another issue, please try referring to something more specific inside the calibration_and_holdout_data function.
At any rate, please avoid posting code screenshots; they are annoying for others who try to help you solve your problem.
I mean another issue. The difference I mentioned is not about the two periods; it is about the groupby.
In the calibration_and_holdout_data function, the following code calculates monetary_value_holdout, and it groups by customer_id_col only:
if monetary_value_col:
    holdout_summary_data["monetary_value_holdout"] = holdout_transactions.groupby(customer_id_col)[
        monetary_value_col
    ].mean()
However, monetary_value_cal comes from the following call, which ends up grouping by both customer_id_col and datetime_col:
calibration_summary_data = summary_data_from_transaction_data(
    calibration_transactions,
    customer_id_col,
    datetime_col,
    datetime_format=datetime_format,
    observation_period_end=calibration_period_end,
    freq=freq,
    monetary_value_col=monetary_value_col,
)
summary_data_from_transaction_data calls the helper function _find_first_transactions, which groups by datetime_col and customer_id_col:
period_groupby = transactions.groupby([datetime_col, customer_id_col], sort=False, as_index=False)
if monetary_value_col:
    # when we have a monetary column, make sure to sum together any values in the same period
    period_transactions = period_groupby.sum()
Back to the data example I mentioned: monetary_value_cal is a customer's average value per day (the transactions are grouped by customer id and datetime and summed, and then the mean is taken), while monetary_value_holdout is the average value per order (the transactions are grouped by customer id only and the mean is taken).
Therefore, if a customer has multiple orders in one day, monetary_value_cal and monetary_value_holdout are calculated inconsistently. You can use some simple data to see that this is true.
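For example, here is a minimal sketch in plain Python that mimics the two aggregation orders described above. It does not call lifetimes, and it leaves out any other processing the library does. The sample transactions and the helper names mean_per_order and mean_per_day are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactions: (customer_id, day, monetary_value).
# Customer 1 places two orders on 2019-07-02 and one order on 2019-07-05.
transactions = [
    (1, "2019-07-02", 10.0),
    (1, "2019-07-02", 30.0),
    (1, "2019-07-05", 20.0),
]


def mean_per_order(rows):
    """Group by customer only, then average the individual orders."""
    by_customer = defaultdict(list)
    for customer_id, _day, value in rows:
        by_customer[customer_id].append(value)
    return {cid: mean(values) for cid, values in by_customer.items()}


def mean_per_day(rows):
    """Sum the orders within each (customer, day), then average the daily sums."""
    daily_totals = defaultdict(float)
    for customer_id, day, value in rows:
        daily_totals[(customer_id, day)] += value
    by_customer = defaultdict(list)
    for (customer_id, _day), total in daily_totals.items():
        by_customer[customer_id].append(total)
    return {cid: mean(totals) for cid, totals in by_customer.items()}


per_order = mean_per_order(transactions)
per_day = mean_per_day(transactions)

print(per_order[1])  # 20.0 = (10 + 30 + 20) / 3 orders
print(per_day[1])    # 30.0 = (40 + 20) / 2 days
```

The same three orders give 20.0 when averaged per order and 30.0 when averaged per day, so the two figures are only comparable for customers who never place more than one order in a period.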