Response Curve : Adstock and Carryover models
Hi team,
I used the package to fit both adstock and carryover models. I have 6 media channels, 2 extra features, and states as geos, with 3.5 years of weekly historical data. The n_eff and r_hat values look good, with no divergences. I have a couple of questions regarding the model output:
- The lag weight parameter for a couple of channels is close to 1. This is counterintuitive, as one of the channels is search related and it would not make sense for it to carry over 100% of its value week over week.
- The response curves from the model are linear for all channels (both adstock and carryover), and again it doesn't make sense for all channels to show no saturation. I checked the 'exponent' parameter for the channels and those values are also high, close to 1.
Looking for any suggestions to understand these outputs, and for any ideas to improve the model fit. Thanks!
Sounds like a good use case for media-channel-specific transform priors. The default is
dist.Beta(concentration1=2., concentration0=1.)
which places the most probable adstock at 1, although this prior is relatively weak. I believe you can override it per channel like so:
c1 = np.array([2.] * len(media_channels))
c1[ch_idx] = 5.
c0 = np.ones(len(media_channels))
c0[ch_idx] = 5.
custom_priors = {_LAG_WEIGHT: {'concentration1': c1, 'concentration0': c0}}
This gives a normal-esque shape centred at 0.5 for the channel at index ch_idx, although personally I usually find the transform hard to fit to a mono-polar posterior without biasing it high or low, at least with my data. I use channel-specific priors to encode prior beliefs of low adstock/carryover effects - i.e. lower values for _LAG_WEIGHT, _PEAK_EFFECT_DELAY and _AD_EFFECT_RETENTION_RATE - for my search channels, and higher values for my TV/display channels.
The exponent saturation function is highly biased towards no saturation by default:
_EXPONENT: dist.Beta(concentration1=9., concentration0=1.)
You can reduce this bias by lowering concentration1 to create a longer left-hand tail, and by increasing concentration0 above 1 to reduce the likelihood of an exponent of 1, i.e. no saturation. You can explore beta distribution shapes here: https://homepage.divms.uiowa.edu/~mbognar/applets/beta.html
Personally I find the exponent a restrictive transformation for replicating saturation effects if you have a prior belief of strong saturation: since the exponent x**a pivots around (1, 1), it will struggle to completely saturate a channel at higher transformed media values > 2 (200% of the mean channel value) without a very steep gradient between media levels of 0 to 100% of the mean. (Presuming you used the jnp.mean CustomScaler.)
Also note: the response curves plotted are, I believe, produced by appending the newly trialled media values to the media used in the training data, so the relative response includes adstock carryover effects from the previous media. The response curve can therefore look quite dampened. But a linear response is expected when your exponent is close to 1.
Personally I prefer using the logistic_saturation function from pymc-marketing's MMM solution to model saturation, but that might be more customisation than it's worth, and no saturation function is without its assumptions/constraints. https://github.com/pymc-labs/pymc-marketing/blob/main/pymc_marketing/mmm/transformers.py
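To make that concrete, here is a minimal sketch of loosening the exponent prior via the same custom_priors mechanism as in the adstock reply above (assuming media_channels is your list of channel names and _EXPONENT is the package's key for the exponent prior, both as used earlier in this thread):

import numpy as np

# the default Beta(9., 1.) piles mass at exponent = 1, i.e. no saturation;
# Beta(4., 2.) keeps the exponent in (0, 1) but moves the mode down to 0.75
c1 = np.ones(len(media_channels)) * 4.
c0 = np.ones(len(media_channels)) * 2.
custom_priors = {_EXPONENT: {'concentration1': c1, 'concentration0': c0}}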
@becksimpson Thanks for the suggestions. Yes, I explored using custom priors for that particular channel, and for the other channels as well. One issue I am seeing is that using custom priors leads the media coefficients to vary drastically.
One of the use cases I am trying to address with MMM is understanding how much target I can get with different budgets. When I run budget optimization, I am seeing predictions that are way off (much higher than realistic values, given the response curves are linear). I am currently checking my code and looking deeper into the budget optimizer settings.
@becksimpson
I am not sure I followed this completely - I am just starting with lightweightmmm - but is there a way here to provide custom priors for the betas for a channel, where the priors can be different per channel?
@BrianMiner Yes. The betas you are referring to - are these the beta coefficients that multiply the adstocked-and-saturated media data? The beta coefficients that roughly represent the effectiveness, or ROI, of a channel?
B x saturated(adstocked(media data))?
If so, you pass your prior belief about these values in via media_prior, so you don't have to use custom_priors. LightweightMMM models these priors as HalfNormals.
mmm.fit(media=media_data,
extra_features=extra_features,
media_prior=costs,  # typically we assume the ROI of a channel is aligned with its spend ratio
target=target,
number_warmup=1000,
number_samples=1000,
number_chains=2)
They discuss how to scale and produce these costs in more detail in the main README.md of this repository.
If you are instead talking about passing in custom priors for the media transformation parameters that follow beta distributions - for example, _LAG_WEIGHT (the carryover effect for a media channel) is modelled with a beta distribution - then that is what I was discussing above.
@becksimpson Thanks for the reply! I should have been more detailed. I am indeed talking about priors on the coefficients that get multiplied against the saturated adstock. I am looking for a way to specify the prior for each of them, different per channel, using prior knowledge gained from experimental designs as a way to calibrate the MMM.
@BrianMiner Yes, the media_prior is how LightweightMMM supports this out of the box. They give the example of using 1-mean scaled spend %'s.
This package only learns a fixed distribution for these beta coefficients, starting from a HalfNormal prior whose scale equals the values passed in media_prior. So each channel's beta coefficient i starts as HalfNormal(scale=media_prior[i]).
So if your media_data has shape (n, c), where n is the number of temporal samples and c the number of channels, then media_prior can really be any (c,) numpy array where each entry corresponds to the media data channel at the same index, i.e. order preserving. You don't have to use cost %'s. If you roughly had a preconception of each channel's ROI from experiments, you could just 1-mean scale an array of those values (see the sketch after the snippet below).
with numpyro.plate(
name="channel_media_plate",
size=n_channels,
dim=-2 if media_data.ndim == 3 else -1):
coef_media = numpyro.sample(
name="channel_coef_media" if media_data.ndim == 3 else "coef_media",
fn=dist.HalfNormal(scale=media_prior))
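For example, a minimal sketch of building such a media_prior from per-channel ROI preconceptions (the numbers here are purely illustrative):

import numpy as np

roi_guess = np.array([1.2, 0.6, 0.9])        # hypothetical expected ROI per channel, e.g. from experiments
media_prior = roi_guess / roi_guess.mean()   # 1-mean scale; order must match the media_data columns
# media_prior -> [1.333, 0.667, 1.0], then pass mmm.fit(..., media_prior=media_prior)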
The reason spend %'s are typically used is that, in the absence of experimental knowledge, they are probably the best indicators we have of expected return, and will at least tend to be proportional to it. I.e. if 50% of my spend is on FB, I'd expect a higher return from it than from Twitter, where only 5% of my spend is, since one channel being 10x as performant is quite unlikely.
I see - so you are saying .....
From an experiment, you think that channel A has an effect of roughly 0.02, i.e. each additional impression of media in channel A produces $0.02 in revenue. Then, would we divide 0.02 by the mean of the target variable (matching the model build) and next see what sigma value for a HalfNormal would produce a distribution that is around that (0.02 / mean of target) value? Does that work, or does lightweightmmm further adjust the number rather than taking exactly what is passed in?
The expected transformations of the media data partially depend on the saturation function. Hill saturation, I believe, maps inputs to the 0 -> 1 range (used in the hill_adstock model), while exponent saturation (used in the adstock and carryover models) is unbounded. I personally ignore the impact of adstock (with normalization, the default, it is largely scale preserving) and of saturation when considering an appropriate media prior. For weakly informative priors like HalfNormals, I'm of the opinion that we're mostly interested in ballpark/relative scaling, and ignoring the impact of saturation/adstock can help simplify things.
For exponent saturation in particular, the strongest prior belief of the model is near no-saturation (x ** 1 = x); therefore, with scale-preserving carryover effects, nothing other than the ROI beta coefficient maps the overall scale of the 1-mean scaled media channels to the 1-mean scaled target.
The HalfNormals that LightweightMMM uses are weakly informative priors, so the important things are:
- Try to ensure the HalfNormal prior distribution covers the "true ROI" with reasonable non-zero probability (you can check this by comparing the prior and posterior distributions of the beta coefficients post-fit; you don't want all posteriors to sit systematically far higher, or far lower, than the original priors).
- The relative scaling of the media priors should reflect our belief in the channels' relative effectiveness (i.e. FB gets 2x Twitter's scale if we believe it is 2x as effective, presuming they had the same original media scale).
- The relative scaling of the media priors should reflect their original scale (i.e. if FB is a 10x larger channel, it should have a 10x higher beta coefficient prior, as all channels are scaled to 1-mean).
The difficulty is that typically both your individual media data channels and your target are 1-mean scaled. Your ROI scale prior should represent both the original scale of the media channel (e.g. FB has 5x the scale of Twitter in terms of investment, i.e. spend) and the effectiveness of the media channel (e.g. Twitter has 2x the expected ROI per £ spent).
To match how LightweightMMM operates with total spend, I think you can take each channel's ROI estimate x total spend in that channel and 1-mean normalize the results; then you're accounting for both the expected channel effectiveness and the original spend scale. A channel with more spend gets a higher ROI prior, and a channel with higher ROI from experiments gets a higher ROI prior.
Using total spend does penalise channels that are only turned on for part of their history (they mention this in the README.md). So instead of summing/averaging spend over your whole training data to get total spend, you can take the average spend on days when the channel is turned on, i.e. the non-zero mean:
CH_PRIORS = 1_MEAN_SCALE( Avg_spend_when_channel_is_on x Expected_ROI_£_from_£_from_exps)
If your ROI experiment results are instead in £ per impression, and all your channels are measured in impressions, I'm pretty sure you could also do:
CH_PRIORS = 1_MEAN_SCALE( Avg_imp_when_channel_is_on x Expected_ROI_£_from_imp_from_exps)
This might arguably be better, as spend already encapsulates some presumption of ROI, i.e. a marketing channel can charge more per impression if there is already a belief in the market that it will lead to higher returns. (See the sketch below for both variants.)
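A minimal numpy sketch of those two recipes (assuming spend and imps are (n, c) float arrays over the training period and the roi_* arrays come from your experiments; all names are illustrative):

import numpy as np

def one_mean_scale(x):
    return x / x.mean()   # divide by the mean so the array itself has mean 1

# spend-based variant: average spend only over periods when the channel is on
avg_spend_on = np.nanmean(np.where(spend > 0, spend, np.nan), axis=0)
ch_priors = one_mean_scale(avg_spend_on * roi_per_pound)

# impression-based variant
avg_imp_on = np.nanmean(np.where(imps > 0, imps, np.nan), axis=0)
ch_priors = one_mean_scale(avg_imp_on * roi_per_imp)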
I did a similar thing for organic channels (in a slightly altered version of this package), where the priors for my organic channels were originally scaled by their 1-mean scaled average visitors per day - i.e. an organic channel's expected return is proportional to its number of daily visitors relative to the other organic channels. Later, from a different data source, I got their conversion rate for the customers we do have click data for. My prior-belief HalfNormal scales for their beta (ROI) coefficients then became some scaling of daily organic visitors x CVR for each channel.
As always, choosing appropriate priors can be an incremental process, and this is just one of many approaches.
@becksimpson Thanks for such a detailed response! I wanted to give it some thought and provide a thoughtful response.
From your reply:
- I am not sure exactly what you mean by Expected_ROI_£_from_imp_from_exps. Can you describe it?
- In 1_MEAN_SCALE(Avg_imp_when_channel_is_on x Expected_ROI_£_from_imp_from_exps), what does 1_MEAN_SCALE refer to? Does it mean dividing the value Avg_imp_when_channel_is_on x Expected_ROI_£_from_imp_from_exps by the mean of the target (i.e. revenue) from the model?
- Do you input this value 1_MEAN_SCALE(Avg_imp_when_channel_is_on x Expected_ROI_£_from_imp_from_exps) into the media_prior parameter?
What do you think about this example, am I on the right track with what a prior should look like:
- We run an experiment where media channel 'A' is tested for its incremental revenue (a geo experiment with some markets being held out and not getting the media):
- 2,520,000 impressions
- 1,120,000 in revenue generated from these impressions
- Hence, we think each impression generated 0.44 in revenue
We would expect the MMM coefficient for this media channel, which represents the revenue for each additional impression, to be around 0.44.
From the MMM dataset we are using for the model, we see the following weekly means (the MMM uses weekly data):
- Media channel 'A' impressions : 3,700,000
- Target of the model (revenue) : 8,500,000
So we should expect the prior for the model to have a lot of mass at 0.19. This comes from (1120000/8500000) / (2520000/3700000), which just calculates the revenue per impression after scaling the experiment values by the means of the model data.
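As a quick check of that arithmetic in Python (same numbers as above):

beta_prior = (1_120_000 / 8_500_000) / (2_520_000 / 3_700_000)
# beta_prior ≈ 0.1935, i.e. roughly 0.44 * (3,700,000 / 8,500,000)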
Does this sound right? We would like the MMM to not create estimates which are far from experiments. Calibrating MMMs seems really important.
@becksimpson wondered what you thought about the above?
@BrianMiner Sorry for the delayed response and the confusion:
- Expected_ROI_£_from_imp_from_exps: this is the ROI - £ of your target KPI per impression - that you'd expect based on your experiments, e.g. £0.0005 per impression. It varies per channel. How you build your priors therefore depends on whether the ROI estimates from your experiments are per impression or per £ of channel spend; those are the two options I gave. We're either multiplying a channel's average non-zero spend by its expected ROI for every £ spent, or multiplying a channel's average non-zero impressions by its expected ROI per impression, to get its relative prior.
- No. It means we divide the array by its own mean, producing an array with mean 1, using the CustomScaler they show in the example. We keep the relative effectiveness of the channels but normalize it. So let's say you have the following information (see the sketch after this list), where ROI_££ is your experimentally expected return (KPI in £) for every £ spent on a channel, and Avg_Nonzero£ is its average non-zero daily spend:
Facebook: ROI_££ = 1.2, Avg_Nonzero£ = 2000
Twitter: ROI_££ = 1.7, Avg_Nonzero£ = 3000
Google: ROI_££ = 1.3, Avg_Nonzero£ = 500
Priors = 1_mean_scale([1.2 * 2000, 1.7 * 3000, 1.3 * 500]) = 1_mean_scale([2400, 5100, 650]) = [0.88343558, 1.87730061, 0.2392638]
- Yes, although you might want to rescale it again; I have seen in their examples they often multiply by 0.15.
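A quick numpy check of that 1-mean scaling (same illustrative numbers):

import numpy as np

priors = np.array([1.2 * 2000, 1.7 * 3000, 1.3 * 500])   # [2400., 5100., 650.]
priors = priors / priors.mean()                           # -> [0.8834, 1.8773, 0.2393]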
To correct your message above:
We run an experiment where media channel 'A' is tested for its incremental revenue (a geo experiment with some markets being held out and not getting the media):
- 2,520,000 impressions
- 1,120,000 in revenue generated from these impressions
- Hence, each impression we think generated 0.44 in revenue
We would expect the MMM coefficient for this media channel, which represents the revenue for each additional impression, to be around 0.44. Let's say we do the same for channel 'B' and get 0.6.
From the MMM dataset we are using for the model, we see the following weekly means (the MMM uses weekly data):
- Media channel 'A' impressions: 3,700,000
- Media channel 'B' impressions: 2,000,000
Priors = 1_mean_scale([3,700,000 x 0.44, 2,000,000 x 0.6]) = [1.15134371, 0.84865629]. These priors capture both the scale (number of impressions) and the effectiveness of the impressions.
Remember the model sees a 1-mean-scaled target KPI (mean value 1). We pass in N media channels, each with mean value 1 post-scaling. After adstock effects the N media channels still have mean value 1, as adstock is scale preserving. After saturation effects, the Hill function produces a 1-max signal, while exponent saturation also strictly reduces values; however, in both cases the saturation functions tend to reduce the scale only minorly, so I'm going to ignore them.
You now have N channels at 1(ish) mean scale, transformed by adstock/saturation, and you are learning the ROI effect to multiply these channels by so that they sum (together with seasonality, baseline and control channels) to the target KPI, which has mean 1. The priors you pass in would be too large if they were simply multiplied by the 1-mean channels. That's why the HalfNormal is such a good distribution here: it does not strongly inform or bias the model towards coefficients that are too high, since we only use the priors to set the scale of the HalfNormal prior distributions, which still place the highest-probability values near 0.
The important thing is that the priors for your channels are correct relative to each other, based on their scale (number of impressions) and experimental ROI (the effectiveness of their impressions). How you scale these to make sense in your Bayesian model (in the example notebooks they 0.15-mean scale the array) depends on your prior belief of how much media contributes to your KPI.
I feel very dense for some reason on this topic - I love that you are talking through it with me!
I see that you are creating the priors as (avg weekly impression amount * revenue per impression from the experiment) and then normalizing the array to have a mean of 1 (dividing by the average of the array).
1. Are you saying that it would be proper to multiply these priors [1.15134371, 0.84865629] by 0.15? What does the 0.15 represent?
2. I think I see that this approach produces priors that are correct in their relative weight compared to each other, and that they are scaled to be mean-1, which I guess makes sense as the target and media all are as well. What I don't see, though, is how this produces an outcome where the betas on the media impressions (ignoring adstock and saturation) end up close to the experiment. In some testing using lightweightmmm, and coding the model myself in pymc, with no other control variables, the ROI on the media comes out much too high relative to experiments / prior knowledge. So I want to constrain the betas to essentially match, or be close to, the results of the experiments.
So, in my example, I would expect the back-transformed beta value to be 0.44, because every impression should cause 0.44 in revenue. So I feel like I want the prior on the beta to have a lot of mass at that value, albeit in the standardized space of the media and the outcome of the model (revenue).
If I have 50,000 impressions of media, which would be 50000/3700000 in 1-mean media space, you'd multiply it in the model by beta (= 0.19), which equals 0.00256757. Given the mean revenue above, to transform this out of 1-mean space into the original value you'd multiply by 8500000, which equals (0.00256757 * 8500000) 21824.345.
21824.345 / 50000 = 0.4364869
0.4364869 is the revenue per impression, which is what we saw (with round-off) in the experiment. That's why I think I want the mass of my prior at 0.19, and I want to really force the model to keep the beta close to that.
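The same back-transformation as a small Python check (numbers from the example above):

imps = 50_000
beta = 0.19                               # prior mass point in 1-mean space
scaled = (imps / 3_700_000) * beta        # ≈ 0.00256757
revenue = scaled * 8_500_000              # ≈ 21824.3 in original revenue units
revenue / imps                            # ≈ 0.4365 revenue per impression, matching the experiment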
1. Here 0.15 is what they used in their notebook simple_end_to_end_demo.ipynb to "reflect typical MMM ranges"; I've actually never been able to determine where they get that number from.
cost_scaler = preprocessing.CustomScaler(divide_operation=jnp.mean, multiply_by=0.15)
You can imagine that if you have 6 channels, 1-mean scaling the 6 channels' ROI priors means the priors sum to 6. If these ROI coefficients are multiplied by the 1-mean scaled media channels (ignoring saturation and adstock), their total impact will be significantly more than your 1-mean scaled target: 1 < ~6. So you might want to reduce the scale of these ROI coefficients to bring them into a more reasonable range. Importantly, it is better to go too high than too low, given these are HalfNormal distributions. (I personally do something a little different: I divide my priors by their sum so they sum to 1, then multiply them by 0.5. But that's because in my modelling the attribution credit I'm trying to give to paid media is in the region of 10-25%, varying across markets, and I don't have ROI information from experiments, so I'm generous with the range of ROIs. I also have other priors, based on visitor volume and CVR, for organic channels, which I determine separately.) Essentially, if you believe that paid media is driving, say, 80% rather than 20% of your KPI, you need higher priors; if you are modelling more channels, you probably need a lower multiplier. This is a case for iterative modelling though: if all your media channel posteriors (for ROI) end up significantly lower than your priors, that's an indicator you've set them too high, and vice versa.
2. Yes, I see what you mean: if you want to be quite prescriptive, 0.19 makes sense, in that in the scaled world it reflects your ROI experiment number. Two things to bear in mind: a) a HalfNormal prior with scale 0.19 will bias the posterior to be < 0.19, due to the shape of the distribution; b) while we've been ignoring saturation, the Hill function at the very least transforms the media channels to a 1-max scaled range, so it will shrink your channel. If you want to force the ROI to be near the experiment, you might want to investigate making the priors stronger, i.e. non-HalfNormal, such as a truncated gamma distribution with mean 0.19 and range (0.095, 0.38), i.e. (0.5x, 2x) - though all of this introduces bias into your model. Myself, I've been using truncated HalfNormals with range (0.25x, 4x) and generous scaling, as a compromise between the theoretical promise of Bayesian combination of evidence with prior beliefs, and the business application of an MMM on limited data, where giving 0 credit to a channel - regardless of whether any correlation is found between the signals - is not viable from a marketing perspective.
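For what it's worth, a hedged numpyro sketch of that kind of truncated prior (swapping the prior family requires editing the package's model code, so this only illustrates the distribution shape; the 0.25x/4x bounds follow my comment above):

import numpyro.distributions as dist

scale = 0.19   # hypothetical prior scale in 1-mean space
# a normal centred at 0 truncated to (0.25x, 4x) of the scale, i.e. a clipped half-normal
prior = dist.TruncatedNormal(loc=0., scale=scale, low=0.25 * scale, high=4. * scale)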
For 2 - I am glad you agree with me on the approach for creating a prescriptive prior.
What I was thinking was not to take the 0.19 and enter it into the HalfNormal per se, but to find the value which, when entered into the HalfNormal, produces a distribution with a lot of mass at 0.19.
I was thinking past experiment(s) should yield a confidence/credible interval. Say from the experiments we have a 95% interval between 0.1 and 0.23 for the revenue per impression. Then, if we had to use a HalfNormal, we could do something like this:
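One way to read "a lot of mass at 0.19" is to put the HalfNormal's median there; a scipy sketch (my interpretation, not the package's behaviour):

from scipy import stats

sigma = 0.19 / stats.halfnorm.ppf(0.5)       # halfnorm median ≈ 0.6745 * sigma, so sigma ≈ 0.282
stats.halfnorm.interval(0.95, scale=sigma)   # ≈ (0.009, 0.63): far wider than the (0.1, 0.23)
                                             # experiment interval, since a half-normal always
                                             # keeps most of its mass near 0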
If we were really sure of the experiment(s), we could use a gamma etc.:
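For instance, a moment-matched gamma (a hedged sketch: the shape/scale below just come from treating the (0.1, 0.23) interval as roughly mean ± 1.96 sd):

from scipy import stats

mu = (0.1 + 0.23) / 2                             # ≈ 0.165
sd = (0.23 - 0.1) / 3.92                          # ≈ 0.033
shape = (mu / sd) ** 2                            # ≈ 24.8
scale = sd ** 2 / mu                              # ≈ 0.0067
stats.gamma.interval(0.95, shape, scale=scale)    # ≈ (0.10, 0.24), close to the experiment interval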