What are media_mix_model.trace["media_transformed"] and media_mix_model.trace["coef_media"]? Understanding the media contribution calculation
I am trying to understand how the media contribution is calculated because, as explained in this question, I would like to provide fixed values for some coefficients and see how the model behaves and how the R² changes.
This is something I usually do when I work with my econometric models in other Python libraries.
To do so I was reading the documentation, and I would like to better understand what the following are:
media_mix_model.trace["media_transformed"]
media_mix_model.trace["coef_media"]
media_mix_model.trace[feature]
media_mix_model.trace["mu"]
It seems that mmm.trace[] is equivalent to mcmc.get_samples() from the NumPyro documentation, but I really struggle to understand these methods.
What do they do behind the scenes? Why do we access them, and how, for example, can I modify one media channel's contribution by manually changing its coefficient?
Thanks for helping me.
Hi, fellow LightweightMMM user here. My 2 cents: yes, the trace holds the Monte Carlo posterior samples for the coefficients / transformed media channels. In NumPyro, inside the model, we name the sample sites, providing a name (for the parameter) and fn (the prior distribution). That name is then the dictionary key in the trace. Look at the media_mix_model function in models.py.
If you want to fix the coefficients of a particular channel, once a model is trained, you can in theory overwrite the trace attribute, for the relevant dictionary entry.
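As a quick illustration (a sketch, assuming a fitted model), you can inspect which sample sites ended up in the trace and their shapes, which are explained below:

# Sketch, assuming `mmm` is a fitted LightweightMMM instance.
# The trace maps each named numpyro sample site to its posterior draws.
print(mmm.trace.keys())
# e.g. dict_keys(['coef_media', 'media_transformed', 'mu', ...])

print(mmm.trace["coef_media"].shape)         # (num_samples, n_channels)
print(mmm.trace["media_transformed"].shape)  # (num_samples, t, n_channels)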
media_transformed - the samples for the transformed media variables. Each transform function is parameterised, so if num_samples=100, when running the model you end up with 100 samples for each transform-function parameter. LightweightMMM uses a numpyro.deterministic site that applies the transform functions (adstock, saturation), for each set of sampled transform parameters (100, c), where c is the number of media channels, to each time instance. So you end up with an array of shape (100, t, c), where t is the number of training time steps: 100 samples of the transformed media, for each channel and day.
It is these "transformed media", with simulated market effects like adstock/saturation applied, that are then multiplied by "coef_media", which is just a very traditional coefficient array of shape (100, c) multiplied onto the transformed media. This is a linear additive equation, only using transformed media variables.
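To make that concrete, here is a small sketch (my own, untested) that recomputes per-channel contributions from those two trace entries:

import jax.numpy as jnp

# Assumes `mmm` is a fitted LightweightMMM instance.
media_transformed = mmm.trace["media_transformed"]  # (num_samples, t, c)
coef_media = mmm.trace["coef_media"]                # (num_samples, c)

# Each posterior sample's coefficients scale that sample's transformed media.
contributions = media_transformed * coef_media[:, None, :]  # (num_samples, t, c)

# Posterior-mean contribution per time step and channel (in model scale).
mean_contribution = jnp.mean(contributions, axis=0)  # (t, c)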
If you want to alter the contribution of the channels, it is likely "coef_media" you want to fix. For example, if I want to double the impact of TV, which is my channel at index 2 (my third channel), I would do something like this (note that JAX arrays are immutable, so use the .at[] syntax rather than in-place assignment): model.trace['coef_media'] = model.trace['coef_media'].at[:, 2].multiply(2)
@becksimpson Amazing I will try in the following days!
Hi @becksimpson @uomodellamansarda, I am new to working with MMM and MCMC. I was trying to understand why, when running the predict method on my training data, the results differ from those in media_mix_model.trace["mu"]; to be more specific, I am comparing the means of both distributions. Thanks!
@pauagustin media_mix_model.trace["mu"] comes from the training data: it is the model's predicted target across the training data.
I believe when you call predict, it appends the new data you supply to the predict function onto the training data (which it stores internally in the LightweightMMM object during fit), but only returns the predicted distribution for the new data's target, which is assumed (if media_gap=0) to immediately follow your training data. It does this to model the carryover effects from investment during the training period into your inference-time data. A lot of the time in production you'd train on your training dataset and make incremental predictions on new unseen data, so the data passed into the predict function is assumed to follow the training data temporally.
So if you ran predict using your training data as input, I believe it appended your training data to your training data, and while it does clip the output to only return predictions for the data supplied to the function, your early predictions would have carryover from the preceding training data. Therefore the means would differ.
Essentially, how you are using .predict is not how it is designed to be used. You shouldn't be passing in training data, only test data that is assumed to follow the training data.
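In other words, a minimal sketch of the intended usage (variable names are mine; argument names as in recent LightweightMMM versions):

# media_test is assumed to immediately follow media_train in time.
mmm.fit(media=media_train, media_prior=costs, target=target_train)

# Returns the predicted target distribution for media_test only, with
# carryover from the stored training media applied internally.
predictions = mmm.predict(media=media_test)  # shape: (num_samples, len(media_test))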
@becksimpson Thanks! That makes a lot of sense
@becksimpson, thanks for the explanation! I used clicks data in my model. Suppose I have 2 channels, c1 and c2, and I want to obtain the change in my KPI for a particular change in c1 (say, my clicks in c1 increase by 50%). How can I obtain that? Is the change in KPI simply 1.5 × clicks_old × B? ('B' is the beta coefficient of c1.)
@rajat-barve No, there would also be impacts from the saturation curves (performed by **exponent for the 'adstock'/'carryover' models, and by the hill transform for the 'hill_adstock' model).
One option is to run mmm.predict(..) with the channels (c1, c2) as they were in training, then with (c1 * 1.5, c2), and take the difference of the KPI predictions. Both predictions should include carryover/adstock from the training data, since LightweightMMM at prediction time appends any data you pass to predict onto the training data, but that carryover should be the same in both runs, and since we only care about the difference in KPI it will largely cancel out. The only remaining difference in output is then the change in KPI caused by the change in c1. Make sure to inverse-transform the predicted KPI difference back to the target's original scale.
This won't be exactly right, as carried-over clicks affect how the saturation function transforms the passed-in clicks by changing their scale. Another, slightly more correct, option is to prepend a sequence of zero clicks before the click data you're passing in, e.g. [[0, 0, ..., 0, *c1], [0, 0, ..., 0, *c2]]. This gives the carryover effects of your training-data clicks time to die off (say 90 days); then when you take the predicted KPI, you take it from the 90th index onwards.
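Here is a rough, untested sketch of the first option (variable names are mine), assuming a fitted mmm, a jnp array media_scaled of the scaled test-period media with c1 at index 0, and the CustomScaler target_scaler used for the target during preprocessing:

import jax.numpy as jnp

# c1 clicks increased by 50%; c2 left unchanged.
media_uplift = media_scaled.at[:, 0].multiply(1.5)

# Passing target_scaler inverse-transforms predictions to the original KPI scale.
pred_base = mmm.predict(media=media_scaled, target_scaler=target_scaler)
pred_uplift = mmm.predict(media=media_uplift, target_scaler=target_scaler)

# Posterior-mean KPI change attributable to the extra c1 clicks.
kpi_delta = jnp.mean(pred_uplift - pred_base, axis=0)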
Thank you, @becksimpson! Can you please also help me with this doubt: as you might know, by default the media priors are assumed to follow HalfNormal(scale), where the scale is governed by our input data. I have 3 media channels, and I have strong beliefs that one of them should follow a Beta distribution. Is there a way to change the prior on only one of the media channels? And also from HalfNormal to Beta?
Line 358 of models.py:
coef_media = numpyro.sample(
    name="coef_media",
    fn=dist.HalfNormal(scale=media_prior)
)
It doesn't support changing media prior distributions out of the box.
I would recommend doing something a little hacky if you want this.
If you need both HalfNormal and Beta, I would sample each channel's coefficient in turn, using media_prior[0], media_prior[1], etc. as the prior values, but changing the name to 'coef_media_0', etc. in the numpyro.sample call.
You could then use numpyro.deterministic to re-create the concatenated coef_media sample that the rest of the MMM needs to work with the established API.
Something like this:
coef_media_0 = numpyro.sample(
    name="coef_media_0",
    fn=dist.HalfNormal(scale=media_prior[0])
)
# Derive Beta parameters from the channel's prior mean using the
# alpha/beta recipe shown below.
alpha = 1.5
beta = alpha * (1 - media_prior[1]) / media_prior[1]
coef_media_1 = numpyro.sample(
    name="coef_media_1",
    fn=dist.Beta(concentration1=alpha, concentration0=beta)
)
coef_media = numpyro.deterministic(
    name="coef_media",
    value=jnp.array([coef_media_0, coef_media_1])
)
This might work, but I obviously haven't tested it. My recommendation would be to just change them all to Betas, as otherwise you're biasing one paid media channel to have a low probability of zero ROI compared to the rest. The whole enterprise of MMM is to assess channel effectiveness from the observed data, so it feels off to give such a strong bias to one channel's ROI. You'll also have to think about what the scale means for HalfNormal versus how you transform the media prior into a suitable alpha/beta for the Beta distribution. When I was experimenting with Beta distributions for my media priors, I used the fact that alpha / (alpha + beta) = mean, and that alpha=1.5 gave a curve I found realistic for media effectiveness, to transform my media priors into suitable Beta parameters with alpha=1.5:
alpha=1.5
beta = alpha * (1 - mean) / mean
dist.Beta(concentration1=alpha, concentration0=beta)
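As a quick worked instance of that recipe (my numbers), a prior mean of 0.25 gives:

import numpyro.distributions as dist

alpha = 1.5
mean = 0.25                       # hypothetical prior mean for the channel
beta = alpha * (1 - mean) / mean  # = 4.5
prior = dist.Beta(concentration1=alpha, concentration0=beta)
# Check: the distribution's mean is alpha / (alpha + beta) = 1.5 / 6.0 = 0.25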
Thanks @becksimpson !
One thing that I wanted to try is de-meaning my data to remove unobservable time-fixed effects. This naturally results in negative media data (impressions, in my case). To correct for this, I added a constant (equal to the minimum of the minimum values across all channels) to all the media data points in my dataset. Writing out the MMM equation analytically, this would imply a constant term that should get absorbed into the intercept. However, the default intercept distribution is HalfNormal(2). I am not sure how to change this prior to account for the transformation I am applying to the media columns. Any suggestions around this?
On a related point, I wanted to build an intercept-free model, so I changed the intercept's HalfNormal(2) to HalfNormal(0.0001) (it errors out on HalfNormal(0)). I thought this should effectively eliminate the intercept, but the results look weird with this custom prior. Any thoughts on why that might be?
@rajat-barve That scaling might run you into trouble. Adding a bias like that across all channels could cause a channel's transformed (scaled) value to be non-zero when it is zero in the original scale.
I'm of the impression that the two commonly proposed forms of scaling are max scaling (the highest value of the scaled channel is 1; used by pymc-marketing) and mean scaling (the mean of the scaled channel is 1; used in the LightweightMMM examples).
We do this to avoid a situation where a zero-valued channel is credited with generating registrations/KPI. We may also want to project to lower levels of investment when we examine saturation curves. However, if the scaled media is 0 while the original media is positive (as would be the case for min-max scaling, for example), the scaled media becomes negative when you drop below the historical minimum, and you would be predicting negative registrations for a lower level of investment, which violates our assumptions of how ads work. It's important that the chosen scaling method preserves the 'real world truths' we want to encode into the model, i.e. a level of ad investment should not drive negative KPI.
What do you mean by unobservable time-fixed effects? Scaling of model inputs here is just meant to bring them into understandable ranges comparable to the outputs, to make the parameters easier to fit. In frequentist settings it's typically done so that weights are proportional to a feature's significance, regularization terms are fair across features, etc.; in Bayesian settings it's so that I can more easily pick prior distributions that make sense given the relative scales of inputs and outputs.
Your best bet for an intercept-free model is to replace the numpyro.sample for the intercept with a numpyro.deterministic statement.
intercept = numpyro.deterministic(
    name=_INTERCEPT,
    value=0.0
)
Although I would recommend against this.
Thanks, @becksimpson. I want to get the "media_transformed" values of my media impressions. I am using the hill_adstock model. Is there a way to get the transformed impressions without fitting the model? mmm.trace['media_transformed'] won't do, because I believe those values are based on the posterior distributions of the K, S and lambda hyperparameters. Is there a way to get media_transformed values using the prior distributions?
@rajat-barve I believe you can run a prior predictive. You basically run infer.Predictive without passing in any posterior samples. Initialize your mmm first. This draws 100 samples from the prior distributions. It is something like this (it may be slightly wrong, as I took it from my adapted codebase):
import jax
import jax.numpy as jnp
import numpyro

prior_samples = numpyro.infer.Predictive(model=mmm._model_function, num_samples=100)(
    rng_key=jax.random.PRNGKey(seed=2),
    media_data=media_data,
    doms=doms,  # argument from my adapted codebase; drop it if your model function doesn't take it
    extra_features=extra_features,
    media_prior=jnp.array(media_prior),
    degrees_seasonality=mmm._degrees_seasonality,
    frequency=mmm._seasonality_frequency,
    transform_function=mmm._model_transform_function,
    weekday_seasonality=mmm._weekday_seasonality,
    custom_priors=custom_priors,
    target_data=None,
)["media_transformed"]
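If that runs, the returned array should have the same (num_samples, t, c) shape as the posterior media_transformed discussed earlier, just drawn from the prior distributions instead of the posterior.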
@becksimpson, thanks! Also, do you use custom priors in your MMM? I need to enforce a custom prior on the coefficient of one of my media channels (say, because I have strong evidence about this channel's contribution to revenue; suppose it is 25%). From this I can back out what the mean of my beta coefficient should be, and from that, what the "scale" parameter of the HalfNormal assumed on this beta should be. However, this new custom mean (and hence scale) is much smaller than the default mean (and scale), which is based on the total spend in each media channel (as per the LMMM documentation). I was kind of expecting this, because using total spend to decide the spread of the HalfNormal seems somewhat arbitrary to me. Nonetheless, with the smaller custom prior (i.e. scale), the media contribution the model gives me, instead of being bumped up closer to 25%, is even further away than with the default priors. I was curious to hear your thoughts on this.
@rajat-barve I'm not necessarily surprised. The signals are likely highly multi-correlated; if you applied a custom prior to a single channel that tells the model it is of a lower scale than the rest, then regardless of whether that is closer to the truth, it will likely get less credit. HalfNormals are also quite zero-biased, as you might imagine, which I do believe is what we want in general from a fairly uninformative prior. Something to consider is that, depending on the model, there can be a scale-changing impact from the learned saturation effect. I agree the 1-centered media priors chosen in the docs are generous; due to the HalfNormal priors it is probably not a big concern, and the most important thing is that the relative scales of the media channels' prior beliefs about contribution are preserved. In both pymc-marketing and LightweightMMM I see people use HalfNormals, but I know of individuals who use Betas to encode stronger prior beliefs about media contribution. I personally, for my media priors, do something more akin to dividing them so they sum to 1.0, rather than having a mean of 1.0 as shown in the docs.
However you inflate your prior belief in that channel x where you have a strong belief of 25% contribution, it is important that the other channels are scaled appropriately; otherwise you end up encoding that that channel contributes less than the rest. You could do something like: set channel x's prior to 0.25 (25% of the 1-centered channel x --> 1-centered target). Then ask: what total percentage of your target do you believe is media-driven? 80%? Then each other channel's prior is (0.8 - 0.25) / (sum of spend over channels other than x) * (that channel's spend).
If, due to the HalfNormal priors, these lead to very low media creditation, you can scale up all channel media priors proportionally. You've now encoded that 25% of the target's credit goes to channel x, that you expect all media channels to drive 80% of the credit, and that the remaining 55% of expected credit is split across the other media channels according to spend level. Say [0.25, 0.1, 0.15, 0.3].
You can even scale up these media priors so they sum to the total you would have gotten with the LMMM default method, where you 1-center-normalize the spends and multiply by 0.15.
Thanks @becksimpson! A couple of questions. To make the discussion more concrete, let's assume some preliminaries: suppose I have three media channels A, B, C, and I run the default LMMM. The contributions come out as 20% (A), 35% (B), 25% (C), 20% (baseline). Now, I believe that B is actually around 42%. You will find further comments inline below, starting with 'RB:'.
@rajat-barve I'm not necessarily surprised. The signals are likely highly multi-correlated; if you applied a custom prior to a single channel that tells the model it is of a lower scale than the rest, then regardless of whether that is closer to the truth, it will likely get less credit. HalfNormals are also quite zero-biased, as you might imagine, which I do believe is what we want in general from a fairly uninformative prior. Something to consider is that, depending on the model, there can be a scale-changing impact from the learned saturation effect. I agree the 1-centered media priors chosen in the docs are generous; due to the HalfNormal priors it is probably not a big concern, and the most important thing is that the relative scales of the media channels' prior beliefs about contribution are preserved. In both pymc-marketing and LightweightMMM I see people use HalfNormals, but I know of individuals who use Betas to encode stronger prior beliefs about media contribution. (RB: doesn't the support of a Beta lie between 0 and 1? Wouldn't that severely restrict the media beta coefficients? I would have thought a Gamma might make more sense. Also, do you know how to change the distribution altogether? How does one change from HalfNormal to Beta?)
I personally, for my media priors, do something more akin to dividing them so they sum to 1.0, rather than having a mean of 1.0 as shown in the docs. (RB: I never really understood the reason behind 1-centering the priors (or putting any other restriction on them, like the one you use). Could you help explain? What does it really mean to say "the average variation of all my media spending is 1", or, in your case, "the sum of the variation in all my media spending is 1"?)
However you inflate your prior belief in that channel x where you have a strong belief of 25% contribution, it is important that the other channels are scaled appropriately; otherwise you end up encoding that that channel contributes less than the rest. (RB: I am confused here; if I am inflating the prior (basically inputting a higher value for the 'scale' argument in numpyro.distributions.HalfNormal(scale)), why do you say it is equivalent to "encoding that channel is contributing less"? I thought it was the opposite, in fact.) Regarding scaling the other channels appropriately, I am doing a fit_transform on the new media_priors again after updating the custom priors, so after the update the average of the priors is 1 again. But I was a little skeptical about doing this fit_transform again. Consider this: suppose my new belief about channel B was obtained from an experiment in which I made a lot of effort to ensure that only channel B changed, while A and C remained at their original ('default') spend levels. Now suppose the 1-centered default media_priors were [A -> 1.2, B -> 0.8, C -> 1]. After the experiment, suppose the belief of 42% contribution for B corresponds to a prior of 1.5 (instead of the default 0.8). What I really wish to do is set the priors to [1.2, 1.5, 1], i.e. NOT rescale the media_priors. But if I rescale them, the new 1-centered custom priors become [0.97, 1.22, 0.81]. So my question is: if the priors for the other two channels A and C also end up changing, isn't this equivalent to NOT having controlled for A and C in my experiment? That is exactly the opposite of what I struggled so hard to do, i.e. change only B and not A and C. Your thoughts?)
You could do something like: set channel x's prior to 0.25 (25% of the 1-centered channel x --> 1-centered target). Then ask: what total percentage of your target do you believe is media-driven? 80%? Then each other channel's prior is (0.8 - 0.25) / (sum of spend over channels other than x) * (that channel's spend). (RB: I did not understand this part. Can you please explain the idea in a little more detail?)
@rajat-barve
- Yes, a Beta constrains the coefficient to between 0 and 1. For many this isn't an issue: if your target signal is 1-mean scaled, and even after saturation effects your predictor signals (paid media, organic) are 1-max scaled (as is the case post hill transform), then provided you have a significant intercept, trend effects, or many contributing media channels, it is unlikely that any media posterior would need to be > 1. I have run multiple MMMs where the largest single media posterior is 0.3, for example. This might not be your case, in which case you can use a Gamma. Line 348 of models.py is what you need to change:
coef_media = numpyro.sample(
    name="channel_coef_media" if media_data.ndim == 3 else "coef_media",
    fn=dist.HalfNormal(scale=media_prior))
- The purpose of the scaling is just to put prior beliefs in reasonable ranges for model convergence, and so their total contribution is in a reasonable range, while injecting prior belief about the relative contributions of the channels. What you mostly care about is that if channel a is expected to contribute 3x more than channel b, then its prior belief is 3x higher. You can do any scaling you want, really; you're just setting the scales of HalfNormal distributions. In the LMMM example, I believe the priors are divided so their mean is 1, then multiplied by 0.15. That's almost saying that, in the absence of saturation effects (ignoring the 1-max hill saturation, or using an exponent saturation with a 1.0 exponent), each paid media signal would on average drive 15% of the 1-mean target signal. I prefer to do a similar thing but divide by the sum, so that when I multiply my priors by a constant, say 0.5, I'm saying I believe paid media in total drives on average 50% of my target signal.
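As a short sketch of those two conventions (hypothetical spend numbers):

import numpy as np

spend = np.array([100_000.0, 160_000.0, 200_000.0])  # hypothetical channel spends

# LMMM-docs style: mean-1 normalize, then multiply by 0.15
# ("each channel drives ~15% of the 1-mean target on average").
prior_mean1 = spend / spend.mean() * 0.15

# Sum-to-1 style: the multiplier is the believed total paid-media share,
# e.g. 0.5 for "paid media in total drives ~50% of the target on average".
prior_sum1 = spend / spend.sum() * 0.5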
3/4. Honestly, if you're trying to encode a specific % contribution to revenue for your channel, and you have only one experiment result for one channel, don't bother fussing around with 1-mean scaling. Say you believe 42% of your revenue is driven by B. Great; ignoring saturation effects, B's media prior = 0.42. Now, what total percentage of your revenue do you believe is driven by all paid media? 50%, 80%? We're not looking for exactness here; no one knows this, we just need a ballpark. Let's say 80%. So 80 - 42 = 38% of your revenue you believe is driven by your other channels. Now split that 0.38 across your other channels by their spend ratios, as a proxy for their relative contributions in the absence of any other data. You now have your priors, say [0.13, 0.42, 0.25] if channel A has 50% of the spend of channel C, for instance (see the sketch below). We're done. Run some experiments. If, due to saturation effects, these priors are too low, and you find that consistently all posteriors sit to the right of their priors, that is a bad sign: increase all media priors by a multiplier, re-run, etc.
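A small sketch of that recipe (spend numbers are mine, chosen to reproduce the example above):

import numpy as np

# Hypothetical spends for channels A, B, C; A spends half as much as C.
spend = np.array([100_000.0, 160_000.0, 200_000.0])
b_index = 1
b_prior = 0.42             # believed contribution of channel B (from the experiment)
total_media_share = 0.80   # believed total contribution of all paid media

# Split the remaining expected contribution across the other channels by spend.
other_spend = np.delete(spend, b_index)
other_priors = (total_media_share - b_prior) * other_spend / other_spend.sum()

media_prior = np.insert(other_priors, b_index, b_prior)
print(media_prior)  # -> [0.1267 0.42 0.2533], i.e. roughly [0.13, 0.42, 0.25]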
There's no single answer to what your media priors should sum to; how much paid media contributes to the target varies significantly across industries. DTC companies infamously have extremely high percentages of their new customers driven by ads alone (bad from a business perspective, great from a data-signal perspective). Mortgage brokerages and banks probably have a lot more word of mouth, so expect lower contributions of paid media to target KPIs.