
Enabling Non-Linear Utility Functions - WTP Space

crforsythe opened this issue 2 years ago • 39 comments

Love seeing this new contribution to choice modeling software! Very impressive work. I was wondering if you had any intention to expand xlogit's ability to allow for nonlinear utility specification, such as WTP space estimation a la Train and Weeks (2005)?

crforsythe avatar May 06 '22 17:05 crforsythe

Hi @crforsythe, I am so glad you like xlogit! Yes, it would be awesome to implement non-linear utility specifications in xlogit. I have not worked extensively with non-linear utility models, but I should be able to implement them with some help. Do you have any suggestions on good resources (e.g., implementations and datasets) that I can use as a reference for non-linear mixed logit specifications in WTP space? Also, if you have some pseudo-code or a basic implementation script for this, I would be glad to provide guidance on how to integrate it into xlogit, or I could integrate it myself.

arteagac avatar May 08 '22 03:05 arteagac

@arteagac Thanks so much for the quick response! I would recommend first looking at jhelvy's R implementation, logitr. However, I believe he has only implemented normal and log-normal coefficient distributions for his mixed logit models. Additionally, Revelt and Train (1998) provide gradient and log-likelihood (LL) formulations for generic utility forms. As for datasets, logitr provides datasets that can be used as benchmarks, as do several examples in Apollo's documentation.

I would be happy to help you with the development if you'd like to discuss offline. My email address is [email protected].

crforsythe avatar May 09 '22 17:05 crforsythe

Fantastic! Thank you so much for suggesting those resources and for offering your help. I just emailed you.

arteagac avatar May 09 '22 18:05 arteagac

Hi @jhelvy. I integrated your probability and gradient formulation, and the estimation is now around 40% faster. It is impressive how much your formulation reduces the amount of computation. Nice job!

The only remaining issue is getting it to work for panel data. Could you please explain how to interpret the term highlighted in red below? I coded it as the product (across panels) of the probabilities for the chosen alternatives, but it does not seem to be correct.

[image: probability formula with the panel-level product term highlighted in red]

arteagac avatar May 14 '22 02:05 arteagac

Wow that was a fast implementation - 40% for a day's work isn't too shabby!

One other thing I forgot to mention that got me a huge speed boost was to do my iteration over the betas instead of the draws for the simulated log-likelihood. That is, when I pre-compute the $(X_{nt} - X_{n*})$ term, I also include the "standard" draws in that pre-computation and set everything up for each beta. Then later when I'm computing the LL probabilities, I don't have to write a loop to iterate over R (the draws) but rather I iterate over the number of betas. So my code looks a little strange as it doesn't exactly line up with how the equations are written. You can see how I do my pre-computes here
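Roughly, in NumPy-style pseudocode (just to illustrate the looping structure, not my actual R code), the idea is:

```python
import numpy as np

# Illustration only: loop over the K betas instead of the R draws when
# building the simulated utilities; K is typically far smaller than R.
N, K, R = 1000, 3, 200
x_diff = np.random.rand(N, K)        # pre-computed (X_nt - X_n*) columns
beta_draws = np.random.rand(K, R)    # R shifted draws for each coefficient

V = np.zeros((N, R))
for k in range(K):
    # each iteration broadcasts over all R draws at once
    V += np.outer(x_diff[:, k], beta_draws[k])
# (equivalent to the single matrix product x_diff @ beta_draws)
```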

As for working out this panel data part, I remember this one being particularly tricky. Looking through my code for this isn't immediately helpful as I reorganized a lot of the calculations to pre-calculate as much as I could. The relevant parts in my code are here and here. For the first part, logitDrawsPanel <- exp(rowsum(log(logitDraws), d$panelID)), this is a faster way of computing what you've highlighted in red. rowsum() is a fast function in R that allows you to do grouped sums by row, so I'm summing across rows and grouping by the panelID. If I take the log first, then do those row sums, then exponentiate that, it's the same as doing a grouped multiplication. The second code chunk I linked to is where I compute the gradients differently based on the panel structure or not. Sorry this may not be super helpful, but I'm on the road. I'll get back to this Monday if I can.

jhelvy avatar May 14 '22 13:05 jhelvy

Yes, I am quite happy with the 40% speed gains. Thank you for sharing your valuable formulation to make it faster.

In xlogit, I do not iterate over individuals, betas, or draws. Instead, I express everything as multidimensional array operations and then figure out the lowest-cost contraction for the common matrix operations. I think this helps speed up the computations.
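For instance, a toy illustration of that kind of array formulation (not the actual xlogit code) is a single contraction that yields the utilities for all individuals, alternatives, and draws at once:

```python
import numpy as np

# Toy illustration (not the actual xlogit code): one contraction computes the
# utilities for all individuals (n), alternatives (j), and draws (r) at once.
N, J, K, R = 500, 4, 3, 200
X = np.random.rand(N, J, K)              # attributes
beta_draws = np.random.rand(K, R)        # shifted coefficient draws

# optimize=True lets NumPy search for a low-cost contraction order
V = np.einsum('njk,kr->njr', X, beta_draws, optimize=True)   # shape (N, J, R)
```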

Thank you for the explanation of the panel data part. I will try to make some room this week to implement it. I will let you know how it goes. After fully integrating your formulation, it should be relatively easy to add the WTP functionality.
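For the grouped product across each panel's choice situations, I am thinking of something along the following lines in NumPy (just a sketch, with a hypothetical panel_ids array):

```python
import numpy as np

# Sketch only: prob_draws has one row per choice situation and one column per
# draw; panel_ids maps each choice situation to its individual (panel).
prob_draws = np.random.rand(6, 4)          # 6 choice situations, 4 draws
panel_ids = np.array([0, 0, 1, 1, 1, 2])   # 3 individuals

# grouped product = exp of a grouped sum of logs (same trick as R's rowsum)
panel_log_p = np.zeros((panel_ids.max() + 1, prob_draws.shape[1]))
np.add.at(panel_log_p, panel_ids, np.log(prob_draws))
panel_prob = np.exp(panel_log_p)           # product of probs within each panel
```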

arteagac avatar May 16 '22 16:05 arteagac

Fantastic. Btw, I will eventually add the math from the Overleaf doc to the logitr documentation. I'm not sure if there's a convenient way in Python to mostly automate package documentation, but in R there's a great package called pkgdown that does most of the work for you. If there's something like this for Python, it would be a great benefit to add and would further publicize the package. If I get time this summer I might try to look it up myself, as I'd like to learn more about Python packages.

jhelvy avatar May 16 '22 17:05 jhelvy

Yes! In Python, I use Sphinx to automatically generate the documentation. Also, I use Jupyter notebooks to create examples and then convert these to web pages using Sphinx. For instance, see xlogit's documentation website below:

https://xlogit.readthedocs.io/en/latest/notebooks/mixed_logit_model.html

arteagac avatar May 16 '22 17:05 arteagac

Perfect. You should put that link in the main page under the "about" section!

jhelvy avatar May 16 '22 17:05 jhelvy

Currently, the main page has the following links (see screenshot below) to the documentation website, but I agree that it is a good idea to additionally add an "About" section with a more explicit documentation link.

[screenshot: documentation links on the repository's main page]

arteagac avatar May 16 '22 17:05 arteagac

Oh I meant under the main "about" section on the github page, not the readme, like how I have it on the logitr page:

[link: logitr repository page]

jhelvy avatar May 16 '22 18:05 jhelvy

Oh I see. Yes, it is a great idea. I just added it. Thanks for the suggestion.

arteagac avatar May 16 '22 18:05 arteagac

Hi @jhelvy, I managed to get the panel data part to work. Now I am having a little trouble estimating fixed (non-random) coefficients, so I figured I would ask in case you can help me figure it out faster. For instance, assume that in the Yogurt dataset I want feat and brand as random coefficients, but I want price as a fixed coefficient. Which of the following gradient functions would you use for price?

[image: candidate gradient formulas, including the MixedLogit gradient]

I assume I have to use the gradient for MixedLogit, because even though the price coefficient is not random, the choice probability $P_{n*}$ involves computation across the random draws. Is that right? I tried implementing it this way, but the estimation fails to converge. Do you have any additional comments or recommendations regarding this?

arteagac avatar May 23 '22 17:05 arteagac

Yes, you have to use the mixed logit gradient. What I do in logitr is set up my "standard" draws (Halton draws by default) to just be all 0s for the fixed parameters. When I create the draws of the beta coefficients in mixed logit, I shift the standard draws like this: draw * sigma + mu. So if the draw is 0 for a fixed parameter, you end up with only the mu parameter (the fixed parameter). Here's the relevant function for shifting the draws. Does this make sense?
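In rough NumPy-style pseudocode (an illustration of the idea, not the actual logitr code):

```python
import numpy as np

# Illustration: 3 coefficients where the 3rd one (e.g., price) is fixed,
# and R = 5 draws per coefficient.
mu = np.array([0.5, -0.2, -1.0])       # means (a fixed coef is just its mu)
sigma = np.array([1.0, 0.8, 0.0])      # sigma is irrelevant for the fixed coef
R = 5

std_draws = np.random.standard_normal((3, R))
std_draws[2, :] = 0.0                  # zero out the draws for the fixed coef

beta_draws = mu[:, None] + sigma[:, None] * std_draws   # draw * sigma + mu
# The fixed coefficient's row is just mu[2] repeated R times, so the mixed
# logit probability and gradient formulas apply to it unchanged.
```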

jhelvy avatar May 24 '22 01:05 jhelvy

I see. Does this mean that you use the same gradient computations for fixed and random coefficients, with the only difference being that you set the random draws to zero for the fixed coefficients? I tried this approach too, but I am still struggling to reach convergence. Perhaps it is something else in my code. The fixed-coefficients part is supposed to be the easiest one, but it is giving me the hardest time :sweat_smile:. I will get back to you once I figure it out.

arteagac avatar May 24 '22 17:05 arteagac

Yes, if there are any random parameters, then I use the mixed logit gradient for all parameters, with zeros for the draws of the fixed coefficients. The math should work out. For example, if you had 100 draws for each of the parameters, then you'll end up with 100 copies of the same number for each of the fixed parameters (just the mu term, no sigma), and 100 different numbers from the corresponding distributions for the random parameters (draw * sigma + mu). It's probably a small bug somewhere, because I know this approach works. Most of my bugs come from messing up the scaling somewhere (I scale all the data before doing the optimization). You could also try computing the gradient numerically to see if your analytic gradient produces the same result, to narrow in on the source of the issue.

jhelvy avatar May 24 '22 19:05 jhelvy

Yes, I am actually surprised too because the math works and makes total sense, so I also suspect it's some weird bug. Yes, by looking at the gradients, I can see that those for the fixed coefficients are excessively high, so I might be missing some scaling steps. I will keep trying and keep you posted.
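The numerical check you suggested is easy to do with SciPy; roughly like this (the two functions below are just placeholders for my actual ones):

```python
import numpy as np
from scipy.optimize import check_grad

# Toy stand-ins for the actual negative log-likelihood and analytic gradient
def neg_loglik(betas):
    return np.sum(betas ** 2)

def analytic_grad(betas):
    return 2 * betas

err = check_grad(neg_loglik, analytic_grad, np.array([0.5, -1.0, 2.0]))
print(err)  # close to zero when the analytic gradient matches the numerical one
```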

arteagac avatar May 24 '22 21:05 arteagac

I finally figured it out. The issue was that, while computing the gradient, I was doing the operation $V_{nj} - V_{n*}$ separately for the fixed and random coefficients. However, that operation needs to include both types of coefficients. Now it works like a charm. I will clean up the code and release a new version with the updated formulation in the next few days.
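In other words, the utilities have to be assembled from both parts before differencing, roughly like this (illustrative shapes only, not the actual xlogit code):

```python
import numpy as np

# Illustrative shapes only (not the actual xlogit code)
N, J, R = 4, 3, 5                          # individuals, alternatives, draws
X_fixed = np.random.rand(N, J, 1)          # e.g., price column
X_rand = np.random.rand(N, J, 2)           # e.g., feat and brand columns
beta_fixed = np.array([-1.0])
beta_draws = np.random.rand(2, R)          # shifted draws for random coefs
chosen_idx = np.zeros(N, dtype=int)        # index of the chosen alternative

# The full utility includes BOTH the fixed and the random parts
V = (X_fixed @ beta_fixed)[:, :, None] \
    + np.einsum('njk,kr->njr', X_rand, beta_draws)       # shape (N, J, R)

# Only then take V_nj - V_n* against the chosen alternative
V_chosen = np.take_along_axis(V, chosen_idx[:, None, None], axis=1)
V_diff = V - V_chosen
```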

arteagac avatar May 27 '22 08:05 arteagac

Woohoo! That's great news! I figured it was a bug somewhere.

Btw, do you create your draws of the coefficients the same way? As I said before, I create Halton draws and then shift them into a normal with draw * sigma + mu, or exp(draw * sigma + mu) for log-normal. I want to add support for other distributions in logitr, but haven't gotten around to it yet. I assume a similar procedure can be used for truncated normal, triangular, etc., but I haven't sat down to figure it out yet. If you've already worked this out, it would certainly help me implement it faster in logitr. As this whole exercise shows, there are always different ways to do something, and oftentimes the faster approach is less intuitive.

jhelvy avatar May 27 '22 10:05 jhelvy

Yes, fortunately, it was easy to fix. And yes, I also create and shift the Halton draws using the draws * sigma + mu strategy. Adding extra distributions to logitr should be very easy, as you simply need to use the formulas below to shift the draws to the desired distribution. Just keep in mind that this also affects the derivative (gradient), but you can find all the formulas in the xlogit paper, and I would be more than glad to discuss further or answer any questions you may have.

[image: formulas for shifting the draws to different coefficient distributions]
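For example, the normal and log-normal cases are just the following (a sketch with made-up numbers; other distributions only change this transform and, correspondingly, the gradient term):

```python
import numpy as np

# Sketch with made-up numbers: shift standard-normal (Halton) draws into the
# target distribution of the coefficient.
std_draws = np.random.standard_normal(1000)
mu, sigma = -0.5, 0.8

betas_normal = mu + sigma * std_draws             # normal coefficient
betas_lognormal = np.exp(mu + sigma * std_draws)  # log-normal coefficient
```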

arteagac avatar May 27 '22 11:05 arteagac

Woo hoo, @arteagac ! Glad to see it's all working. I look forward to testing out this new feature.

crforsythe avatar May 27 '22 15:05 crforsythe

Hi @crforsythe and @jhelvy,

I just released the version of xlogit that includes John's formulation. The speed gains are significant, as described in the email I sent you. I plan to update the benchmarks once we finish the WTP implementation, which is the next item on my to-do list. I will keep you both posted about this.

arteagac avatar Jun 01 '22 08:06 arteagac

I think I got the first version of the WTP functionality to work. The only issue is that my BFGS implementation can't reach convergence, but SciPy's L-BFGS-B does, and the final coefficients and log-likelihood are very similar to logitr's. I have some questions for you guys:

  1. What start values do you use for the lambda coefficient?
  2. Is it normal that standard BFGS does not work in this type of non-linear estimation but L-BFGS-B does work? Do you think we need to permanently use L-BFGS-B for WTP models?

arteagac avatar Jun 01 '22 11:06 arteagac

I always start lambda at 1. Anything else makes it much harder to converge as everything is scaled by lambda, so 1 makes for a more reliable starting point.

I use the "Low-storage BFGS" from NLopt (via the nloptr package) for all models, which isn't bounded. Your routine should probably work if you start with lambda = 1.

jhelvy avatar Jun 01 '22 12:06 jhelvy

Perfect, I also used lambda = 1. I suspect that standard BFGS will not work for this type of non-linear estimation. I tested both my implementation and SciPy's implementation of BFGS, but neither of them managed to reach convergence. On the other hand, SciPy's L-BFGS-B (the same low-storage BFGS that you use) always reaches convergence. I think I will either provide a mechanism for the user to select the optimization routine or set L-BFGS-B as the default for WTP models. I prefer not to use L-BFGS-B for all models, as it makes the estimation much slower.

arteagac avatar Jun 01 '22 18:06 arteagac

Yeah, this is a tradeoff. WTP space models have a harder time converging in general because of the scale factor. I think the search basically "bounces around" a lot and is highly sensitive to the gradient on the lambda term. This was one of the motivations for including a multistart in logitr, as oftentimes I'd have to try 50 different sets of starting points to get 10 or so that converge. But even in the multistart search I still always use 1 for lambda.

It could work for now to just set the defaults to use your BFGS but switch to L-BFGS-B for the WTP space models.

You could also try using NLopt for python. I don't know about the scipy algorithm but I know that NLopt allows you to specify a single function that returns both the objective value and gradient, so you can still share a lot of the same calculations in each iteration for speed. This is what I do in logitr. And NLopt is pretty well-established and maintained so I'm not too concerned about it as a dependency.

jhelvy avatar Jun 01 '22 19:06 jhelvy

Yes, SciPy's optimizer also offers the option of passing a single function that returns both the objective value and the gradient, which greatly reduces computation. I will take a look at NLopt for Python in case SciPy's routine struggles to converge.
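The pattern I use is roughly the following (with a placeholder objective):

```python
import numpy as np
from scipy.optimize import minimize

# With jac=True, SciPy expects a single function returning both the objective
# value and its gradient, so shared computations happen only once per iteration.
def neg_loglik_and_grad(betas):
    value = np.sum(betas ** 2)   # placeholder for the negative log-likelihood
    grad = 2 * betas             # placeholder for the analytic gradient
    return value, grad

res = minimize(neg_loglik_and_grad, x0=np.ones(3), jac=True, method='L-BFGS-B')
```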

I also wanted to ask you and @crforsythe whether you have any suggestions about the naming of the parameters and functionality. In a previous conversation with @crforsythe, I mentioned that it would be good to keep the naming in a way that is not hardcoded for WTP models only, but that also works for other similar models (willingness to wait, for instance). Therefore, we thought that perhaps we could name the price input scaling_factor. Also, I was wondering whether we should call this functionality "WTP space" or if there is a more generic name that may apply to other types of models with similar utility specifications. Do you have any suggestions about the naming of the parameters and the functionality?

arteagac avatar Jun 01 '22 19:06 arteagac

Yes, I agree it might be better to avoid hard-coding the "willingness to pay" language, as it is just one way to scale the utility (though it's by far the most common). "WTP" does have a large body of literature behind it, and pretty much every paper that does this sort of scaled utility refers to estimation in the "WTP space". But I like calling the argument scaling_factor, as it leaves the door open to other types of scaling. It must be a continuous variable though, so price and time make sense.

jhelvy avatar Jun 01 '22 19:06 jhelvy

Perfect. I have an additional question. To provide extra flexibility, should we allow the scaling_factor and the lambda multiplier to be used independently, or should they be tied together? For instance, some people may want to scale their utility function but not include a lambda coefficient. At the same time, some people may want the lambda coefficient without scaling the utility. This could be achieved by having two parameters: scaling_factor and lambda_coeff (True or False).

arteagac avatar Jun 02 '22 18:06 arteagac

I'm not sure I understand the question. For a WTP space model, the utility model is by definition scaled by the lambda coefficient:

$$u_j = \lambda ( \omega X_j - p_j) + \varepsilon_j$$

So whatever you use for $p_j$ is the variable for the scaling_factor. For WTP space, you would set the scaling_factor = "price". But you could just set it to some other variable, like "time", and the interpretation of the coefficients would be in units of time instead of currency.
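Numerically, the deterministic part of that utility is just (a sketch with made-up numbers):

```python
import numpy as np

# Sketch with made-up numbers for u_j = lambda * (omega * X_j - p_j)
lam = 1.2                                  # scale coefficient (lambda)
omega = np.array([0.8, -0.3])              # WTP-space coefficients
X_j = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])               # non-price attributes, 3 alternatives
p_j = np.array([2.0, 3.5, 4.0])            # the scaling variable (e.g., price)

V_j = lam * (X_j @ omega - p_j)            # deterministic utility per alternative
```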

jhelvy avatar Jun 02 '22 18:06 jhelvy