Questions regarding contextual GPs / LCEMGP

Open saitcakmak opened this issue 4 years ago • 17 comments

I want to use the contextual GPs, in particular the LCEMGP, for a project. The existing documentation, i.e., the docstrings, does not present any examples, which makes it challenging to understand how to use these models with all their features. A tutorial notebook would be very helpful for me and for others who may want to use these models in the future.

In particular, I'd like to understand how to use context_cat_feature, context_emb_feature and embs_dim_list, and the assumptions behind these. From the LCEMGP docstring, it seems like it is assumed that the contexts take values 0, ..., n_contexts-1, though it is never made explicit. Below are some thoughts & questions I had while going through the code.

Reading the source code, it becomes somewhat clear that the contexts are assumed to be integers (since it converts them to dtype=torch.long when setting self.all_tasks). Though, this may lead to a bug when (carelessly) using fractional context values as it later assigns context_cat_feature = all_tasks.unsqueeze(-1) where all_tasks still has the same dtype as the input. A simple input validation could alert the user to this mistake. This is also an issue with MultiTaskGP.
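The kind of input validation suggested above could look roughly like the following. This is a minimal plain-Python sketch rather than BoTorch code: `validate_context_indices` is a hypothetical helper, and the range check relies on the assumption (mentioned above, but not confirmed by the docstring) that contexts take values 0, ..., n_contexts-1.

```python
from typing import Iterable, List


def validate_context_indices(values: Iterable[float], n_contexts: int) -> List[int]:
    """Return the values as ints, or raise if any is not a valid context index.

    Assumption: valid contexts are the integers 0, ..., n_contexts - 1.
    """
    indices = []
    for v in values:
        # Reject fractional values (and NaN / inf) instead of silently truncating them.
        if not float(v).is_integer():
            raise ValueError(f"Context value {v} is not an integer.")
        idx = int(v)
        if not 0 <= idx < n_contexts:
            raise ValueError(f"Context index {idx} is outside 0..{n_contexts - 1}.")
        indices.append(idx)
    return indices
```

The same check could presumably be applied to the task column of the training inputs before it is cast to an integer dtype.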

Another thing I noticed is that task_feature is int (presumably due to it being a subclass of MultiTaskGP), which implies a single task / context variable, whereas the treatment of tasks within the constructor is written allowing for multiple task indices, e.g., the treatment of self.emb_dims. I guess this could purely be intended to support context_cat_features when k >= 1. In this case, if I have 2D contextual variables, let's say with 3 and 5 categories per dimension respectively, would I feed this into LCEMGP by first flattening the 2D contexts into a 1D variable with 15 categories, then use the context_cat_features to convert it back to 2D? Is it implemented this way to allow for subclassing MultiTaskGP, or am I missing something important here?

saitcakmak avatar Jan 20 '21 03:01 saitcakmak

A far-fetched addition: Is there anything preventing me from using LCEMGP (or some variant of it) with purely categorical inputs? I know LCEMGP will run into issues when x_basic is a 0-element tensor. If I were to implement a variant that doesn't have this issue, is there any reason that I shouldn't use such a model?

saitcakmak avatar Jan 20 '21 03:01 saitcakmak

I'll let @qingfeng10 answer your specific questions regarding the inputs and assumptions, but in the meantime it may be helpful to look at how the model is hooked up in Ax: https://github.com/facebook/Ax/blob/master/ax/models/torch/cbo_lcea.py

Though, this may lead to a bug when (carelessly) using fractional context values as it later assigns context_cat_feature = all_tasks.unsqueeze(-1) where all_tasks still has the same dtype as the input. A simple input validation could alert the user to this mistake. This is also an issue with MultiTaskGP.

Sure, it makes sense to add input validation here. Thanks for the suggestion.

On a higher level, the fact that task indices are a single integer is somewhat of a limitation of the existing MultiTaskGP. It would also be possible to set up an equivalent model that has multiple task features and then just defines two inputs to come from the same task if all their task features are equal. You could also define other task covariance structures, e.g., assuming additivity across the different task features and adding up the task covariances estimated for each. There is no particular reason to do things one way or another in terms of the data input as long as one can define a meaningful covariance between points.
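As a rough sketch of the two covariance structures described here, written for two task features. All symbols are notation introduced for illustration only: the product (ICM-style) form of the first line is an assumption about how inputs and tasks combine, B is an assumed positive semi-definite task covariance matrix, and n_2 is the number of categories of the second task feature.

```latex
% Assumed product structure between the input kernel and the task kernel
k\big((x, t), (x', t')\big) = k_x(x, x') \, k_{\mathrm{task}}(t, t'), \qquad t = (t_1, t_2)

% Option 1: two inputs share a task only if all task features are equal,
% so the task features are flattened into a single index
k_{\mathrm{task}}(t, t') = B_{\iota(t),\, \iota(t')}, \qquad \iota(t) = n_2 \, t_1 + t_2

% Option 2: additive structure across the task features
k_{\mathrm{task}}(t, t') = k_1(t_1, t_1') + k_2(t_2, t_2')
```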

Balandat avatar Jan 20 '21 05:01 Balandat

@saitcakmak thanks for your interest and suggestions! I will improve the documentation and add a BoTorch tutorial. In the meantime, here is a quick demo for LCEMGP based on your questions. Let me know whether it answers them or if you have further questions.

LCEMGP_demo.zip

Adding more detail on some of the questions here.

I guess this could purely be intended to support context_cat_features when k >= 1. In this case, if I have 2D contextual variables, let's say with 3 and 5 categories per dimension respectively, would I feed this into LCEMGP by first flattening the 2D contexts into a 1D variable with 15 categories, then use the context_cat_features to convert it back to 2D?

Contexts are the cross product of the context categorical features, and task_feature holds the context indices. Using the 2D example you mentioned, task_feature takes integer values from 0 to 14, with context_cat_features being [[0, 0], [0, 1], [0, 2], [0, 3], [0, 4], [1, 0], ..., [2, 4]]. The embedding layer will map each of the two categorical features to a 1-d embedding (we use emb_dims = 1 for each context cat feature by default).

You can also flatten the 2D contexts into 1D before feeding them into LCEMGP (you just need to keep the mapping and do the conversion yourself). But one advantage of passing the 2D features is that they provide additional information for the embedding learning inside LCEMGP (the same idea as training entity embeddings in neural networks).
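The bookkeeping between flat context indices and the 2D categorical features can be sketched in plain Python as follows. This only illustrates the mapping for the 3 x 5 example; the names `n_categories`, `context_to_index`, and `flatten_context` are hypothetical, and how the resulting table is passed to the model is not shown here.

```python
from itertools import product

# Category counts of the two context features (3 and 5, as in the example).
n_categories = (3, 5)

# Row i holds the categorical feature values of context i:
# [[0, 0], [0, 1], ..., [0, 4], [1, 0], ..., [2, 4]]
context_cat_feature = [
    list(combo) for combo in product(*(range(n) for n in n_categories))
]

# Reverse lookup: 2D context -> flat context index in 0..14.
context_to_index = {tuple(row): i for i, row in enumerate(context_cat_feature)}


def flatten_context(c0: int, c1: int) -> int:
    """Return the flat context index for the 2D context (c0, c1)."""
    return context_to_index[(c0, c1)]
```

With this ordering, the flat index of (c0, c1) equals 5 * c0 + c1, so the task column of the training data would contain `flatten_context(c0, c1)` for each observation.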

A far-fetched addition: Is there anything preventing me from using LCEMGP (or some variant of it) with purely categorical inputs?

Besides the issue of x_basic being a 0-element tensor, there is no blocker. I'm interested in knowing more about your application. It sounds like that case has no continuous parameters, only categorical ones.

qingfeng10 avatar Jan 20 '21 07:01 qingfeng10

Thanks for the quick response and the tutorial! This clarifies how to feed the contexts into the model and how to deal with multi-dimensional categorical variables.

I'm interested in knowing more about your application. It sounds like that case has no continuous parameters, only categorical ones.

I am working on a variant of the contextual bandit problem, where I have a finite number of arms (categorical) and a finite or infinite number of contexts. The purely categorical scenario would support the setting where the contexts are also categorical variables.

saitcakmak avatar Jan 20 '21 15:01 saitcakmak

I see! Thanks for the info. That makes sense and should be pretty doable in LCEMGP. I can put up a PR for this if needed!

qingfeng10 avatar Jan 20 '21 16:01 qingfeng10

I can put up a PR for this if needed!

If you think the purely categorical setting is of broader interest, this could be useful. Otherwise, I can just modify the model offline.

I actually ran into another issue with LCEMGP. I am planning to use a custom look-ahead acquisition function, so I called model.fantasize(X, sampler) on LCEMGP. This defaults to Model.fantasize(...) (https://github.com/pytorch/botorch/blob/026fd652bb9d6a52fbbe50e7e40840c0df14199d/botorch/models/model.py#L136-L140), which calls posterior(X, ...) with an X that includes the task index. The posterior of MultiTaskGP expects X without the task index and a separate output_indices argument, so this leads to an error there. I tried the condition_on_observations part with dummy input, and that seems to work fine. So, I think all that is needed here is to override the fantasize method in MultiTaskGP and separate out the task index before calling the posterior. I could put up a PR for this if that sounds like a good plan.

saitcakmak avatar Jan 20 '21 16:01 saitcakmak

Thanks for calling this out and putting up the PR! I agree. We probably need to override fantasize in MultiTaskGPyTorchModel to allow passing the task index to posterior. cc @Balandat in case he has other thoughts! https://github.com/pytorch/botorch/blob/026fd652bb9d6a52fbbe50e7e40840c0df14199d/botorch/models/gpytorch.py#L572

qingfeng10 avatar Jan 20 '21 17:01 qingfeng10

Sorry about the whole fantasize mess yesterday. I flagged the wrong issue due to my own misunderstanding then tried to fix it by bending the functionality to fit my expectations. The core of the issue was that I was thinking of LCEMGP or more generally MultiTaskGP like a single output model, so expecting it to fantasize over a n x q x (d+1)-dim X, which includes the task index. But since it is multi-output, it is supposed to fantasize over a n x q x d-dim X, which does not include a task index, by generating fantasy observations over all m outputs.

If everything I wrote so far is correct, then there is still a bug in fantasize, this time raising an error in upstream code in ExactGP.get_fantasy_model(...). Minimal example below.

import torch
from botorch.models.contextual_multioutput import LCEMGP
from gpytorch.mlls.exact_marginal_log_likelihood import ExactMarginalLogLikelihood
from botorch.fit import fit_gpytorch_model
from botorch.sampling import IIDNormalSampler

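# Ten training points: one continuous feature plus an integer task index (0..9) in the last column.
train_X = torch.cat(
    [torch.rand(10, 1), torch.arange(0, 10).unsqueeze(-1)], dim=-1
)
train_Y = torch.randn(10, 1)

# task_feature=-1 marks the last column of train_X as the task index.
model = LCEMGP(train_X, train_Y, task_feature=-1)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)

# Fantasize at a d-dim point (no task column); this call raises the error below.
fant_x = torch.rand(1, 1)
model.fantasize(fant_x, IIDNormalSampler(5), observation_noise=False)

Traceback (most recent call last):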
  File "/Users/saitcakmak/PycharmProjects/botorch_test/lcemgp_fantasize_bug.py", line 17, in <module>
    model.fantasize(fant_x, IIDNormalSampler(5), observation_noise=False)
  File "/Users/saitcakmak/.conda/envs/botorch_test/lib/python3.8/site-packages/botorch/models/model.py", line 140, in fantasize
    return self.condition_on_observations(X=X, Y=Y_fantasized, **kwargs)
  File "/Users/saitcakmak/.conda/envs/botorch_test/lib/python3.8/site-packages/botorch/models/gpytorch.py", line 196, in condition_on_observations
    return self.get_fantasy_model(inputs=X, targets=Y, **kwargs)
  File "/Users/saitcakmak/.conda/envs/botorch_test/lib/python3.8/site-packages/gpytorch/models/exact_gp.py", line 178, in get_fantasy_model
    raise RuntimeError(
RuntimeError: Unsupported batch shapes: The target batch shape (torch.Size([5, 1])) must have either the same dimension as or one more dimension than the input batch shape (torch.Size([]))

saitcakmak avatar Jan 21 '21 14:01 saitcakmak

@saitcakmak apologies for missing this update completely! Let me take a look today!

qingfeng10 avatar Jan 25 '21 17:01 qingfeng10

No worries! I am no longer using the LCEMGP model, so there's no urgency on my end. I instead implemented a single output version of it, using the same latent embeddings for the categorical variables. It is more suitable for my setting as it allows adding observations one at a time (rather than observations at each task). I'd be happy to share / upstream it if there's interest.

saitcakmak avatar Jan 25 '21 17:01 saitcakmak

I have been using a derivative of LCEMGP and ran into some strange behavior. Increasing the size of the training set would sometimes lead to worse models. I suspect that this is due to the training of the embedding layer starting from the random initialization provided by torch.nn.Embedding. Depending on the initialization, the resulting embedding weights differ significantly, and this leads to some non-negligible variability in the resulting posterior mean and covariance. Here is a notebook demonstrating this on LCEMGP: https://colab.research.google.com/drive/1j70i40CSxlZCoJDo7yRJqMMNVNbWtFS0?usp=sharing

My understanding is that training the embedding is a non-convex optimization problem and that fit_gpytorch_model uses L-BFGS to fit it locally, so the random initialization leads to a different local optimum each time. My questions are:

  • Did you run into this issue, if so how did you deal with it?
  • What would be an efficient way of fitting the model globally, e.g., through multi-start optimization similar to optimize_acqf? I'm not sure what's the most efficient way of evaluating MLL using different (or a batch of) model parameters.

Any suggestion would be much appreciated! cc @Balandat, @qingfeng10

saitcakmak avatar Feb 09 '21 03:02 saitcakmak

I haven't personally run into this for this specific model, but I am not surprised that it happens. Even standard MLL optimization is a non-convex problem, and depending on the data you can get the same behavior (though it's relatively rare; L-BFGS-B seems to do a decent job if the parameters are initialized at the mode of their respective priors). Of course, the embedding will exacerbate things.

For how to deal with this: multi-start optimization makes sense. You could either just loop, or you could try to exploit hardware parallelism by constructing a batched model and optimizing the sum of the batched MLL (though unfortunately you'll likely run into this issue: https://github.com/cornellius-gp/gpytorch/issues/1318). Apart from that bug, one other consideration is that stacking all the optimization variables for the independent subproblems will lead L-BFGS-B in scipy to try to estimate a full Hessian (rather than a block diagonal one as would be the right thing to do).

So between these two challenges, it's probably easiest to re-run the fit from random restarts in a loop for now.
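The restart loop can be sketched generically as below. This is a toy, framework-free illustration under stated assumptions: `toy_negative_mll` is a made-up non-convex stand-in for the negative marginal log likelihood, and `fit_locally` uses plain gradient descent in place of the actual L-BFGS-B fit. In practice each iteration would build a freshly initialized model, fit it, and record its MLL.

```python
import math
import random


def toy_negative_mll(w: float) -> float:
    """Hypothetical non-convex loss with several local minima."""
    return math.sin(3.0 * w) + 0.1 * w * w


def toy_grad(w: float) -> float:
    """Derivative of toy_negative_mll."""
    return 3.0 * math.cos(3.0 * w) + 0.2 * w


def fit_locally(w0: float, lr: float = 0.01, steps: int = 2000) -> float:
    """Gradient descent from w0; a stand-in for one local model fit."""
    w = w0
    for _ in range(steps):
        w -= lr * toy_grad(w)
    return w


def multistart_fit(num_restarts: int, seed: int = 0):
    """Run independent local fits from random starts and keep the best one."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_restarts):
        w0 = rng.uniform(-5.0, 5.0)  # random initialization
        w = fit_locally(w0)
        results.append((toy_negative_mll(w), w))
    best_loss, best_w = min(results)  # lowest loss wins
    return best_loss, best_w, results


best_loss, best_w, results = multistart_fit(num_restarts=10)
```

Different starts end up in different local minima, which mirrors the variability described above; selecting the run with the lowest loss is the simplest way to reduce it.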

Balandat avatar Feb 09 '21 19:02 Balandat

Thanks @Balandat! I'll stick with the loop for now.

saitcakmak avatar Feb 09 '21 20:02 saitcakmak

Thanks @Balandat! For some cases, I found using "fully-bayesian" inference can be useful. Specifically, what I used was a Laplace approximation over the embedding weights.

I'd also like more info about "increasing the size of training". Is the number of tasks fixed, with just more training points per task?

qingfeng10 avatar Feb 09 '21 20:02 qingfeng10

I found using "fully-bayesian" inference can be useful.

I'll keep that in mind. I haven't experimented with fully-bayesian inference much but it sounds like a good alternative.

I'd also like more info about "increasing the size of training". Is the number of tasks fixed, with just more training points per task?

Yes, that's exactly the scenario. In my test code, I train the model on observations from a synthetic MVN instance. Sometimes, using 10 observations per task produces a better model than using 100 observations, where better means predicting the task with the largest posterior mean. To be fair, this is a positive probability event, but it was happening way too often. This particular test uses a single-task version of LCEMGP with only categorical inputs. I haven't tested for this in LCEMGP, so I don't know if it happens there as well. In my case, fitting the model several times in a loop and picking the best one helped with this issue.

saitcakmak avatar Feb 09 '21 21:02 saitcakmak

Sorry about the whole fantasize mess yesterday. I flagged the wrong issue due to my own misunderstanding then tried to fix it by bending the functionality to fit my expectations. The core of the issue was that I was thinking of LCEMGP or more generally MultiTaskGP like a single output model, so expecting it to fantasize over a n x q x (d+1)-dim X, which includes the task index. But since it is multi-output, it is supposed to fantasize over a n x q x d-dim X, which does not include a task index, by generating fantasy observations over all m outputs.

Hi @saitcakmak, what I imagined for a MultiTaskGP fantasize is the same as your "single output" version above, because this would be useful for multi-fidelity optimization where you want to fantasize only a specific task (i.e., a particular fidelity). But the version where we fantasize over all tasks at once also seems potentially useful when one wants to do some sort of lookahead while observing data from multiple sources. I think it would be nice to eventually have both types of fantasize. What do you think? (cc @Balandat)

danielrjiang avatar Feb 18 '21 18:02 danielrjiang

Hi @danielrjiang, I agree that the ability to fantasize over both a selection of tasks and all tasks at once would be useful. Currently, LCEMGP does not support either one: fantasizing over a single task does not work due to how things are handled in MultiTaskGP, and fantasizing over all tasks leads to the error seen in https://github.com/pytorch/botorch/issues/667#issuecomment-764693459.

Here is a sample notebook demonstrating the model I've been using: https://colab.research.google.com/drive/1fxNImWO9a1yJgPU-ot_cwB5dK4YRTYNp?usp=sharing
It is defined with the intention of being used as a single-task model, but it also supports the multi-task operations with some gymnastics.

saitcakmak avatar Feb 18 '21 19:02 saitcakmak