
[FEATURE REQUEST]: Pool-based learning within Ax (i.e. allow Ax to sample from a pre-selected finite set of feasible arms only)

Open CompRhys opened this issue 7 months ago • 10 comments

Question

In many applications in materials science we can end up with discrete design spaces (a canonical example being the SMILES strings for molecules in a catalogue such as Enamine REAL).

Currently, for software engineering purposes, I want to try to stay within the Ax ecosystem to keep the interfaces that other software engineers need to consider consistent. Given that, the only way I can see to run a pool-based screening campaign would be to swap to using an ExternalGenerationNode and make the screening library something that is entirely encapsulated within that node.

If the pool is just a discretization of a continuous space then this is all fine, but in the example given, with string serializations of molecules, the search space cannot be defined with the current Parameter types. In this instance, adding an unconstrained StringParameter would allow this pattern.

Are there any other considerations for doing pool-based learning within Ax that I might have missed? This probably isn't an intended usage pattern, but it would be nice to be able to do it.

Code of Conduct

  • [x] I agree to follow Ax's Code of Conduct

CompRhys avatar Apr 21 '25 17:04 CompRhys

Hi @CompRhys -- I think you're right that this will be a tricky usage pattern to implement, but there may be something we can do. I'm not quite sure I understand the structure of this problem; could you help by providing some examples of what the proposed solution might look like?

In this instance, adding an unconstrained StringParameter would allow this pattern.

Is the idea here that we could construct a 1D search space where the single dimension can take on any SMILES string, without necessarily capturing any information about order or "closeness" via the search space structure?

If this is true, and we don't actually need to capture any structure about the problem on the Ax side, a suitable (albeit hacky) solution might be to simply construct an integer range parameter with a large range and write a function f that maps from SMILES string to integer and vice versa. This would allow Ax to deal in ints: the ExternalGenerationNode would translate the int to a SMILES string, the screening library would operate on the SMILES string, and finally the ExternalGenerationNode would translate the SMILES string back to an int for Ax. Obviously this is not a great setup, but if I'm understanding the problem correctly it may help you get unblocked for now.
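
Roughly, a sketch of that node might look like the following (the pool contents and names are illustrative, and the ExternalGenerationNode import location has moved between Ax versions):

```python
import random

from ax.core.data import Data
from ax.core.experiment import Experiment
from ax.core.types import TParameterization
# In some Ax versions this lives under ax.modelbridge instead.
from ax.generation_strategy.external_generation_node import ExternalGenerationNode

SMILES_POOL = ["CCO", "c1ccccc1", "CC(=O)O"]  # stand-in for the real catalogue


class PoolScreeningNode(ExternalGenerationNode):
    """Ax only ever sees the integer index; the node deals in SMILES strings."""

    def __init__(self) -> None:
        super().__init__(node_name="pool_screening")
        self.observed: set[int] = set()

    def update_generator_state(self, experiment: Experiment, data: Data) -> None:
        # Track which pool indices have already been attached as arms.
        for arm in experiment.arms_by_name.values():
            self.observed.add(int(arm.parameters["smiles_idx"]))

    def get_next_candidate(
        self, pending_parameters: list[TParameterization]
    ) -> TParameterization:
        pending = {int(p["smiles_idx"]) for p in pending_parameters}
        remaining = [
            i for i in range(len(SMILES_POOL)) if i not in self.observed | pending
        ]
        # A real screening library would rank SMILES_POOL[i] for i in
        # remaining here; random choice is a placeholder for that step.
        return {"smiles_idx": random.choice(remaining)}
```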

mpolson64 avatar May 02 '25 18:05 mpolson64

This isn't pressing for me, but rather forward-looking as I try to make sure Ax will work for most of the problems I imagine the team might ask me for solutions to.

The more I think about the hacky solutions the less nice they feel, and at that point the right answer would be to just not use Ax. A more compelling pool-based setup might be a second Experiment class where all the allowed arms are pre-defined. The generation strategies would then do Thompson-esque sampling for large pools, or evaluate every point if there aren't a large number of options. That seems like a big lift, though.
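
To illustrate what I mean by Thompson-esque sampling over a pool, here is a rough sketch with a plain BoTorch GP; how the pool gets featurized into X_pool is assumed to happen elsewhere:

```python
import torch
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood


def thompson_pick(
    X_train: torch.Tensor,  # (n, d) featurized observed pool members
    y_train: torch.Tensor,  # (n, 1) observed outcomes
    X_pool: torch.Tensor,   # (m, d) featurized candidate pool
) -> int:
    """Fit a GP, draw one joint posterior sample over the pool, take the argmax."""
    gp = SingleTaskGP(X_train, y_train)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
    # One sample from the joint posterior over all pool points.
    draw = gp.posterior(X_pool).rsample(torch.Size([1])).squeeze()
    return int(draw.argmax())  # index of the pool member to evaluate next
```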

CompRhys avatar May 02 '25 20:05 CompRhys

The string parameter could still be useful for optimisation-based approaches, as there are generation strategies being pushed for text-serialisable problems such as molecular properties (SMILES) and protein design (amino-acid sequences) that combine a text-based generative model with surrogate-based ranking and acquisition function evaluation.

This again raises the question of whether this is something that makes sense to have in Ax, or whether more specialised packages serve these functions more directly. My survey led me to believe that Ax was the most compelling framework, and I do not believe these problems are really better managed in BoFire or BayBE, which are more chemistry/bio-focused higher-level frameworks on top of BoTorch.

CompRhys avatar May 02 '25 20:05 CompRhys

Is the idea here that we could construct a 1D search space where the single dimension can take on any SMILES string, without necessarily capturing any information about order or "closeness" via the search space structure?

To circle back: yes, this was the thought process, if you didn't want to have a categorical parameter where everything was enumerated up front.

CompRhys avatar May 03 '25 17:05 CompRhys

The string parameter could still be useful for optimisation-based approaches, as there are generation strategies being pushed for text-serialisable problems such as molecular properties (SMILES) and protein design (amino-acid sequences) that combine a text-based generative model with surrogate-based ranking and acquisition function evaluation.

This is an interesting thought. I guess in this case we wouldn't necessarily be in a pool-based setting, right? Many of these approaches are based on optimization in the latent space of some model, so conceivably such a setup could also generate new string representations that we wouldn't necessarily have considered. In that case we do have a problem, though, since we don't have a representation of a string parameter with non-enumerated values.

In the other setup, what is your concern with the following?

If you didn't want to have a categorical parameter where everything was enumerated up front.

Balandat avatar May 06 '25 05:05 Balandat

@CompRhys, these may be relevant (and it would help if you could clarify what does / doesn't match relative to these topics):

  • https://github.com/facebook/Ax/issues/706
  • https://github.com/facebook/Ax/issues/771

Trying to understand if pool-based learning is the same as what I typically refer to as optimizing over a set of predefined candidates. Likewise, I'm understanding this to be from a plumbing standpoint of how to featurize the SMILES strings (in the traditional ML sense of "feature") with whatever featurizer you choose, as opposed to something fancier like passing gradients directly / jointly training with a PyTorch-based latent-space model (e.g., similar to the VAE / MNIST / BO example and what Max referred to at the end of his comment).
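
For concreteness, by featurizing I mean something like the following sketch (assuming RDKit; the radius and bit length are arbitrary choices):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def featurize(smiles: str) -> np.ndarray:
    """Map a SMILES string to a fixed-length Morgan fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    arr = np.zeros((2048,), dtype=np.float64)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
```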

sgbaird avatar May 23 '25 03:05 sgbaird

Trying to understand if pool-based learning is the same as what I typically refer to as optimizing over a set of predefined candidates.

Yes, same idea

Likewise, I'm understanding this to be from a plumbing standpoint of how to featurize the SMILES strings (in the traditional ML sense of "feature") with whatever featurizer you choose, as opposed to something fancier like passing gradients directly / jointly training with a PyTorch-based latent-space model (e.g., similar to the VAE / MNIST / BO example).

So in principle Ax is written in such a way that, if I use an ExternalGenerationNode, I can do whatever I like inside that node so long as the parameter set returned can be expressed in terms of Ax parameter primitives. Inside that node I can do whatever I like, such as run the MNIST VAE; in special cases where we can write a clean BoTorch model, we can use the dispatcher logic inside Ax to add custom model training via `@fit_botorch_model.register(MyHomeBrewBotorchModel)`. This line of questioning is really asking how far it is reasonable to push the Ax API before an alternative framework should be used.

what Max referred to at the end of his comment

I don't think there are actually any issues with this approach for a pre-defined SMILES pool. However, if you wanted to start adding additional features but still restrict to a limited subset, it could quickly become limiting.

CompRhys avatar May 23 '25 13:05 CompRhys

This line of questioning is really asking how far it is reasonable to push the Ax API before an alternative framework should be used.

It seems to me that if using ExternalGenerationNode (EGN) works well enough, it's a good path as it lets you continue to leverage Ax as a consistent interface. We added it largely for this purpose, to let folks use other libraries / algorithms from within Ax.

Let us know if more help would be useful, @CompRhys!

lena-kashtelyan avatar May 28 '25 20:05 lena-kashtelyan

I think the main thing that might be thought of as missing, and that would allow more edge-case uses via the EGN, is an unconstrained StringParameter; with that you can do pretty much everything you might want via the EGN setup.

In terms of the original question, a solution does exist for certain pool-based setups like SMILES strings: just use a very large categorical set. If you have multiple inputs and the pool consists of a subset of the allowed combinations, then I think there's no great solution apart from defining the full space in the Ax search space and then having logic in the EGN that only considers the allowed subset (sketched below).
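
i.e., roughly this inside the EGN's candidate generation, where the Ax search space is the full cross-product but only whitelisted combinations are ever proposed (all names illustrative):

```python
import random

# Feasible subset of the molecule x solvent cross-product (illustrative).
ALLOWED: set[tuple[str, str]] = {
    ("CCO", "water"),
    ("c1ccccc1", "toluene"),
}


def next_allowed_candidate(observed: set[tuple[str, str]]) -> dict[str, str]:
    remaining = [combo for combo in ALLOWED if combo not in observed]
    smiles, solvent = random.choice(remaining)  # a real node would rank these
    return {"smiles": smiles, "solvent": solvent}
```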

I guess it makes sense to close this discussion and maybe open another issue specifically about the StringParameter?

CompRhys avatar May 28 '25 21:05 CompRhys

A choice parameter with every possible string makes sense to me for representing strings (unless the pool is prohibitively large). One could model the metrics in terms of the strings however one wanted with a custom model. For handling combinations of multiple categoricals, I think it would make sense to support passing in a whitelist of allowed combinations (arms) at generation time.
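
For the single-string case, a minimal sketch using the service API (the pool and metric name are illustrative, and client APIs differ across Ax versions):

```python
from ax.service.ax_client import AxClient, ObjectiveProperties

SMILES_POOL = ["CCO", "c1ccccc1", "CC(=O)O"]  # stand-in for the full catalogue

ax_client = AxClient()
ax_client.create_experiment(
    name="pool_screening",
    parameters=[
        {
            "name": "smiles",
            "type": "choice",
            "values": SMILES_POOL,
            "is_ordered": False,  # no order or "closeness" is implied
        }
    ],
    objectives={"activity": ObjectiveProperties(minimize=False)},
)
```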

sdaulton avatar May 29 '25 16:05 sdaulton

Perhaps related: https://github.com/experimental-design/bofire/discussions/635#discussioncomment-14571398 (e.g., using a genetic algorithm to optimize the acquisition function, which would presumably be part of an external generation node?)

sgbaird avatar Oct 03 '25 04:10 sgbaird