
Best way to represent a feature that is a variable-length vector of integers

Open: brandon-holt opened this issue 1 year ago • 8 comments

Hi,

I am wondering what is the best way to encode a feature that is a variable-length vector of integers. Would this work out of the box?

If not, the way I would naturally do this is to find the datapoint with the longest vector for this feature, let's call this length N. Then I would define N features, where each element in the vector is a different feature. For datapoints with len(vector) < N, all features between len(vector) and N would be assigned a value of 0.
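
A minimal sketch of the padding I have in mind, with made-up data:

```python
# Variable-length integer features for two datapoints (toy example).
vectors = [[7, 1, 4], [3, 3, 6, 2, 5]]
N = max(len(v) for v in vectors)                    # longest vector in the dataset
padded = [v + [0] * (N - len(v)) for v in vectors]  # pad shorter vectors with zeros
# -> [[7, 1, 4, 0, 0], [3, 3, 6, 2, 5]]
```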

This seems not ideal though, since the number and identity of the features depend on the data, which will change over time.

Is there a better way to do this, or is specific accommodation of this on the roadmap anywhere? Thanks!!

brandon-holt avatar May 08 '24 16:05 brandon-holt

Hey @brandon-holt, seems like an interesting use case you have. While I do understand what you're trying to achieve, could you perhaps provide some context on what it is you're trying to represent? That is, why do you run into the variable-length issue in the first place?

In terms of functionality: I think in principle this could be handled well via the standard GP approach, provided you can come up with a reasonable kernel for your situation that computes similarities between these vectors of different sizes. Of course, we'd probably need to extend the backend code a bit to ensure that the variable-length data is properly passed down to the GP model, but I'd expect that to be doable 🤔 However, not immediately so, and I'd be happy to brainstorm once I understand the potential impact and context a bit better ✌️

AdrianSosic avatar May 08 '24 16:05 AdrianSosic

Just a quick follow-up: I had a few more thoughts and can imagine that the mechanism can be rigorously implemented even with our current backend code, without much work actually. But before I promise too much, let's first hear what you have to say about the context 😄

AdrianSosic avatar May 09 '24 12:05 AdrianSosic

@AdrianSosic Oh that's great news!!

So for context, an important feature of the objects we are modeling is the structure, which is composed of sequential units of varying lengths. So for example, one object A might have 3 units (e.g., [5, 2, 9]) and another, B, might have 6 (e.g., [2, 2, 8, 4, 3, 1]), where the values in the list represent the length of each unit.

So, following my message above, if I were forced to move away from the vector representation, here is how I would do it for this example dataset:

| Object | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Feature 6 |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| A      | 5         | 2         | 9         | 0         | 0         | 0         |
| B      | 2         | 2         | 8         | 4         | 3         | 1         |

As I mentioned, the added 0's aren't technically wrong, but I don't like that the number of features and their identity will change depending on the specific dataset being used. It also just seems clunky haha.

brandon-holt avatar May 09 '24 19:05 brandon-holt

I see, thanks for the explanation. So I think the question of whether there is a more elegant way than appending zeros pretty much boils down to whether you are able to define pairwise similarities between your structures of unequal length. That is, would you be able to compute a meaningful similarity between, say, [7, 9, 2] and [1, 1, 8, 8, 3]?

I could imagine that for your particular setting (since you're attempting to compare "structures"), one could even simply resort to graph kernels for the computation. If not, I'm confident that there are other concepts out there that could be adopted. I do need to admit, though, that I don't have a good overview of non-vector-space kernels, since I've never worked with them before. I'd recommend starting with a brief literature review 🙃

AdrianSosic avatar May 10 '24 06:05 AdrianSosic

Not going into the special-kernels direction, but into another one: in a sense, this is similar to a slot-based mixture representation without invariance.

For mixtures: enable anything between 3 and 6 solvents, where their order doesn't matter.

For your case: enable anything between 3 and 6 units, where their order matters (so no invariance constraint).

The aspect of variable length can be expressed with a DiscreteDependenciesConstraint, but the full incorporation of that info is somewhat incomplete. For instance, you can already specify that some parameter only matters if another one has certain values. You can use this to say "Slot_4_features" is only relevant if "Slot_4_active" is "1" or so - this is possible already now. It will remove search space entries that are irrelevant due to this and keep only one entry (i.e., only one of the possible values for "Slot_4_features"), but currently the redundancy info is not reflected in the kernels.
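
A rough sketch of what such a dependency could look like (the slot parameter names here are made up, and the exact constraint signature should be double-checked against the user guide):

```python
from baybe.constraints import DiscreteDependenciesConstraint, SubSelectionCondition

# Hypothetical slot setup: "Slot_4_features" is only relevant if "Slot_4_active" is "1".
constraint = DiscreteDependenciesConstraint(
    parameters=["Slot_4_active"],                         # the "switch" parameters
    conditions=[SubSelectionCondition(selection=["1"])],  # values for which the slot counts as active
    affected_parameters=[["Slot_4_features"]],            # parameters governed by each switch
)
```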

Another thing that would need to be added is the info that the representation (Slot_1_active, Slot_2_active, Slot_3_active) = (1, 0, 1) is equivalent to (1, 1, 0). This needs to be done both for data augmentation and for the kernels.

@AdrianSosic we already discussed something along these lines with @mhrmsn, and I suspect solving it for his use case could also solve it here. Overall, though, I must say that non-fixed-length representations such as these slotted ones are rather complex, and anything we might build for them will take time.

Scienfitz avatar May 11 '24 18:05 Scienfitz

Hey @Scienfitz, good idea with the constraint approach, I could imagine that one could cook something up in that direction, too 👍

However, I see two significant advantages of the kernel idea:

  1. Unlike in the constraint approach, the variable-length property would be intrinsically part of the surrogate model (just like we discussed already many times for the permutation invariance constraint)
  2. I think if @brandon-holt were able to provide an appropriate kernel, the mechanism could be implemented even with today's code, now that the kernel PRs have been merged, requiring no additional backend changes.

But let's see if @brandon-holt can provide such a kernel before we start diving into it 🙃

AdrianSosic avatar May 11 '24 18:05 AdrianSosic

@AdrianSosic So I did some quick research and found some kernels that may be worth considering. Let me know if you think any of these would be viable:

Dynamic Time Warping kernel - https://github.com/pfmarteau/KDTW

Graph kernel - https://github.com/ysig/GraKeL

Regularized String kernel - https://github.com/pfmarteau/KLevenshtein

brandon-holt avatar May 28 '24 13:05 brandon-holt

Hi @brandon-holt, thanks for sharing. I think the graph kernel repo you shared seems promising (not so convinced about the other two, though). Nonetheless, it probably helps if I rephrase my thoughts a little to get us on the same page. When I say you could use graph kernels, we need to distinguish:

  • The technical feasibility: Mathematically, the requirement we have is to compute a "similarity/distance" between two vectors of different dimensionality. If you treat your vector as a graph, then graph kernels can do this job for you, no doubt here.
  • The use case context: The above is only the mathematical perspective. However, I can't really answer whether or not it makes sense for your application to define a distance in this way – in particular because I don't know how the graph kernels you shared work in detail. This is something you need to answer instead, I'm afraid.

Still, I can outline the steps by which I see these kernels being integrated into the framework:

  1. First, you'd need to make the kernels accessible to baybe in some way. Because the repos you shared are built for scikit-learn, they are not directly compatible, but there are three ways to get things to work:
    • Find a gpytorch-compatible version of the kernels.
    • Write the gpytorch version yourself. This is in fact rather easy: https://docs.gpytorch.ai/en/latest/examples/00_Basic_Usage/Implementing_a_custom_Kernel.html
    • Implement a baybe surrogate that works with scikit-learn instead of gpytorch. There are already some examples of this in our code base, and the process will soon become even simpler because I'm planning to add a SurrogateProtocol that declares exactly what needs to be done.
  2. Once the kernel is ready, we need to somehow propagate the variable-length vectors to it. An easy way to achieve this (see the sketch after this list) is:
    • Use a custom parameter encoding where you first pad the vectors with zeros, just as you do it currently.
    • Additionally store the length of each vector in its encoding, e.g. by placing the corresponding integer number in front.
    • Wrap the original kernel such that the wrapper first reads the vector length from the encoding, then extracts the variable-length part and passes only that to the original kernel.
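
To make step 2 a bit more concrete, here is a rough sketch (plain gpytorch, nothing BayBE-specific) of the length-prefixed encoding together with a wrapper kernel that unpacks it before delegating to the actual similarity computation. The toy_similarity function below is only a stand-in for whatever measure (DTW, graph, string kernel, ...) you end up choosing:

```python
import torch
import gpytorch


def encode(vector: list[int], max_len: int) -> list[int]:
    """Length-prefixed, zero-padded encoding: [len(v), v_1, ..., v_N]."""
    return [len(vector)] + vector + [0] * (max_len - len(vector))


def toy_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Placeholder similarity between two variable-length vectors.

    Replace this with the actual measure you want to use; here we just compare
    the common prefix and penalize the length difference.
    """
    n = min(len(a), len(b))
    sq_dist = ((a[:n].float() - b[:n].float()) ** 2).sum() + (len(a) - len(b)) ** 2
    return torch.exp(-sq_dist)


class VariableLengthKernel(gpytorch.kernels.Kernel):
    """Wrapper kernel that unpacks the length-prefixed encoding before comparing.

    For simplicity, this sketch only handles plain 2-D inputs (no batch dims).
    """

    def forward(self, x1, x2, diag=False, **params):
        rows = []
        for a in x1:
            len_a = int(a[0].item())  # stored vector length of the first argument
            cols = [
                toy_similarity(a[1 : 1 + len_a], b[1 : 1 + int(b[0].item())])
                for b in x2
            ]
            rows.append(torch.stack(cols))
        covar = torch.stack(rows)  # shape (n, m)
        return covar.diagonal(dim1=-2, dim2=-1) if diag else covar
```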

What do you think, does that sound feasible to do?

AdrianSosic avatar Jun 03 '24 08:06 AdrianSosic

Since there has not been an update here for quite some time: what is the status?

AVHopp avatar Sep 03 '24 13:09 AVHopp

Closing this issue since there has been no interaction for over 2 months. Feel free to reopen.

AVHopp avatar Sep 19 '24 06:09 AVHopp

Hi @brandon-holt, since this is now closed until further notice, I just wanted to briefly restate that I think my sketched approach is easy to implement, especially with the redesigned Surrogate architecture in 0.11.0, which lets you straightforwardly define your own models. So as long as you can come up with a reasonable proximity measure for two vectors of different lengths that fits your application, the length-based encoding I described, in combination with a slim surrogate wrapper, would quickly bring you to your goal.
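
For reference, reusing the encode helper and the VariableLengthKernel sketch from my earlier comment, a quick sanity check on your two example structures would look roughly like this (calling forward directly just for illustration):

```python
import torch

# Length-prefixed, zero-padded encodings of [5, 2, 9] and [2, 2, 8, 4, 3, 1].
enc_A = encode([5, 2, 9], max_len=6)
enc_B = encode([2, 2, 8, 4, 3, 1], max_len=6)
X = torch.tensor([enc_A, enc_B], dtype=torch.float64)

kernel = VariableLengthKernel()
K = kernel.forward(X, X)  # 2x2 similarity matrix between A and B
print(K)
```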

AdrianSosic avatar Sep 20 '24 15:09 AdrianSosic