KernelFunctions.jl
Domain of kernel input
Just opening a separate thread following #14, #10 and #3, to talk about how to parametrize the kernels.
The idea would be to parametrize the kernel as Kernel{T,Transform} where T is the type of the input.
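For concreteness, a minimal sketch of what that parametrization might look like (all names beyond Kernel are illustrative, and the evaluation rule is an assumption for the sake of the example):

```julia
# Hypothetical sketch of the proposal: the input type T and the Transform
# are both part of every kernel's type.
abstract type Transform end
abstract type Kernel{T,Tr<:Transform} end

struct SqExponentialKernel{T,Tr<:Transform} <: Kernel{T,Tr}
    transform::Tr
end

# Evaluation would then be restricted to inputs of type T (the point
# debated below); this assumes transforms are callable.
(k::SqExponentialKernel{T})(x::T, y::T) where {T} =
    exp(-sum(abs2, k.transform(x) .- k.transform(y)) / 2)
```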
Here are some thoughts:
- It makes sense to define the input type in the kernel parameters, as I cannot imagine that the input type would change over time.
- Does it mean that we restrict the kernel to only work on those inputs? This could cause problems with the AD types.
- Should we also do it for `Transform`, as it is the first type to actually "see" the inputs?
- This could definitely be practical for specific kernels, as asked in #9, #10 and #14.
What do you think @willtebbutt @trthatcher ?
Overall, I'm of the opinion that parametrising the input type into the kernel would not be the right move. This is for the following reasons:
- I can't see any compelling reason to do so. If there's a good reason to specify it in the type that I'm unaware of, I would very much like to know about it 🙂 (I've responded directly to #14). In the absence of a good reason it just adds unnecessary complexity and should, therefore, be avoided.
- The complexity it adds is quite annoying. In particular, it requires that you know things like the element type of the inputs, and whether or not they happen to reside on the GPU (i.e. do I have a `Vector{Float64}` or a `CuVector{Float32}`?), at compile time. In turn, this would mean you need different kernels depending on whether or not this is the case. While this obviously isn't a catastrophic requirement, it is really very annoying -- I would much rather just be able to construct a kernel and have it play nicely with whatever inputs I give it, only complaining if I've given it something that it doesn't know how to handle.
- Taking the view of a kernel as a binary function, albeit with some special properties, typing the function in terms of concrete input types is quite an unusual thing to do in Julia. The "right" way to do this is to only constrain the inputs when necessary, e.g. when defining `kernelmatrix`. This would be akin to writing
```julia
struct Foo{T<:Real} end
bar(::Foo{T}, x::T) where {T} = ...
```
as opposed to simply
```julia
struct Foo end
bar(::Foo, ::Real) = ...
```
Addressing your points more specifically:
- I agree, but don't think that this is sufficient motivation to bake it into the type.
- Possibly. Hopefully less so as the community moves away from ForwardDiff.jl, ReverseDiff.jl, Tracker.jl, etc. towards Zygote.jl and ForwardDiff2.jl.
- I'm not pro having a `Transform` baked into the type, and am unclear on the reasoning for doing so currently. Will open a separate issue about this.
- Each of the three issues mentioned would be similarly well solved by loose type constraints on inputs and appropriate wrapper types for inputs (as I've discussed here; see the sketch below). This is a more generic solution in the sense that if you want to use a new type, there is a clear set of things that you need to do to get the existing infrastructure to work for you, and most of it will just work as intended for new inputs.
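To make the wrapper-type idea concrete, here is a minimal sketch assuming a `ColVecs`-style type (a name used in this ecosystem; the real implementation is more complete) that declares each column of a matrix to be one input:

```julia
# Sketch of an input wrapper: a matrix X is reinterpreted as a vector of
# inputs, one per column, so kernels can dispatch on "a collection of
# inputs" without caring about the element type or storage.
struct ColVecs{T,TX<:AbstractMatrix{T}} <: AbstractVector{Vector{T}}
    X::TX
end
Base.size(x::ColVecs) = (size(x.X, 2),)
Base.getindex(x::ColVecs, i::Int) = x.X[:, i]

# Usage: x = ColVecs(randn(3, 100))  # 100 inputs in R^3
```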
I completely agree that a kernel should not force its user to decide at compile time whether it expects an Array or CuArray as input - this largely has nothing to do with the kernel.
On the other hand, I do think there is value in including additional type parameters in distance-based kernels: specifically, both the distance used to create it, and the dimension of the space on which it's defined, with the latter encoded in a value type.
Something like

```julia
SqExponentialKernel{T,Tr,D<:Distances.SemiMetric,N<:Val}
```
would be nice, because one can then write
```julia
function dim(k::SqExponentialKernel{<:Any,<:Any,<:Distances.SqEuclidean,Val{N}}) where {N}
    return N
end
```
which will return N=1 for a 1-dimensional squared exponential kernel. This can be made nicer by defining custom wrappers for value types and other similar syntactic sugar.
This would make methods like `spectral_density` possible without running into type issues like those described above. An alternative is to place the dimension as a field inside the struct.
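For illustration, a hedged sketch of how such a `spectral_density` (hypothetical here) could be written on top of `dim` from above: the spectral density of a unit-lengthscale squared exponential kernel in D dimensions is S(ω) = (2π)^(D/2) exp(-‖ω‖²/2), which depends on D but on no input values.

```julia
# Hypothetical sketch: relies on dim(k) defined above, so the dimension
# comes from the kernel's type rather than from any inputs.
spectral_density(k, ω) = (2π)^(dim(k) / 2) * exp(-sum(abs2, ω) / 2)
```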
It also matches well with the underlying mathematics: whether or not a kernel is positive semi-definite generally depends on the kernel's domain. For example, there is a theorem (I hope I'm not misremembering) which states that, for all choices of length scale, the squared exponential kernel is positive semi-definite over a Riemannian manifold M if and only if M is flat in the sense of Alexandrov. So mathematically, the domain of a kernel is part of the kernel's definition.
> On the other hand, I do think there is value in including additional type parameters in distance-based kernels: specifically, both the distance used to create it, and the dimension of the space on which it's defined, with the latter encoded in a value type.
If it is indeed the case that we would want to construct the Squared Exponential kernel using something other than the squared Euclidean distance, I could buy the value of encoding the particular metric used in the type. Is there such an example that we care about (that can't just be constructed via the squared Euclidean distance on a transformation of the inputs)?
I should also be more explicit about the consequences syntactically of including dimensionality and metric info in the kernel. Say you want to construct a composite kernel of the form (apologies in advance for using Stheno syntax -- it's just what I've found to be easiest for this stuff):
```julia
a * stretch(eq(), p) + b * stretch(eq(), q) * periodic(eq(), r)
```
where
- `eq()` constructs an exponentiated quadratic / squared exponential kernel
- `(a * k)(x, y) = a * k(x, y)`
- `stretch(k, p)(x, y) = k(p * x, p * y)`
- `periodic(k, r)` returns a "periodicised" kernel `k` with period `r`
Assuming that we drop the T and Tr type parameters, we're left with something that will look like
```julia
a * stretch(eq(2, SqEuclidean), p) + b * stretch(eq(2, SqEuclidean), q) * periodic(eq(2, SqEuclidean), r)
```
which I would argue is significantly worse than before.
> This would make methods like `spectral_density` possible without running into type issues like those described above. An alternative is to place the dimension as a field inside the struct.
I'm still not sold on this being sufficient justification for the additional code complexity necessary for carrying around enough information to uniquely specify the domain of the kernel. I'm happy to be convinced otherwise, but my practical experience thus far has been that including the minimal possible information required to compute the value of the kernel when presented with inputs is the way forwards.
@willtebbutt I want to create kernels on distributions through the MMD, which is a metric (given some regularity conditions) on distributions. In the case of a convex combination of Dirac deltas, the MMD between two elements P and Q can probably be written in terms of the Euclidean norm, but that would make things a bit more difficult; writing it explicitly in terms of sums is easier (sketched below).
This is just one example; maybe there are others (kernels on exotic domains?), but I don't know enough about them or how much they are used in practice. I totally understand keeping complexity to a minimum, though, and I'm not sure what the right answer is.
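A hedged sketch of the use case, assuming a callable base kernel `k` and inputs stored column-wise: the (biased) squared MMD between two uniform mixtures of Dirac deltas, written explicitly in terms of sums as described above.

```julia
# Squared MMD between P = mean of deltas at columns of X and
# Q = mean of deltas at columns of Y, for a base kernel k.
function mmd2(k, X::AbstractMatrix, Y::AbstractMatrix)
    m, n = size(X, 2), size(Y, 2)
    kxx = sum(k(X[:, i], X[:, j]) for i in 1:m, j in 1:m) / m^2
    kyy = sum(k(Y[:, i], Y[:, j]) for i in 1:n, j in 1:n) / n^2
    kxy = sum(k(X[:, i], Y[:, j]) for i in 1:m, j in 1:n) / (m * n)
    return kxx + kyy - 2kxy
end
```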
Working a bit with @niklasschmitz on a special case where permutations and other symmetries are used, I realised we need to split the problem in two:
- There is the case where, to compute the global kernel function, we need to call a local kernel function repeatedly over samples/permutations, etc. I think that is the case for MMD, for instance.
- And there is the case where the only thing needed is a special metric/kernel function, etc. (I don't have any example in mind.)
For the first one I will join @willtebbutt's side and propose that it should be treated externally. Maybe a KernelFunctionsMeta.jl? There we could be more free to work on more specific cases one by one.
For the second one, this is still relevant to the package and should be added when needed.
So as a general comment, I would also be against specifying the input type as a parameter of the kernel. It also makes things complicated for composite kernels like `KernelSum` or `KernelProduct` (see the sketch below).
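To illustrate the bookkeeping, a hypothetical sketch (all names illustrative, not the package's actual types):

```julia
# With an input-type parameter T on every kernel, a composite kernel must
# carry T and reconcile it across its components...
abstract type TypedKernel{T} end

struct TypedKernelSum{T,K1<:TypedKernel{T},K2<:TypedKernel{T}} <: TypedKernel{T}
    k1::K1
    k2::K2
end

# ...whereas without the parameter there is no such bookkeeping:
abstract type PlainKernel end

struct PlainKernelSum{K1<:PlainKernel,K2<:PlainKernel} <: PlainKernel
    k1::K1
    k2::K2
end
```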
Another use case: if one has a Gaussian process f, then ∇f will also be a Gaussian process. In this case, cov(f, ∇f) is given by the gradient of the kernel. The dimension of the input needs to be specified if one wants to compute this gradient (a sketch follows).
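A minimal sketch of this, assuming a callable kernel `k` and using ForwardDiff (the helper name is hypothetical):

```julia
using ForwardDiff

# cov(f(x), ∇f(y)) is the gradient of k(x, ·) evaluated at y; computing it
# requires knowing the input dimension, here taken implicitly from length(y).
cov_f_gradf(k, x, y) = ForwardDiff.gradient(y_ -> k(x, y_), y)

# Usage, e.g. with a squared exponential kernel on R^3:
# cov_f_gradf((x, y) -> exp(-sum(abs2, x .- y) / 2), randn(3), randn(3))
```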
Let's continue the discussion on all these things long-term. I think as the ecosystem develops and more serious use cases start popping up, the right API will become apparent. My current view is that adding input dimension as a value type is helpful, whereas adding something too complicated is probably not.