MultivariateStats.jl Mutating labels when training MulticlassLDA

Open grero opened this issue 3 years ago • 1 comment

This recent change really threw a wrench into my pipeline: https://github.com/JuliaStats/MultivariateStats.jl/blob/bf15ed047bb44dc3f2ebc5db73a9b43df3a55059/src/lda.jl#L522

I am training LDAs on one set of trials and testing the decoding performance on a separate set of trials. All of a sudden, my performance dropped to chance, and after about a day of digging around I realised that toindices actually mutates the label names. In other words, when I decoded the test set by finding the projected class mean closest to each sample, I was comparing against the original labels, which no longer matched the class indices used inside the model, so the class assignments were essentially random.

As a stopgap measure for my pipeline, I defined

MultivariateStats.toindices(label::AbstractVector{T}) where T <: Integer = label

which fixed my issue, but I realise that this is not a general solution. In particular, if there are gaps in label, such that maximum(label) !== length(unique(label)), this could also cause problems. Is there currently an array type that fulfils that criterion?
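
To make the "gaps" concern concrete, here is a minimal sketch of a check for whether integer labels can safely double as class indices. The helper name labels_are_indices is hypothetical and not part of MultivariateStats.jl; it uses only Base. It is slightly stricter than comparing maximum(label) with length(unique(label)), since labels such as [0, 2, 3] pass that comparison but are still not the range 1:3.

```julia
# Hypothetical helper (not part of MultivariateStats.jl):
# returns true only when the distinct labels are exactly 1, 2, ..., k,
# which is the case where an identity label-to-index mapping is safe.
function labels_are_indices(labels::AbstractVector{<:Integer})
    distinct = sort(unique(labels))
    return distinct == 1:length(distinct)
end

contiguous = labels_are_indices([1, 2, 3, 2, 1])  # labels are exactly 1:3
gappy      = labels_are_indices([1, 3, 7, 3])     # gaps: max is 7, only 3 classes
shifted    = labels_are_indices([0, 2, 3])        # max == class count, but not 1:3
```

If this check fails, the identity override above would hand the model labels that do not line up with class indices.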

Oct 07 '22 04:10 grero

I see. That was a problem with the previous implementation: labels and indices were conflated, which caused bounds errors if the labels weren't properly defined (#187). It looks like a design problem, because the LDA model doesn't carry any explicit information about labels. Class centroids relate to the index of a label rather than to the label itself. You can use toindices to get a map from labels to indices, and use this map to get the correct class centroid and weight data.
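
As a rough illustration of the label-to-index map idea, the sketch below builds such a map with Base only and uses its inverse to turn a nearest-centroid index back into an original label. Everything here is an assumption for illustration: label_index_map and nearest_index are hypothetical helpers, the centroid values are made up, and indices are assigned in order of first appearance, which may not be the ordering the library uses. Check the return value of toindices in your installed version before relying on it.

```julia
# Hypothetical helper: give each distinct label an index 1..k,
# in order of first appearance (an assumed ordering).
function label_index_map(labels::AbstractVector{T}) where T
    index_of = Dict{T,Int}()
    for l in labels
        # insert the next free index only if the label is new
        get!(index_of, l, length(index_of) + 1)
    end
    return index_of
end

# Hypothetical helper: index of the centroid (column) closest to sample z.
nearest_index(z, centroids) =
    argmin([sum(abs2, z .- c) for c in eachcol(centroids)])

train_labels = [10, 30, 10, 70, 30]
index_of = label_index_map(train_labels)        # label => class index
label_of = Dict(i => l for (l, i) in index_of)  # class index => label

# Made-up 1-D projected class centroids, one column per class index.
centroids = [0.0 5.0 10.0]

# Decode a projected test sample: nearest centroid index, then back to a label.
predicted_label = label_of[nearest_index([4.6], centroids)]
```

The point of the design is that the test-time comparison happens in index space, and original labels are only recovered at the very end through the inverse map.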

Oct 07 '22 17:10 wildart
