MultivariateStats.jl
Mutating labels when training MulticlassLDA
This recent change really threw a wrench into my pipeline: https://github.com/JuliaStats/MultivariateStats.jl/blob/bf15ed047bb44dc3f2ebc5db73a9b43df3a55059/src/lda.jl#L522
I am training LDAs on one set of trials and testing the decoding performance on a separate set of trials. All of a sudden, my performance dropped to chance, and after about a day of digging around I realised that toindices actually remaps the labels. In other words, when I decoded the test set by finding the projected class mean closest to each sample, I was still using the original labels for testing, so the class assignments were all essentially random.
As a stopgap measure for my pipeline, I defined
MultivariateStats.toindices(label::AbstractVector{T}) where T <: Integer = label
which fixed my issue, but I realise that this is not a general solution. In particular, if there are gaps in label, such that maximum(label) !== length(unique(label)), this could also cause problems.
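To make the gap concern concrete, here is a minimal sketch of a check for when integer labels can safely double as class indices. The helper name `is_contiguous` is hypothetical and not part of MultivariateStats.jl. Note that it is slightly stricter than comparing maximum(label) with length(unique(label)): it also rejects labels such as 0, 2, 3 that pass the count comparison but are not valid 1-based indices.

```julia
# Hypothetical helper (not part of MultivariateStats.jl): integer labels can
# be used directly as class indices only if they are exactly 1, 2, ..., K.
function is_contiguous(label::AbstractVector{<:Integer})
    u = sort(unique(label))
    return u == 1:length(u)
end

ok_labels  = [1, 2, 3, 2, 1]   # exactly 1:3, safe to use as indices
gap_labels = [1, 3, 5, 3]      # 3 classes, but maximum is 5
zero_based = [0, 2, 3]         # maximum == number of classes, yet 0 is not a valid index

contiguous_ok   = is_contiguous(ok_labels)
contiguous_gap  = is_contiguous(gap_labels)
contiguous_zero = is_contiguous(zero_based)
```

With the identity stopgap above, only the first of these label vectors would be safe to pass through unchanged.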
Is there currently an array type that fulfils those criteria?
I see. That was a problem with the previous implementation. The labels and indices were conflated, which caused bounds errors if the labels weren't properly defined (see #187). It looks like a design problem, because the LDA model doesn't carry any explicit information about labels. Class centroids relate to the index of a label rather than to the label itself. You can use toindices to get a map of labels to indices and use this map to get the correct class centroid and weight data.
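As a rough sketch of that idea, the snippet below keeps an explicit label-to-index map next to the model and translates decoded indices back to the original labels. It deliberately does not call MultivariateStats.jl: `label_index_map` and `nearest_class` are hypothetical helpers, the centroid matrix is made-up toy data, and the "order of first appearance" indexing is an assumption that should be checked against what toindices produces in your installed version.

```julia
# Hypothetical helper (not part of MultivariateStats.jl): assign each distinct
# label an index in order of first appearance. ASSUMPTION: the ordering used
# here must match the ordering the library uses internally.
function label_index_map(labels::AbstractVector{T}) where T
    index_of = Dict{T,Int}()
    for l in labels
        # get! only inserts when the label has not been seen before
        get!(index_of, l, length(index_of) + 1)
    end
    return index_of
end

# Hypothetical nearest-centroid decoder on already-projected data.
# Column k of `centroids` is the projected mean of class index k.
function nearest_class(centroids::AbstractMatrix, x::AbstractVector)
    dists = [sum(abs2, centroids[:, k] .- x) for k in 1:size(centroids, 2)]
    return argmin(dists)
end

train_labels = [10, 30, 10, 20, 30]             # labels with gaps
index_of = label_index_map(train_labels)        # 10 => 1, 30 => 2, 20 => 3
label_of = Dict(i => l for (l, i) in index_of)  # inverse map: index => label

centroids = [0.0 5.0 10.0]            # toy 1×3 matrix of projected class means
k = nearest_class(centroids, [4.2])   # class index of the closest centroid
decoded = label_of[k]                 # translate the index back to a label
```

Comparing `decoded` (rather than `k`) against the original test labels avoids the mismatch described at the top of this issue.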