
Documentation and implementation of llsq don't agree on data matrix structure

Open smllmn opened this issue 7 years ago • 3 comments

Unless I'm missing something, the implementation (and docs) of llsq don't agree with the statement on the documentation index that data matrices have features as rows and observations as columns.

In the following code from the llsq documentation, the number of observations is 1000, and the number of features is 3, but the observation matrix X has 1000 rows and 3 columns, and the output from llsq is a 3-vector. Note that the code does not use the trans option provided by llsq.

using MultivariateStats

# prepare data
X = rand(1000, 3)               # feature matrix
a0 = rand(3)                    # ground truths
y = X * a0 + 0.1 * randn(1000)  # generate response

# solve using llsq
a = llsq(X, y; bias=false)

# do prediction
yp = X * a
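For contrast, here is a minimal sketch (plain LinearAlgebra, no MultivariateStats calls) of what the same fit would look like under the features-as-rows convention stated in the documentation index; the 3×1000 layout `Xt` is hypothetical, and the response is left noiseless for clarity:

```julia
using LinearAlgebra, Random

Random.seed!(0)
Xt = rand(3, 1000)    # hypothetical features-as-rows layout (3 x 1000)
a0 = rand(3)
y = Xt' * a0          # response still has one entry per observation

a = Xt' \ y           # transpose back to observations-as-rows before solving
@assert a ≈ a0
```

Under that convention, every call site would need the transpose shown above (or a `trans`-style option) before solving, which is exactly the mismatch this issue is about.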

smllmn avatar Jun 10 '17 15:06 smllmn

The docs say: "Here, y can be either a vector, or a matrix where each column is a response vector." So that means features are in columns, and therefore observations are in rows. This goes against the rule stated in the documentation index.

I guess the most immediate fix is to make the documentation more explicit. Would you care to make a pull request? Then whether the behavior of the function should be changed to match that of other functions is another issue.

nalimilan avatar Jun 12 '17 11:06 nalimilan

The docs look right to me. The phrase that may be confusing is "each column is a response vector", which is not the same as "each column is a feature vector". The author is referring to the case where multiple y's are modeled from the same feature matrix X: for example, y might hold income and experience level, both predicted from birthday and city.
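The multiple-response reading can be illustrated with plain linear algebra (a sketch using the backslash solver rather than llsq itself, so it stands alone); each column of Y is one response variable, fit against the same X:

```julia
using LinearAlgebra, Random

Random.seed!(1)
X = rand(1000, 3)   # 1000 observations as rows, 3 features as columns
A0 = rand(3, 2)     # one coefficient column per response variable
Y = X * A0          # two response columns sharing the same feature matrix

A = X \ Y           # least-squares solve handles both columns in one call
@assert size(A) == (3, 2)
@assert A ≈ A0
```

Each column of the result A is the coefficient vector for the corresponding column of Y, which is what "each column is a response vector" describes.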

Indeed, rows are supposed to be observations and columns are features of an observation. The reason the doc might be written this way is that the least-squares solution for llsq is

a = pinv(X' * X) * (X' * y)

Whether y is a single variable being modeled or a whole matrix, the formula is the same. We can accommodate multiple y columns by reusing the computation of pinv(X' * X) * X'.
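Concretely, that reuse looks like the following sketch (standalone LinearAlgebra code, not the package's internal implementation); the shared factor P is computed once and applied to both a single response vector and a response matrix:

```julia
using LinearAlgebra, Random

Random.seed!(2)
X = rand(100, 3)
y = rand(100)       # a single response vector
Y = rand(100, 4)    # four responses modeled from the same X

P = pinv(X' * X) * X'   # shared part of the solution, computed once

a = P * y           # 3-element coefficient vector
A = P * Y           # 3x4 coefficient matrix, one column per response
@assert isapprox(a, X \ y; atol=1e-8)
@assert isapprox(A, X \ Y; atol=1e-8)
```

For a well-conditioned X this matches the backslash solver; in practice a QR-based solve is preferred numerically, but the normal-equations form above is the one the formula in the docs reflects.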

jeffwong avatar Jul 11 '17 02:07 jeffwong

The docs are right, but the behavior of this function appears to be inconsistent with that of the rest of the API (which treats columns as observations). That's why I said the docs could be improved, by making them more explicit about this exception.

nalimilan avatar Jul 11 '17 08:07 nalimilan