TSne.jl icon indicating copy to clipboard operation
TSne.jl copied to clipboard

tnse() treats input matrix X rows as observations, should be columns

Open alyst opened this issue 8 years ago • 6 comments
trafficstars

tsne(X) treats X rows as observations (points) and columns as their features (dimentions). I guess it's because the original implementation is written in Python/NumPy, which uses row-major order. However, Julia packages (Clustering.jl, Distances.jl, BlackBoxOptim.jl) tend to treat rows as features and columns as observations, because in Julia matrices are stored in column-major order (one column/observation occupies continuous block of memory).

IMO it would be nice to switch to the default Julian behaviour at some point, but that would be a big breaking change, of course.

alyst avatar Nov 21 '17 23:11 alyst

Other notable package also treat columns as observations: MultivariateStats.jl, MLDataUtils.jl (though that is configurable, columns is the default), Distributions.jl

oxinabox avatar Dec 01 '17 04:12 oxinabox

I agree. Row-major can actually cause performance issue: fetching a row is slow than fetching a column in Julia

xukai92 avatar Jan 15 '18 00:01 xukai92

One more vote for column-major ordering.

TSne.jl could be absorbed by MultivariateStats.jl for better maintenance, peer review of PRs, and performance improvements.

juliohm avatar May 13 '18 00:05 juliohm

@juliohm Technically speaking, you can submit PRs and get peer review right away. E.g. you are very welcome to submit col-major PR. Integration into MultivariateStats.jl (or the move under JuliaStats umbrella, as a more modest alternative) could make it easier for the users to discover tsne() and guarantee some minimal level of maintenance. But, based on my experience, I don't expect the overflow of PRs because of the code relocation.

alyst avatar May 13 '18 12:05 alyst

@juliohm if someone were to implement this into MultivariateStats (which I think is a good idea), it will need to be clean-room implemented from the paper, or some other description of the algorithm as the license on this repo is not MIT compatible

oxinabox avatar May 14 '18 04:05 oxinabox

Yes, it is unfortunate that t-SNE is currently separate from the rest of the embedding methods in MultivariateStats.jl. I hope someone will tackle this issue in the future.

juliohm avatar May 14 '18 04:05 juliohm