
calculate projection in preprocessing

Open cabreraalex opened this issue 2 years ago • 6 comments

If a user provides embeddings, we should compute the projections as a preprocessing step and cache the result. This will make every later interaction much faster. We can also add an option to skip computing projections if we want.
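A minimal sketch of what the caching step could look like, not Zeno's actual implementation. The names `cache_key`, `cached_projection`, and the JSON cache layout are hypothetical, and `project` stands in for whatever projection routine is used (for example a t-SNE wrapper). The key hashes both the embeddings and the parameters, so changing either one triggers a recompute.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Sequence

Embedding = Sequence[float]
Point = tuple[float, float]


def cache_key(embeddings: Sequence[Embedding], params: dict) -> str:
    """Hash embeddings and parameters together (hypothetical keying scheme)."""
    payload = json.dumps(
        {"embeddings": [list(e) for e in embeddings], "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_projection(
    embeddings: Sequence[Embedding],
    project: Callable[[Sequence[Embedding], dict], Sequence[Sequence[float]]],
    params: dict,
    cache_dir: str,
) -> list[Point]:
    """Return cached 2D coordinates if present, otherwise compute and store them."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"projection_{cache_key(embeddings, params)}.json"

    if path.exists():
        # Cache hit: skip the expensive projection entirely.
        return [tuple(p) for p in json.loads(path.read_text())]

    # Cache miss: run the projection once and persist the coordinates.
    points = [tuple(p) for p in project(embeddings, params)]
    path.write_text(json.dumps(points))
    return points
```

Only the 2D coordinates are stored, so the cache stays small relative to the embeddings themselves.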

cabreraalex avatar Feb 12 '23 16:02 cabreraalex

@xnought any thoughts on this? Any downsides? One I can think of is that you have to store the projection coordinates, which uses disk space, but that should be minimal?

cabreraalex avatar Feb 13 '23 18:02 cabreraalex

Depending on the data format, yeah, disk space would not be too bad.

Side note: it could be better to use Parquet when caching columns, for the extra compression.

xnought avatar Feb 13 '23 18:02 xnought

I do like your idea. I think I'll give that a shot next.

xnought avatar Feb 13 '23 18:02 xnought

There is also something else to think about: should users be able to change the t-SNE parameters (like perplexity)?

Should the user be able to recompute t-SNE? Given how much the results vary with the t-SNE parameters, maybe?

xnought avatar Feb 13 '23 18:02 xnought

Also, if their dataset is too large and t-SNE ends up taking forever, what then?

That would favor our current method, where they can just load one t-SNE projection instead of preloading all of them.

xnought avatar Feb 13 '23 18:02 xnought

We could add options to the TOML for the t-SNE parameters?

For your last point, if the dataset is too large, the current method would be worse, because if you leave the screen it stops processing and you lose your progress.

cabreraalex avatar Feb 13 '23 19:02 cabreraalex