
Attraction-Repulsion Spectrum in Neighbor Embeddings

This repository holds the code for the paper [[https://www.jmlr.org/papers/v23/21-0055.html][Attraction-Repulsion Spectrum in Neighbor Embeddings]].

If you use the work herein, we'd appreciate the following citation:

#+begin_src bibtex
@article{boehm2022attraction,
  author  = {Jan Niklas Böhm and Philipp Berens and Dmitry Kobak},
  title   = {Attraction-Repulsion Spectrum in Neighbor Embeddings},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {95},
  pages   = {1--32},
  url     = {http://jmlr.org/papers/v23/21-0055.html}
}
#+end_src

* Structure/Installation

After all instructions in this section have been completed, the code can be installed via

#+begin_src sh
git clone https://github.com/berenslab/ne-spectrum
cd ne-spectrum
pip install --user -r requirements.txt
python setup.py build
mv bh*.so jnb_msc/transformer/
pip install --user -e .
#+end_src

The above commands will probably fail to compile the Cython extensions. For that step to work, you need to install/compile [[https://github.com/pavlin-policar/openTSNE][openTSNE]] manually (clone the repo and install it in the same way as above). This project has a build-time dependency on a build-time artifact (the file =quad_tree.pxd=) that is not installed alongside openTSNE by default.

After installing openTSNE this way, you have to adapt the two lines in =setup.py= that point to the locally installed openTSNE folder, so that the missing file can be found during the build process.
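
For orientation, the adaptation amounts to pointing Cython's include path at your local openTSNE clone. The sketch below is illustrative only; the variable names, glob pattern, and path are assumptions, not the repository's exact code:

#+begin_src python
# Illustrative sketch, not the repository's actual setup.py.  The goal is
# to let Cython resolve openTSNE's quad_tree.pxd at build time.
from pathlib import Path
from Cython.Build import cythonize

# Assumption: openTSNE was cloned next to this repository; adjust as needed.
OPENTSNE = str(Path("..") / "openTSNE" / "openTSNE")

extensions = cythonize(
    "jnb_msc/transformer/*.pyx",  # hypothetical extension sources
    include_path=[OPENTSNE],      # directory containing quad_tree.pxd
)
#+end_src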

Furthermore, you need a patched version of forceatlas2 from [[https://github.com/jnboehm/forceatlas2]], which adds degree repulsion to fa2. Install it as follows:

#+begin_src sh
git clone https://github.com/jnboehm/forceatlas2
cd forceatlas2
rm fa2/fa2util.c
python setup.py build
pip install --user -e .
#+end_src

There is also a =requirements.txt= file to install the dependencies. The code has been run in a conda environment with Python 3.8.

The preprocessing script for the Treutlein dataset resides in =static/=.

* Running the code

To create a figure, you can simply [[*What are all those .do files?][redo]] one of the files in =media/=. For example, after installing redo, you can run =redo -j6 media/ar-spectrum.pdf=. This will make sure that the data is present and up to date, and then generate the figure. The instructions are written in the file =media/ar-spectrum.pdf.do=. This calls out to redo again ([[file:media/ar-spectrum.pdf.do::redo.redo_ifchange(datafiles + [plotter.labelname, plotter.rc])][l. 268, in =media/ar-spectrum.pdf.do=]]), which recurses until all dependencies have been satisfied and then creates the figure. The file itself is written in Python, although a do file is language agnostic; its language is determined by the shebang (=#!=) on its first line.
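
To give a feel for the format, here is a hypothetical sketch of a Python do file (not a file from this repository). redo invokes a do file with the target name, the target's basename, and a temporary output file as its three arguments, and the script declares its inputs via =redo_ifchange= before writing its result:

#+begin_src python
#!/usr/bin/env python
# Hypothetical do file sketch.  redo passes three arguments:
# $1 = target name, $2 = target basename, $3 = temporary file that
# redo atomically renames to the target if the script succeeds.
import sys
import shutil

from jnb_msc import redo  # the project's wrapper around redo

target, basename, tmpfile = sys.argv[1:4]

# Declare the dependencies; redo (re)builds any that are missing or stale.
redo.redo_ifchange(["data/mnist/pca/tsne/data.png"])

# ... render the figure, then hand the result to redo via tmpfile ...
shutil.move("figure.pdf", tmpfile)
#+end_src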

To see which parameters have been set, one can inspect the filenames that are generated by the script (look at what is supplied to =jnb_msc.redo.redo_ifchange(...)=). This shows which parameters deviate from the defaults set in the class definition.

* Code structure

The classes in the project are all derived from a single base class. It expects every subclass to implement four methods:

  1. =get_datadeps()=
  2. =load()=
  3. =transform()=
  4. =save()=

The first method allows querying the object for the files it needs; [[*What are all those .do files?][redo]] uses this to track the dependencies properly. The remaining methods should be more or less self-explanatory. It is of course also possible to use the algorithms manually: for that, the =.data= field needs to be populated with suitable data, and possibly the =.init= field as well, depending on the algorithm at hand.
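
As a concrete (but hypothetical) sketch of what such a subclass looks like; the class name, paths, and transformation are illustrative and not taken from the repository:

#+begin_src python
import numpy as np


class CenterStage:  # the real classes derive from the project's base class
    """Illustrative stage that centers its input data."""

    def __init__(self, indir, outdir):
        self.indir = indir
        self.outdir = outdir

    def get_datadeps(self):
        # The input files this stage depends on; redo uses this list
        # to rebuild stale inputs before this stage runs.
        return [f"{self.indir}/data.npy"]

    def load(self):
        self.data = np.load(self.get_datadeps()[0])

    def transform(self):
        self.data = self.data - self.data.mean(axis=0)

    def save(self):
        np.save(f"{self.outdir}/data.npy", self.data)
#+end_src

For manual use, one would instead assign =.data= (and, where applicable, =.init=) directly and call =transform()=.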

There are four major different types:

  1. =GenStage=
  2. =NDStage=
  3. =NNStage=
  4. =SimStage=

=GenStage= is the root class for the classes that generate a dataset. This can mean simulating data or simply taking an existing dataset and putting it in the correct place (again, for redo and this project structure). =NDStage= takes an =NxD= matrix and reduces its dimensionality; one example would be PCA. =NNStage= can take the same input as =NDStage= (but usually takes the output of e.g. PCA) and turns it into an =NxN= affinity/adjacency matrix. This can then, in turn, be fed into the last one, =SimStage=. These classes take both an =NxN= matrix and an =NxD= (=D=2=) array that serves as the initial layout.
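
In terms of array shapes, the four stages compose roughly as follows (a schematic with made-up numbers; the slicing merely stands in for the actual algorithms):

#+begin_src python
import numpy as np

n = 1000                       # stand-in size; MNIST would have n = 70000
raw = np.random.rand(n, 784)   # GenStage output: an NxD dataset
low = raw[:, :50]              # NDStage (e.g. PCA) output: N x 50
# NNStage: turns `low` into an NxN affinity/adjacency matrix
# SimStage: consumes that NxN matrix plus an Nx2 initial layout
init = low[:, :2]              # the NxD (D=2) initial layout
#+end_src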

There are further minor classes, for example simple ones that rescale the input to have a predefined standard deviation or maximum scale (code in =jnb_msc/transformer/scale.py=).
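
Conceptually, such a rescaling step is tiny; a sketch (not the repository's exact code) of the standard-deviation variant:

#+begin_src python
import numpy as np


def stdscale(data, f=1e-4):
    """Rescale the array so that its standard deviation becomes f (sketch)."""
    return data / data.std() * f
#+end_src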

If anything is unclear, please let me know.

* What are all those .do files?

This repository uses [[https://github.com/apenwarr/redo/][redo]] to essentially “cache” the computations that are carried out by the experiments. It works similarly to make in that it tries to determine which files have changed and which parts need to be rebuilt. I chose this approach so that I wouldn't have to either recompute everything every time, or manually change the code to either load a (possibly stale) file or recompute it and save it.

For more information, the (rough) notes on the original design are [[http://cr.yp.to/redo.html][here]].

Unfortunately, the implementation I am using is written in Python 2 and hence needs to be installed separately. It is not strictly necessary to install this library, but all of the code that generates the figures uses it to check the presence (and staleness) of files. Furthermore, the =load()= and =save()= functions are written with redo in mind.

For example, to get an image of t-SNE on MNIST, one could write in the root of the repository:

#+begin_src sh
redo 'data/mnist/pca/affinity/stdscale;f:1e-4/tsne/data.png'
#+end_src

This will “generate” the MNIST dataset, then reduce it with PCA to 50 dimensions (the default here). Afterwards, it will calculate the pairwise affinities from the result. Then the standard deviation will be set to the given value, and finally t-SNE will be run with the scaled dense =NxD= matrix and the =NxN= matrix for its affinities. After the optimization, the embedding (named =data.npy=) will be used to create a scatter plot, which will in turn be saved as =data.png=. This file can then be viewed.
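
For comparison, here is a rough, uncached sketch of the same pipeline written directly against scikit-learn and openTSNE (the parameters differ from the repository's stages, and the random array merely stands in for MNIST):

#+begin_src python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from openTSNE import TSNE

X = np.random.rand(1000, 784)                # stand-in for the MNIST data
X50 = PCA(n_components=50).fit_transform(X)  # the "pca" step
init = X50[:, :2] / X50[:, :2].std() * 1e-4  # the "stdscale;f:1e-4" step
emb = TSNE(initialization=init).fit(X50)     # affinities + t-SNE optimization
plt.scatter(emb[:, 0], emb[:, 1], s=1)
plt.savefig("data.png")                      # the final scatter plot
#+end_src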

The prefix =data/= is not mandatory; it can be omitted, or the path can be structured in any other way. The “effect” of the other folder names is shown in =jnb_msc/util.py=, where the names are resolved to classes. Further arguments can be appended to a name after a semicolon, as colon-separated key/value pairs; in the example above, =stdscale;f:1e-4= causes =stdscale= to be called with =f=1e-4=.
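
Schematically, resolving a single path component works like this (a sketch; the actual parsing and class lookup live in =jnb_msc/util.py=):

#+begin_src python
def parse_component(name):
    """Split a folder name like 'stdscale;f:1e-4' into the stage name
    and its keyword arguments (illustrative sketch)."""
    head, *args = name.split(";")
    kwargs = dict(arg.split(":", 1) for arg in args)
    return head, kwargs


print(parse_component("stdscale;f:1e-4"))  # ('stdscale', {'f': '1e-4'})
#+end_src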

** =prepped/=

The folder =prepped/= collects all the files produced by the algorithms. This has two reasons. Firstly, it prevents clutter in the main directories. Secondly, this way the files can actually be tracked via redo, since it does not support multiple output files from one run. For more information on that, see [[https://redo.readthedocs.io/en/latest/cookbook/latex/][the documentation]] (the heading “Virtual targets, side effects, and multiple outputs”).