giotto-tda
Add datasets module to load and generate toy datasets
Description
scikit-learn has a datasets module that provides handy utility functions to load and generate toy datasets. These functions feature prominently in the scikit-learn examples, and it would be nice to have similar functionality in giotto-tda.
Suggestions for synthetic datasets include:
- make_point_clouds: Generate an array of spheres and tori in 3 dimensions with corresponding labels (useful for showing persistent homology + shape classification).
- make_time_series: Generate an array of periodic and non-periodic time series with corresponding labels (useful for showing sliding window embeddings and time series classification).
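To make the second suggestion concrete, here is a minimal sketch of what a make_time_series helper could look like. Nothing here is existing giotto-tda API: the signature, the parameter names (`n_samples`, `n_points`, `noise`, `random_state`) and the choice of a noisy sine for the periodic class and a random walk for the non-periodic class are all assumptions made for illustration.

```python
import numpy as np


def make_time_series(n_samples=10, n_points=200, noise=0.1, random_state=None):
    """Hypothetical helper: return (X, y) with periodic (1) and non-periodic (0) series."""
    rng = np.random.default_rng(random_state)
    t = np.linspace(0.0, 1.0, n_points)
    n_periodic = n_samples // 2  # first half periodic, second half non-periodic
    X = np.empty((n_samples, n_points))
    y = np.zeros(n_samples, dtype=int)
    for i in range(n_samples):
        if i < n_periodic:
            # Periodic class: sine wave with random frequency and phase.
            freq = rng.uniform(2.0, 8.0)
            phase = rng.uniform(0.0, 2.0 * np.pi)
            X[i] = np.sin(2.0 * np.pi * freq * t + phase)
            y[i] = 1
        else:
            # Non-periodic class: rescaled Gaussian random walk.
            X[i] = np.cumsum(rng.normal(size=n_points)) / np.sqrt(n_points)
        # Additive Gaussian observation noise on every series.
        X[i] += noise * rng.normal(size=n_points)
    return X, y
```

The returned `X` has shape `(n_samples, n_points)`, so it could be fed to a sliding window embedding one series at a time.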
Suggestions for point cloud and graph datasets could take inspiration from PyTorch Geometric's datasets module.
This would be good. @gtauzin and I started doing something along the lines of the make_point_clouds method you are envisioning and managed to get a few nice spaces and constructions on spaces. The reason this was not completed was the lack of uniformity of the sampling. In order to get this done well, the probability function has to be modified by a Hessian term associated with the parametrization of the curved space. Maybe we can revisit this point sometime.
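As an illustration of the uniformity issue, consider the torus. One reading of the correction described above (this is an interpretation, not the code the commenters wrote) is that the angles should be drawn with density proportional to the surface area element of the parametrization, which for a torus with major radius R and minor radius r is proportional to R + r cos(theta). The sketch below uses rejection sampling to do that; the function name and parameters are hypothetical.

```python
import numpy as np


def sample_torus_uniform(n_points, R=2.0, r=1.0, random_state=None):
    """Sample points uniformly with respect to surface area on a torus (assumes R > r)."""
    rng = np.random.default_rng(random_state)
    accepted = np.empty(0)
    while accepted.size < n_points:
        # Propose minor angles uniformly, then accept each with probability
        # proportional to the area element R + r*cos(theta).
        candidates = rng.uniform(0.0, 2.0 * np.pi, size=2 * n_points)
        u = rng.uniform(0.0, 1.0, size=candidates.size)
        keep = u < (R + r * np.cos(candidates)) / (R + r)
        accepted = np.concatenate([accepted, candidates[keep]])
    theta = accepted[:n_points]
    # The area element does not depend on the major angle, so it stays uniform.
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    x = (R + r * np.cos(theta)) * np.cos(phi)
    y = (R + r * np.cos(theta)) * np.sin(phi)
    z = r * np.sin(theta)
    return np.column_stack([x, y, z])
```

Drawing both angles uniformly instead would oversample the inner part of the torus, where the same angular range covers less surface area.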
Cool, it seems you guys went for the hardcore version :) All I had in mind were spheres and tori with Gaussian noise added, but perhaps this is too limiting.
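For reference, the simple version could be sketched roughly as follows. The function name comes from the issue description, but the signature, the helper names `_sphere` and `_torus`, and the default radii are assumptions. The torus angles are drawn uniformly here, so the torus sampling is not uniform with respect to surface area, which is exactly the limitation raised earlier in the thread.

```python
import numpy as np


def _sphere(n_points, rng, radius=1.0):
    # Normalised Gaussian vectors are uniformly distributed on the sphere.
    v = rng.normal(size=(n_points, 3))
    return radius * v / np.linalg.norm(v, axis=1, keepdims=True)


def _torus(n_points, rng, R=2.0, r=1.0):
    # Naive version: both angles uniform (not uniform in surface area).
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    x = (R + r * np.cos(theta)) * np.cos(phi)
    y = (R + r * np.cos(theta)) * np.sin(phi)
    z = r * np.sin(theta)
    return np.column_stack([x, y, z])


def make_point_clouds(n_samples_per_shape=5, n_points=100, noise=0.05,
                      random_state=None):
    """Hypothetical helper: return (X, y) with label 0 = sphere, 1 = torus."""
    rng = np.random.default_rng(random_state)
    clouds, labels = [], []
    for label, sampler in enumerate((_sphere, _torus)):
        for _ in range(n_samples_per_shape):
            cloud = sampler(n_points, rng)
            # Isotropic Gaussian noise added to every coordinate.
            cloud = cloud + noise * rng.normal(size=cloud.shape)
            clouds.append(cloud)
            labels.append(label)
    return np.stack(clouds), np.array(labels)
```

The output `X` has shape `(2 * n_samples_per_shape, n_points, 3)`, i.e. a collection of point clouds, with one label per cloud in `y`.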
If you have some Python code lying around, you could make a GitHub gist and link it in these comments.
The code is not so important, especially since it doesn't do what one would really like it to do, but since you asked, I am sending code that samples a point cloud near the real projective plane embedded in R4.
To get this thing properly done, what we need is a method that can sample an interval according to a custom, not necessarily uniform, probability density function. Any leads on something like this?
The first part of this notebook has the sampling functions for S2 and RP2. I just ran it and the plotting still works.
I wanted to have a look at the notebook, but I do not have access rights; you should receive an email requesting them.
For sampling from arbitrary densities, something like Metropolis-Hastings? Or, if the density is represented as a discretized array, maybe inverse transform sampling?
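Both options can be sketched in a few lines of NumPy. This is only a rough illustration under stated assumptions: the function names and parameters are hypothetical, the inverse transform version assumes a strictly positive density tabulated on an increasing grid, and the Metropolis-Hastings version uses a plain Gaussian random-walk proposal with no burn-in or thinning.

```python
import numpy as np


def inverse_transform_sample(grid, density, n_samples, random_state=None):
    """Approximate sampling from a density tabulated on an increasing 1D grid."""
    rng = np.random.default_rng(random_state)
    grid = np.asarray(grid, dtype=float)
    density = np.asarray(density, dtype=float)
    # Cumulative trapezoidal integral of the (possibly unnormalised) density.
    increments = 0.5 * (density[1:] + density[:-1]) * np.diff(grid)
    cdf = np.concatenate([[0.0], np.cumsum(increments)])
    cdf /= cdf[-1]
    # Invert the CDF by linear interpolation at uniform random levels.
    u = rng.uniform(0.0, 1.0, size=n_samples)
    return np.interp(u, cdf, grid)


def metropolis_hastings(log_density, x0, n_samples, step=1.0, random_state=None):
    """Random-walk Metropolis-Hastings chain for a 1D unnormalised log-density."""
    rng = np.random.default_rng(random_state)
    x = float(x0)
    logp = log_density(x)
    chain = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.normal()
        logp_proposal = log_density(proposal)
        # Symmetric proposal, so the acceptance ratio is just the density ratio.
        if np.log(rng.uniform()) < logp_proposal - logp:
            x, logp = proposal, logp_proposal
        chain[i] = x
    return chain
```

For a density on a bounded interval, such as a distribution over an angle, the inverse transform version is probably the simpler choice, since it yields independent samples and needs no tuning of a step size.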