graspologic
implement anomaly detection in time series of graphs
paper forthcoming
@bdpedigo this became important, as per cep's request, because JL@MSR wants it in graspy now.
Guodong has an R package, so it should be a simple matter of porting to Python. Unclear who the right person is for this?
Maybe @j1c ?
@jovo could we get more context? Do you have the R code/paper? (I think I saw a pdf a long time ago but haven't seen anything recently.)
Ask Guodong
@bdpedigo Thoughts on the API? There are a few wrinkles in this.
- There is no out-of-sample prediction. You train and predict on the input dataset, so the standard `fit` and `fit_predict` don't make sense. It could be a class with just `fit_predict`, but then it could just be a function.
- There is graph-wise and vertex-wise anomaly detection. But the embedding process is the same for both, so you can do both graph-wise and vertex-wise anomaly detection with one set of embeddings.
So the things to decide are:
- Should graph-wise and vertex-wise be in the same class/function or separate ones? I think it makes sense to include both in the same class/function, since embedding is the more computationally intensive part.
- Should it be a function or a class? With a function, 5+ things need to be returned, but a class also doesn't conform to the classic sklearn-style API. We could do something like mgc does and return `(graph_anomaly_indices, graph_anomaly_dict, vertex_anomaly_indices, vertex_anomaly_dict)` or something along those lines. No opinion on this one.
- From your description, I agree that graph-wise and vertex-wise make sense in the same class/function
- class/function is a bit harder. Ya, it sounds like neither is super convenient. I guess this is why scipy has stuff like `OptimizeResult`, which is basically just a dict. Though I actually don't know if there are any advantages to using that over just a dict; normally I'd say no to a class just to store data, right? Based on all of this I'm leaning towards a function with dict output, but we could also ask Dwayne. A rough sketch of that option is below.
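To make the function-with-dict-output option concrete, here is a minimal sketch. Everything here is hypothetical: the name `graph_anomaly`, the dict keys, and the crude z-score thresholding (a stand-in for the actual bootstrap test); the only real API used is graspologic's `OmnibusEmbed`.

```python
import numpy as np
from graspologic.embed import OmnibusEmbed


def graph_anomaly(graphs, n_components=None, threshold=3.0):
    # Hypothetical API sketch -- not the final graspologic interface.
    # One shared (expensive) embedding step serves both graph-wise and
    # vertex-wise detection, per the discussion above.
    Zs = OmnibusEmbed(n_components=n_components).fit_transform(graphs)
    diffs = Zs[1:] - Zs[:-1]  # (T - 1, n, d) consecutive differences

    # Graph-wise statistic: spectral norm of each consecutive difference
    graph_stats = np.array([np.linalg.norm(D, ord=2) for D in diffs])
    # Vertex-wise statistic: per-vertex Euclidean distance at each step
    vertex_stats = np.linalg.norm(diffs, axis=-1)  # (T - 1, n)

    # Crude z-score rule, standing in for the real bootstrap procedure
    z = (graph_stats - graph_stats.mean()) / graph_stats.std()

    return {
        "graph_anomaly_indices": np.nonzero(z > threshold)[0],
        "graph_stats": graph_stats,
        "vertex_stats": vertex_stats,
    }
```

A dict return like this keeps the one-embedding/two-outputs structure without forcing a not-quite-sklearn class into the API.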
@bdpedigo @j1c do we still want this? I have code in a jupyter notebook I wrote for the book, which I'm about to clean up
one issue I ran into while implementing this / playing around with it in python is that it's super not robust to embedding dimension. As in, I looked at the distribution of true test statistics under the null and bootstrapped test statistics under the null, and if the embedding dimension differs from the true one, they look completely different from each other and the test is broken.
See below. The experiment is (a code sketch follows the list):

- Starting without adding any anomaly graphs or anything fancy, generate a single latent position matrix X = (X_1, …, X_n)^T, with each X_i ∈ R drawn ~ Uniform[0.2, 0.8]
- Generate A_1, A_2 ~ RDPG(X)
- Generate Xhat from OMNI(A_1, A_2), then take the first n rows
- Iterate 1000 times:
  - Generate _A, _B ~ RDPG(X)
  - Embed with OMNI(_A, _B) to get latent position estimates _X, _Y
  - Let y = ||_X - _Y||, the l2 operator norm (largest singular value of the difference)
  - Plot the distribution of y as a histogram
- Iterate 1000 times:
  - Generate A_, B_ ~ RDPG(Xhat)
  - Embed with OMNI(A_, B_) to get latent position estimates X_, Y_
  - Let y_ = ||X_ - Y_||
  - Plot the distribution of y_ as a histogram
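A minimal sketch of this experiment in Python, using graspologic's `rdpg` and `OmnibusEmbed`; `n = 100` and the seed are assumptions, since the thread doesn't specify them:

```python
import numpy as np
from graspologic.embed import OmnibusEmbed
from graspologic.simulations import rdpg

rng = np.random.default_rng(0)
n = 100  # number of vertices (assumed; not stated in the thread)

# True one-dimensional latent positions, X_i ~ Uniform[0.2, 0.8]
X = rng.uniform(0.2, 0.8, size=(n, 1))

# Observed pair of graphs and the plug-in estimate Xhat
# (first n rows of the omnibus embedding of A_1, A_2)
A1, A2 = rdpg(X), rdpg(X)
Xhat = OmnibusEmbed(n_components=1).fit_transform([A1, A2])[0]

def null_distribution(latent, d, n_iter=1000):
    """Statistics ||_X - _Y||_2 for graph pairs simulated from `latent`,
    embedded jointly with OMNI into d dimensions."""
    stats = np.empty(n_iter)
    for i in range(n_iter):
        A, B = rdpg(latent), rdpg(latent)
        _X, _Y = OmnibusEmbed(n_components=d).fit_transform([A, B])
        stats[i] = np.linalg.norm(_X - _Y, ord=2)  # largest singular value
    return stats

# d=1 is correct here; rerunning with d=2 reproduces the breakage:
# the two histograms below no longer look alike.
true_null = null_distribution(X, d=1)      # true statistics under the null
boot_null = null_distribution(Xhat, d=1)   # bootstrapped statistics
```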
The first image is what happens when you (incorrectly) embed down to two dimensions, and the second image is what happens when you (correctly) embed into one dimension.

This issue is somewhat alleviated in practice if you embed every adjacency matrix you have at once and then just take pairs (maybe since the variance is lower?), but you'll still get a pretty big bias.
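For contrast with the pair-at-a-time procedure above, a sketch of that mitigation, assuming `graphs` is the full list A_1, …, A_T:

```python
import numpy as np
from graspologic.embed import OmnibusEmbed

# Embed the entire time series in a single omnibus embedding, then form
# the test statistic from consecutive pairs of the joint embedding,
# instead of re-running OMNI separately on each pair.
Zs = OmnibusEmbed(n_components=1).fit_transform(graphs)  # (T, n, d)
stats = [np.linalg.norm(Zs[t + 1] - Zs[t], ord=2) for t in range(len(Zs) - 1)]
```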
Was thinking of hopping on zoom with Guodong at some point to talk about this
> @bdpedigo @j1c do we still want this? I have code in a jupyter notebook I wrote for the book, which I'm about to clean up
My first question would be "what is someone going to use it for?" Generally speaking, I only think it is worth maintaining code that is being used by us/MSR/someone else. If it is just for the book, that doesn't feel like a compelling enough case IMO.
> one issue I ran into while implementing this / playing around with it in python is that it's super not robust to embedding dimension. As in, I looked at the distribution of true test statistics under the null and bootstrapped test statistics under the null, and if the embedding dimension differs from the true one, they look completely different from each other and the test is broken.
This is the case for basically everything spectral-stats related that I have seen. Everything is super super dependent on embedding dimension, so I am not too surprised by this.