graspologic
implement anomaly detection in time series of graphs
paper forthcoming
@bdpedigo this became important, as per cep's request, because JL@MSR wants it in graspy now.
Guodong has an R package, so it should be a simple matter of porting to Python. Unclear who the right person is for this?
Maybe @j1c ?
@jovo could we get more context? Do you have the R code/paper? (I think I saw a pdf a long time ago but haven't seen anything recently.)
Ask Guodong
@bdpedigo Thoughts on the API? There are a few wrinkles in this.
- There is no out-of-sample prediction. You train and predict on the input dataset, so the standard `fit` and `fit_predict` don't make sense. It could be a class with just `fit_predict`, but then it could just be a function.
- There is graph-wise and vertex-wise anomaly detection. But the embedding process is the same for both, so you can do both graph-wise and vertex-wise anomaly detection with one set of embeddings.
So the things to decide are:
- Should graph-wise and vertex-wise be in the same class/function or separate ones? I think it makes sense to include both in the same class/function, since embedding is the more computationally intensive part.
- Should it be a function or a class? With a function, 5+ things need to be returned, but a class also doesn't conform to the classic sklearn-style API. We could do something like mgc does and return `(graph_anomaly_indices, graph_anomaly_dict, vertex_anomaly_indices, vertex_anomaly_dict)` or something along those lines. No opinion on this one.
- From your description, I agree that graph-wise and vertex-wise make sense in the same class/function
- class/function is a bit harder. Ya, it sounds like neither is super convenient. I guess this is why scipy has stuff like `OptimizeResult`, which is basically just a dict. Though I actually don't know if there are any advantages to using that over just a dict; normally I'd say no to a class just to store data, right? Based on all of this I'm leaning towards a function with dict output, but we could also ask Dwayne. A rough sketch of that option is below.
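To make the function-with-dict-output option concrete, here is a minimal sketch. Everything here is hypothetical: the name `graph_anomaly`, the dict keys, and the crude z-score thresholding (a stand-in for the actual bootstrap test); the only real API used is graspologic's `OmnibusEmbed`.

```python
import numpy as np
from graspologic.embed import OmnibusEmbed


def graph_anomaly(graphs, n_components=None, threshold=3.0):
    # Hypothetical API sketch -- not the final graspologic interface.
    # One shared (expensive) embedding step serves both graph-wise and
    # vertex-wise detection, per the discussion above.
    Zs = OmnibusEmbed(n_components=n_components).fit_transform(graphs)
    diffs = Zs[1:] - Zs[:-1]  # (T - 1, n, d) consecutive differences

    # Graph-wise statistic: spectral norm of each consecutive difference
    graph_stats = np.array([np.linalg.norm(D, ord=2) for D in diffs])
    # Vertex-wise statistic: per-vertex Euclidean distance at each step
    vertex_stats = np.linalg.norm(diffs, axis=-1)  # (T - 1, n)

    # Crude z-score rule, standing in for the real bootstrap procedure
    z = (graph_stats - graph_stats.mean()) / graph_stats.std()

    return {
        "graph_anomaly_indices": np.nonzero(z > threshold)[0],
        "graph_stats": graph_stats,
        "vertex_stats": vertex_stats,
    }
```

A dict return like this keeps the one-embedding/two-outputs structure without forcing a not-quite-sklearn class into the API.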
@bdpedigo @j1c do we still want this? I have code in a jupyter notebook I wrote for the book, which I'm about to clean up
one issue I ran into while implementing this / playing around with it in python is that it's super not robust to embedding dimension. As in, I looked at the distribution of true test statistics under the null and bootstrapped test statistics under the null, and if the embedding dimension differs from the true one, they look completely different from each other and the test is broken.
See below. The experiment is (a code sketch follows the list):

- Starting without adding any anomaly graphs or anything fancy, generate a single latent position matrix X = (X_1, …, X_n)^T, with each X_i ∈ R drawn ~ Uniform[0.2, 0.8]
- Generate A_1, A_2 ~ RDPG(X)
- Generate Xhat from OMNI(A_1, A_2), then take the first n rows
- Iterate 1000 times:
  - Generate _A, _B ~ RDPG(X)
  - Embed with OMNI(_A, _B) to get latent position estimates _X, _Y
  - Let y = ||_X - _Y||, the l2 operator norm (largest singular value of the difference)
  - Plot the distribution of y as a histogram
- Iterate 1000 times:
  - Generate A_, B_ ~ RDPG(Xhat)
  - Embed with OMNI(A_, B_) to get latent position estimates X_, Y_
  - Let y_ = ||X_ - Y_||
  - Plot the distribution of y_ as a histogram
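A minimal sketch of this experiment in Python, using graspologic's `rdpg` and `OmnibusEmbed`; `n = 100` and the seed are assumptions, since the thread doesn't specify them:

```python
import numpy as np
from graspologic.embed import OmnibusEmbed
from graspologic.simulations import rdpg

rng = np.random.default_rng(0)
n = 100  # number of vertices (assumed; not stated in the thread)

# True one-dimensional latent positions, X_i ~ Uniform[0.2, 0.8]
X = rng.uniform(0.2, 0.8, size=(n, 1))

# Observed pair of graphs and the plug-in estimate Xhat
# (first n rows of the omnibus embedding of A_1, A_2)
A1, A2 = rdpg(X), rdpg(X)
Xhat = OmnibusEmbed(n_components=1).fit_transform([A1, A2])[0]

def null_distribution(latent, d, n_iter=1000):
    """Statistics ||_X - _Y||_2 for graph pairs simulated from `latent`,
    embedded jointly with OMNI into d dimensions."""
    stats = np.empty(n_iter)
    for i in range(n_iter):
        A, B = rdpg(latent), rdpg(latent)
        _X, _Y = OmnibusEmbed(n_components=d).fit_transform([A, B])
        stats[i] = np.linalg.norm(_X - _Y, ord=2)  # largest singular value
    return stats

# d=1 is correct here; rerunning with d=2 reproduces the breakage:
# the two histograms below no longer look alike.
true_null = null_distribution(X, d=1)      # true statistics under the null
boot_null = null_distribution(Xhat, d=1)   # bootstrapped statistics
```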
The first image is what happens when you (incorrectly) embed down to two dimensions, and the second image is what happens when you (correctly) embed into one dimension.

This issue is somewhat alleviated in practice if you embed every adjacency matrix you have at once and then just take pairs (maybe since the variance is lower?), but you'll still get a pretty big bias.
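For contrast with the pair-at-a-time procedure above, a sketch of that mitigation, assuming `graphs` is the full list A_1, …, A_T:

```python
import numpy as np
from graspologic.embed import OmnibusEmbed

# Embed the entire time series in a single omnibus embedding, then form
# the test statistic from consecutive pairs of the joint embedding,
# instead of re-running OMNI separately on each pair.
Zs = OmnibusEmbed(n_components=1).fit_transform(graphs)  # (T, n, d)
stats = [np.linalg.norm(Zs[t + 1] - Zs[t], ord=2) for t in range(len(Zs) - 1)]
```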
Was thinking of hopping on zoom with Guodong at some point to talk about this
> @bdpedigo @j1c do we still want this? I have code in a jupyter notebook I wrote for the book, which I'm about to clean up
My first question would be "what is someone going to use it for?" Generally speaking, I only think it is worth maintaining code that is being used by us/MSR/someone else. If it is just for the book, that doesn't feel like a compelling enough case IMO.
> one issue I ran into while implementing this / playing around with it in python is that it's super not robust to embedding dimension. As in, I looked at the distribution of true test statistics under the null and bootstrapped test statistics under the null, and if the embedding dimension differs from the true one, they look completely different from each other and the test is broken.
This is the case for basically everything spectral-stats related that I have seen. Everything is super super dependent on embedding dimension, so I am not too surprised by this.