graspologic
[Question] LatentDistributionTest n_components incorrect?
This is our forum for asking whatever network question you'd like! No need to feel shy - we're happy to talk about graphs!
I have been trying to run LatentDistributionTest using several graphs. On some pairs of graphs I get the following error: "n_components must be <= min(X.shape)". This comes from svd.py line 241. Originally I left n_components as None to let Graspologic choose the correct value. Then I tried setting n_components to 256.
I have noticed that some of the graphs that fail have a low edge count.
I am sending in NetworkX graph objects and letting graspologic handle the embedding.
In addition to the issue above, the LatentDistributionTest code runs for six hours or so but does not seem to take much memory or CPU time.
Can you elaborate on the expectations for n_components? Should I do something different?
hey @jonmclean! the number of components is the number of dimensions that you want to embed into. typically this is a small number compared to the graph size. sometimes one has a priori knowledge of the number of components: for example, if you have a hypothesis that your graph comes from a stochastic blockmodel with three blocks, it makes sense to embed it into three dimensions. in general, however, when there is no such knowledge, we suggest using the Zhu and Ghodsi automatic selection procedure, which you can do by setting n_components to None. it does not make sense for the number of components to be larger than the size of your graph; a slightly hand-wavy comparison with regular data science would be trying to run a PCA with 100 components on data that only has 10 features.
we would need more information regarding the LDT taking so much time. in particular, could you provide the size of your data, the exact LDT function call, and ideally your machine specs?
cheers
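For illustration, here is a minimal sketch of the elbow-selection route described above (not from the thread; it assumes a synthetic SBM from graspologic.simulations just to have a graph to embed):

```python
import numpy as np
from graspologic.embed import AdjacencySpectralEmbed, select_dimension
from graspologic.simulations import sbm

# a 3-block stochastic blockmodel with 300 vertices
A = sbm([100, 100, 100], 0.05 * np.ones((3, 3)) + 0.25 * np.eye(3))

# Zhu & Ghodsi elbow selection on the singular values of A
elbows, singular_values = select_dimension(A, n_elbows=2)
print(elbows)  # candidate embedding dimensions

# equivalently, let ASE pick the dimension itself
X = AdjacencySpectralEmbed(n_components=None).fit_transform(A)
print(X.shape)  # (300, d) with d chosen automatically
```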
When I set n_components to None so it would use the Zhu and Ghodsie selection procedure I got this error:
"n_components must be <= min(X.shape)". This comes from svd.py line 241. It seemed to be choosing an invalid n_components.
Could you try to pass your two graphs into AdjacencySpectralEmbed individually? You'd have to use .fit_transform there.
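A sketch of what that suggestion looks like in code (the `adj1`/`adj2` stand-in graphs are assumptions for illustration, not from the thread):

```python
from graspologic.embed import AdjacencySpectralEmbed
from graspologic.simulations import er_np

# stand-in graphs, just so the sketch runs
adj1 = er_np(200, 0.1)
adj2 = er_np(150, 0.1)

# embed each graph separately; n_components=None triggers elbow selection per graph
X1 = AdjacencySpectralEmbed(n_components=None).fit_transform(adj1)
X2 = AdjacencySpectralEmbed(n_components=None).fit_transform(adj2)
print(X1.shape, X2.shape)  # see which dimension each graph gets on its own
```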
@jonmclean can you share the graph sizes for us?
agree that graph sizes would help - but this did help me realize a very weird edge case: if the ZG2 elbow for one graph is bigger than the size of the other graph, I think this error would get thrown. I believe we take the max ZG2 elbow between the two graphs.
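If that's right, a hypothetical reproduction might look like the following (a sketch under that assumption, not a confirmed test case):

```python
import numpy as np
from graspologic.inference import LatentDistributionTest
from graspologic.simulations import sbm

# a large graph whose ZG2 elbow should land around 10 dimensions
big = sbm([300] * 10, 0.02 * np.ones((10, 10)) + 0.3 * np.eye(10))
# a tiny graph with only 6 vertices -- fewer than the other graph's elbow
tiny = sbm([3, 3], [[0.9, 0.1], [0.1, 0.9]])

ldt = LatentDistributionTest()  # n_components=None -> elbow-based selection
ldt.fit(big, tiny)  # expected to raise "n_components must be <= min(X.shape)"
```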
I am calculating the latent distribution test for the cross-product of the following graphs:
GraphA has 1695391 edges, GraphB has 463153 edges, GraphC has 2101995 edges, GraphD has 161548 edges, GraphE has 50 edges, GraphF has 5 edges, GraphG has 1 edge, and GraphH has 62948 edges.
So I am calculating the latent distribution test for (GraphA, GraphB), (GraphA, GraphC), etc etc.
As @alyakin314 suggested, I changed the code to run ASE on each graph first (n_components = 256). To get past this issue I skipped all graphs with fewer than 256 edges. Comparing the smaller graphs to the large ones would be unlikely to return much useful information anyway.
This is the code I am running. `graphs` is a dictionary mapping each id to a NetworkX graph object.
```python
import networkx as nx
from graspologic.embed import AdjacencySpectralEmbed
from graspologic.inference import LatentDistributionTest

embeddedGraphs = {}
for graph_id, graph in graphs.items():
    logger.info(f'{graph_id} contains {graph.number_of_edges()} edges. Starting embedding')
    if graph.number_of_edges() > 256:
        # sort nodes for a stable vertex ordering in the adjacency matrix
        numpyGraph = nx.to_numpy_array(graph, nodelist=sorted(graph.nodes), dtype=float)
        ase = AdjacencySpectralEmbed(n_components=256)
        embeddedGraphs[graph_id] = ase.fit_transform(numpyGraph)
        logger.info(f'Finished embedding {graph_id}')
    else:
        logger.info(f'{graph_id} does not meet the minimum edge count. Skipping')

test = LatentDistributionTest(input_graph=False, workers=12)
for id1, graph1 in embeddedGraphs.items():
    for id2, graph2 in embeddedGraphs.items():
        logger.info(f'Calculating nonpar for {id1} and {id2}')
        result = test.fit_predict(graph1, graph2)
        logger.info(f'{id1} - {id2} RESULT: {result}')
```
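One side note on the loop above (an optional optimization, not something raised in the thread): aside from bootstrap randomness the test statistic is symmetric in its two inputs, so the full cross-product does roughly double the necessary work and also tests each graph against itself. A sketch reusing the names from the snippet above:

```python
from itertools import combinations

# only distinct unordered pairs: (A, B) but not (B, A) or (A, A)
for (id1, X1), (id2, X2) in combinations(embeddedGraphs.items(), 2):
    result = test.fit_predict(X1, X2)
    logger.info(f'{id1} - {id2} RESULT: {result}')
```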
sorry, to clarify: by size I mean the number of vertices.
@jonmclean is currently running the job and "when it fails, I'll get the number of vertices"
@alyakin314 here are the vertex counts for each graph:
GraphA contains 1695391 edges and 13106 nodes; GraphB contains 463153 edges and 7640 nodes; GraphC contains 2101995 edges and 12094 nodes; GraphD contains 161548 edges and 5325 nodes; GraphE contains 50 edges and 13 nodes; GraphF contains 5 edges and 5 nodes; GraphG contains 1 edge and 2 nodes; GraphH contains 62948 edges and 1944 nodes.
Based on the number of vertices, I think this is what's happening:

> but this did help me realize a very weird edge case: if the ZG2 elbow for one graph is bigger than the size of the other graph, I think this error would get thrown. I believe we take the max ZG2 elbow between the two graphs.
@bdpedigo Would this case happen if you are explicitly supplying n_components?
I don't believe so, as long as the number of components is always smaller than the size of the smallest graph. If you supply n_components that is bigger than the size of the smallest graph (or maybe the size of the smallest graph - 1, for reasons I can explain), then you'd get a similar error.
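A minimal sketch of that explicit-n_components failure mode (assumed behavior for illustration, not taken from the thread):

```python
from graspologic.embed import AdjacencySpectralEmbed
from graspologic.simulations import er_np

small = er_np(5, 0.5)  # a 5-vertex graph
ase = AdjacencySpectralEmbed(n_components=10)  # more dimensions than vertices
ase.fit_transform(small)  # raises: "n_components must be <= min(X.shape)"
```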
Practically speaking, for @jonmclean's problem, is it possible to just throw out graphs E, F, and G? In general I don't think this method would even make much sense for graphs that small.
Regarding what to do in the code, one hacky solution when n_components is not passed is

`n_components = min(min(len(A), len(B)), max(elbow_A, elbow_B))`

When n_components is specified but doesn't work with the size of the graphs, do we want some other kind of error to be thrown? I think this is a question for our ASE/SVD implementations.
Also just a note on usage - I have found the dimensionality that you select to matter a lot in terms of the results you get with this method, so it may be worth exploring if you're trying to use this on real data.
I have ignored the smaller graphs. I think this is mostly a documentation & error message concern. I propose the following documentation changes:
- Make the relationship between n_components and the size of the graph clear.
- Make a note in the method documentation about issues that may occur with auto-calculated n_components when there is a large disparity in the size of the two graphs.
I am new to working with graphs so I found the error messages confusing.
Just a sidebar comment, but this is a perfect example of keeping in mind what sorts of users are going to be using graspologic - it isn't just graph benders, but any developer who has a graph and is trying to figure out the tools for the first time.