pandana
`ValueError: Buffer dtype mismatch` when constructing Network from pandas dataframe
Description of the bug
I cannot use `pdna.Network()` with my own pandas edge and node dataframes (built from an SQL query). There is some sort of type mismatch: the error says it is getting double when it is expecting long, but all values in the dataframes are integers, so I can't pinpoint where this is coming from. The tutorial says the `node_x`, `node_y` and weight values should be float in any case.
This error doesn't come up when I use `osm.pdna_network_from_bbox()` or your example `osm_bayarea.h5` data, and those even have float numbers, so I am assuming there is a specific way to construct the node and edge dataframes so they can be used by `pdna.Network`?
Network data (optional)
Our network is large and sits on an SQL database, so I'll just show the structure here. I've input an edge dataframe in this format (bigint from a pandas SQL query):
from | to | weight |
---|---|---|
1534152 | 1533645 | 839 |
1534051 | 1533659 | 1644 |
1534016 | 1534015 | 200 |
1534024 | 1534016 | 758 |
1534013 | 1534016 | 313 |
And the node data was in this format (bigint from a pandas sql query):
id | x | y |
---|---|---|
1539680 | 486522 | 240589 |
1539682 | 486522 | 240376 |
1539683 | 486531 | 240399 |
1539684 | 486540 | 240513 |
1539686 | 486563 | 240392 |
I also tried making sure the dtype of each data series exactly matched the `osm_bayarea.h5` data, but I got the same error.
Edges
id
1840193 1534152
1840213 1534051
1855844 1534016
1855845 1534024
1855841 1534013
Name: from, dtype: int64
id
1840193 1533645
1840213 1533659
1855844 1534015
1855845 1534016
1855841 1534016
Name: to, dtype: int64
id
1840193 839.0
1840213 1644.0
1855844 200.0
1855845 758.0
1855841 313.0
Name: weight, dtype: float32
Nodes
id
1539680 486522.0
1539682 486522.0
1539683 486531.0
1539684 486540.0
1539686 486563.0
Name: x, dtype: float64
id
1539680 240589.0
1539682 240376.0
1539683 240399.0
1539684 240513.0
1539686 240392.0
Name: y, dtype: float64
The only significant difference is that the network is cropped from a larger graph we have, so the node ids don't start from 0 but from an arbitrary point, but I don't know if that affects this.
Thank you very much for your hard work on this package, it is very appreciated and I hope I can help.
Environment
- Operating system: Ubuntu 16.04
- Python version: 3.5
- Pandana version: 0.4
Paste the code that reproduces the issue here:
net=pdna.Network(nodes["x"],
nodes["y"],
edges["from"],
edges["to"],
edges[["weight"]])
Paste the error message (if applicable):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-12c11d512036> in <module>()
3 edges["from"],
4 edges["to"],
----> 5 edges[["weight"]])
~/anaconda3/envs/test-environment/lib/python3.5/site-packages/pandana/network.py in __init__(self, node_x, node_y, edge_from, edge_to, edge_weights, twoway)
84 .astype('double')
85 .as_matrix(),
---> 86 twoway)
87
88 self._twoway = twoway
ana/src/cyaccess.pyx in pandana.cyaccess.cyaccess.__cinit__ (src/cyaccess.cpp:2186)()
ValueError: Buffer dtype mismatch, expected 'long' but got 'double'
The node ids have to be ints, so I'm guessing that for nodes["x"] and/or nodes["y"] the index (not the values) is of type double but should be of type long. Lemme know if that helps.
Thanks for the suggestion, I tested by including:
print(edges.index.dtype)
print(edges["from"].index.dtype)
print(edges["to"].index.dtype)
print(edges["weight"].index.dtype)
print(nodes.index.dtype)
print(nodes["x"].index.dtype)
print(nodes["y"].index.dtype)
And I got `int64` printed for all of them. So my dataframe matches `osm_bayarea.h5` for all index and column dtypes. However, the `osm_bayarea.h5` data works fine with `pdna.Network()`, whereas my dataframe returns the dtype mismatch error.
Hmm, from looking at the code, it's most likely with your edges. You might want to recreate this line of code with your data and see what the type of the resulting index is...
edges_df = pd.DataFrame({'from': edges["from"], 'to': edges["to"]}).join(edge_weights)
Okay I checked it like this:
edge_weights = edges["weight"]
edges_df = pd.DataFrame({'from': edges["from"], 'to': edges["to"]}).join(edge_weights)
print(edges_df.index.dtype)
print(edges_df['from'].index.dtype)
print(edges_df['to'].index.dtype)
print(edge_weights.index.dtype)
This returns `int64` for all 4 indexes; this is the same for the `osm_bayarea.h5` data too.
The only other difference I can see is that the x, y coordinates and weights are just integers turned into floats to fit the API docs (i.e. `486540.0`), but I'm not sure if that relates.
Not sure on this one. My guess is it's something fairly simple we're missing. Might need sample data and sample code to diagnose it...
Agreed, let me do some internal testing with sample data from different sources (I've only tried this with the SQL-derived dataframe) and I'll get back to you either way. Thanks very much for your help!
Hello, I'm having the same issue and just found this thread. Was the problem ever resolved?
@double-u-a we wanted to check in on this to see if you had any updates: https://github.com/UDST/pandana/issues/88#issuecomment-318433914. It's been a while.
@sablanchard Yes I have been really meaning to get back to this, we've been busy completely rebuilding our geodatabase so I haven't had the opportunity to create the sample datasets for testing/reproduction of the issue. Fortunately the datasets should be ready in the next week or two. @lmnoel if you have some test data that reproduces the issue already then please do share in the meantime 👍
I'm trying to merge external data with a set of edges/nodes data frames returned from osm.network_from_bbox(), and from my testing, the mere act of concatenating a single row (with each column matching the dtype of the osm.network_from_bbox() DF's precisely) produces this error. @double-u-a @sablanchard
Edit: I think I have solved my issue. It turned out there was an issue with how I was constructing my DF to merge with the osm.network_from_bbox() DF, such that not every node in the edges DF was contained in the nodes DF. An explicit check/warning for this in the Net constructor might be helpful.
I suggest something to the effect of the following line be added to the network constructor:
assert len((set([i[0] for i in edge_from.index] + [i[1] for i in edge_from.index])) - set(node_x.index)) <= 0, "Error: edges contain unspecified nodes"
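A minimal sketch of such a check, written as a standalone helper rather than as pandana code. The function name `find_missing_nodes` is hypothetical, and the sketch assumes that `edge_from` and `edge_to` hold node ids as their values and that the node ids are the index of `node_x`, as in the `pdna.Network` calls shown earlier in this thread.

```python
import pandas as pd


def find_missing_nodes(node_x, edge_from, edge_to):
    """Return the node ids used by edges but absent from the node index.

    Hypothetical helper for illustration; it is not part of pandana.
    """
    # Every node id that appears as an edge endpoint.
    referenced = pd.Index(edge_from).union(pd.Index(edge_to))
    # Keep only the ids that the node index does not contain.
    return referenced.difference(node_x.index).sort_values().tolist()


# Toy data: node ids 1-3 are defined, but one edge points at node 4.
node_x = pd.Series([0.0, 1.0, 2.0], index=[1, 2, 3])
edge_from = pd.Series([1, 2, 3])
edge_to = pd.Series([2, 3, 4])

missing = find_missing_nodes(node_x, edge_from, edge_to)
if missing:
    print("Edges reference nodes that are not in the node index:", missing)
```

Returning the offending ids, rather than only asserting, makes it easier to see which rows of the edge table need fixing before the network is built.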
Hello, many apologies for the delay on this. I've rebuilt my geodb with fresh data and I am still reproducing the error. As per @lmnoel I've checked whether my edge and node sets match, and as far as I can tell the nodes and edges all match. The data is being retrieved from a pgsql db via pandas, and the data types match the example data and what osm.network_from_bbox() builds.
I can confirm running pandas.to_csv and then pandas.read_csv seems to make the dataframe work without error when running pdna.Network, so something in pandas sql derived dataframe is causing an error despite the correct dtypes. At this point it may well be a bug in pandas for all I know.
Hello again!
This problem keeps coming up when I use the library, so I've worked on a self contained example that replicates the error.
Also note deprecation warnings in log at bottom.
import pandas as pd
import pandana as pdna

# declare graph as dictionary
edge_dict = {
'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'id_node_source': {0: 1, 1: 1, 2: 2, 3: 3, 4: 4},
'id_node_target': {0: 4, 1: 2, 2: 4, 3: 4, 4: 2},
'distance': {0: 355.91725215477004,
1: 339.0527044990422,
2: 542.0301068103291,
3: 405.7927520128794,
4: 698.3406580590387}}
node_dict = {
'id_node': {0: 1, 2: 2, 3: 3, 4: 4},
'x': {0: 523991.2039019342,
2: 524221.758848412,
3: 523816.78407285974,
4: 524193.69128971046},
'y': {0: 2944562.7472850494,
2: 2944811.345662121,
3: 2944420.40466592,
4: 2944270.042744304}}
# read dictionary into dataframe
edges_topo = pd.DataFrame.from_dict(edge_dict)
nodes_gdf = pd.DataFrame.from_dict(node_dict)
net = pdna.Network(node_x = nodes_gdf["x"],
node_y = nodes_gdf["y"],
edge_from = edges_topo["id_node_source"],
edge_to = edges_topo["id_node_target"],
edge_weights = edges_topo[["distance"]])
C:\Apps\Anaconda\envs\ium\lib\site-packages\pandana\network.py:82: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
nodes_df.astype('double').as_matrix(),
C:\Apps\Anaconda\envs\ium\lib\site-packages\pandana\network.py:83: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
edges.as_matrix(),
C:\Apps\Anaconda\envs\ium\lib\site-packages\pandana\network.py:85: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
.astype('double')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-828612a5b386> in <module>
5 edge_from = edges_topo["id_node_source"],
6 edge_to = edges_topo["id_node_target"],
----> 7 edge_weights = edges_topo[["distance"]])
C:\Apps\Anaconda\envs\ium\lib\site-packages\pandana\network.py in __init__(self, node_x, node_y, edge_from, edge_to, edge_weights, twoway)
85 .astype('double')
86 .as_matrix(),
---> 87 twoway)
88
89 self._twoway = twoway
src\cyaccess.pyx in pandana.cyaccess.cyaccess.__cinit__()
ValueError: Buffer dtype mismatch, expected 'long' but got 'double'
@wa-bhe, could you set the nodes_gdf index to "id_node" and then try it again? My understanding is that the nodes DF needs to be properly indexed to work.
Adding an index as you suggested @semcogli creates a Network dataframe successfully. Many thanks!
nodes_gdf.set_index('id_node', inplace= True)
That does make sense, given that the function is expecting a graph created by osmnet.
I could make a PR on the docs to add a generic geodataframe loading section, specifying that the node layer needs to be indexed by the node id?
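One plausible reading of why the index matters, sketched with pandas only. The assumption here, which is not confirmed in this thread, is that the constructor translates each edge endpoint into a row position by looking it up in the node index. If so, any id missing from that index becomes NaN and the integer column is upcast to float, which would match the "expected 'long' but got 'double'" message. The `positions` function and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Same node ids as the example above; the default index is 0, 2, 3, 4.
nodes = pd.DataFrame(
    {"id_node": [1, 2, 3, 4], "x": [0.0, 1.0, 2.0, 3.0]},
    index=[0, 2, 3, 4],
)
edge_from = pd.Series([1, 1, 2, 3, 4])


def positions(node_index, node_ids):
    """Look up each node id's row position; ids not in the index give NaN."""
    lookup = pd.Series(np.arange(len(node_index)), index=node_index)
    return lookup.reindex(node_ids.to_numpy()).to_numpy()


# Lookup against the default index: id 1 is not found, so the result
# contains NaN and has a float dtype.
unindexed = positions(nodes.index, edge_from)
print(unindexed.dtype)

# Lookup after indexing by node id: every id is found, so the result
# keeps an integer dtype.
indexed = positions(nodes.set_index("id_node").index, edge_from)
print(indexed.dtype)
```

Under that assumption, both fixes reported in this thread (setting the node index to the node id, and dropping edges that point at unknown nodes) remove the NaN values that force the upcast.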
I recently had the same problem and figured out it was because there were edges referencing non-existent nodes. I fixed it by filtering edges_gdf using this line of code:
edges_gdf = edges_gdf[edges_gdf['to'].isin(nodes_gdf['id_node']) & edges_gdf['from'].isin(nodes_gdf['id_node'])]
`nodes.set_index('ID', inplace= True)` worked for me, thanks