pygraphistry
Feature request on native support for Dgraph database.
I see the project already supports Neo4j and TigerGraph for accessing data. I was wondering if folks would be interested in supporting Dgraph (github) as well, an open-source, fast, distributed, and transactional graph database. I work at Dgraph and am happy to help with this addition to pygraphistry. Some of our users have shown interest in the visualization capabilities that pygraphistry offers.
Hi @anurags92 !
Happy to help get that landed, especially as we're about to launch a cloud tier that will help them get going faster. Our plugins are fairly short in practice: basically, we just need to implement a couple of methods, dgraph_auth() and dgraph_query_to_nodes_and_edges_dataframes(), ideally via the optimized Apache Arrow binary format.
Maybe we collaborate via a https://colab.research.google.com/ notebook on the above methods for some dgraph sandbox DB, and I shepherd it into the main release from there?
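To make the shape of dgraph_query_to_nodes_and_edges_dataframes() concrete, here's a minimal sketch of the core step: flattening a Dgraph query response (JSON) into node & edge dataframes. This is an illustrative helper, not part of pygraphistry; the assumption is that each object carries a `uid` and that the caller names which predicates are edges:

```python
import json
import pandas as pd

def dgraph_json_to_dataframes(result_json, edge_predicates):
    """Flatten a Dgraph query result into (nodes_df, edges_df).

    Hypothetical helper: assumes every object has a 'uid' and that
    predicates in edge_predicates point at child objects.
    """
    nodes, edges = {}, []

    def walk(obj):
        uid = obj.get('uid')
        # Scalar fields become node attributes
        attrs = {k: v for k, v in obj.items() if not isinstance(v, (list, dict))}
        nodes.setdefault(uid, {}).update(attrs)
        for pred in edge_predicates:
            children = obj.get(pred, [])
            if isinstance(children, dict):
                children = [children]
            for child in children:
                edges.append({'src': uid, 'dst': child.get('uid'), 'predicate': pred})
                walk(child)

    data = json.loads(result_json) if isinstance(result_json, str) else result_json
    for block in data.values():  # one entry per top-level query block
        for obj in block:
            walk(obj)

    return pd.DataFrame(nodes.values()), pd.DataFrame(edges)
```

The query itself could come from the official pydgraph client (e.g. `client.txn(read_only=True).query(...)` and parsing the response JSON), with something like the above applied to the result.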
Hi @lmeyerov, apologies for the delay. Been keeping really busy. Maybe I didn't understand, but do you want me to expose a sandbox db in one of the notebooks over there?
Hi @anurags92 ! Yes, having a live dgraph instance + colab notebook to collaborate on would help, and especially an example of going from query to dataframe with it.
Some recent updates on our side that should make this easier & faster:
- Graphistry Hub launched, starting with free dev API accounts: https://www.graphistry.com/get-started
- Our 2.0 upload API is out, which supports fast binary uploads via Apache Arrow:
-- https://www.graphistry.com/blog/graphistry-2-29-5-upload-100x-more-rapids-0-13-learnrapids-com-and-more
-- REST: https://hub.graphistry.com/docs/api/#upload2
-- ... via Python: https://github.com/graphistry/pygraphistry/blob/master/graphistry/plotter.py#L700 + https://github.com/graphistry/pygraphistry/blob/master/graphistry/arrow_uploader.py (auto-coercions when doing graphistry.edges(pd.read_csv('...')), graphistry.edges(cudf.read_csv('...')), etc.)
My thinking is we start with a simple one via ^^^ for PyGraphistry, just need a notebook sandbox we can collaborate in, and then look at another popular client like JS.
@anurags92 Just wanted to ping on this.
A good first step may be a sample notebook doing dgraph query -> graphistry viz, even before it's built in
@lmeyerov Apologies for being MIA. I had taken a hiatus from dgraph. Since I'm back, I'm looking to clear out old items. This looks like an easy win. I have a dgraph setup with a DQL query. Since we first discussed this, dgraph is now available in the cloud via its own offering at cloud.dgraph.io. Let me know if you'd still be interested in collaborating on landing this.
Great, this would be of high interest!
Things are advancing a bit as we prep for ChatGPT support + our no-code SaaS launch, but in both cases, the work starts with the above. For a dgraph cloud demo dataset, can you start a notebook that does query -> node+edge pandas dataframes? If an official dgraph python client maintains that step, even better!
- @DataBoyTX for visibility
@lmeyerov I have a Colab notebook set up here. It has very minimal data and a query. We can start working on this. Let me know the next steps.
I made a pass last weekend but wasn't able to get data out of the db instance in the Google Colab -- let me pass it to folks here to see if someone can riff on it. I think the next step is to drop the results into a dataframe.
Does dgraph support introspection of the db schema for datatypes, vs. just JSON? E.g., to know some field is a timestamp. Ultimately, we want to get the data into pandas/cudf node/edge dataframes, ideally with Apache Arrow-conformant datatypes for fast & safe processing. Maybe there's a manual way, or a client library other users rely on for dgraph <> dataframes that we can align on (we've found that simplifies stability long-term)?
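Dgraph does expose predicate types via a DQL `schema {}` query. A minimal sketch of using that to coerce dataframe columns; the Dgraph-to-pandas dtype mapping and the shape of the schema records here are one reasonable reading, not an official spec:

```python
import pandas as pd

# Assumed mapping from Dgraph scalar type names to pandas/Arrow-friendly dtypes
DGRAPH_TO_PANDAS = {
    'string': 'string',
    'int': 'Int64',
    'float': 'float64',
    'bool': 'boolean',
    'datetime': 'datetime64[ns]',  # e.g., so timestamps are real timestamps
    'uid': 'string',
}

def coerce_schema(df, schema):
    """Cast columns using Dgraph schema records shaped like
    [{'predicate': 'age', 'type': 'int'}, ...] (illustrative shape)."""
    for pred in schema:
        col, dtype = pred['predicate'], DGRAPH_TO_PANDAS.get(pred['type'])
        if col in df.columns and dtype:
            df[col] = df[col].astype(dtype)
    return df
```

With columns typed this way, the pandas -> Arrow conversion pygraphistry does on upload should preserve the intended datatypes.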
Hey Everyone. I'm with Dgraph also and am exploring using Graphistry for visualization of a large Dgraph cluster at an upcoming conference. I've managed to get @anurags92 's notebook (I made a copy) connecting to a Dgraph cloud instance and I managed to transform the Dgraph query result into your
{
"graph": [
...
],
"bindings": {
...
},
"labels": [
...
]
}
format that I've seen in the docs with regard to graphing JSON. I couldn't find a pygraphistry function, however, that would render this JSON format. Probably something obvious...
If you want to see the updated notebook, I've opened it up here: https://colab.research.google.com/drive/1EDv8IFNI-A6cqqbVArGNGyEMOK6BYZ0i?usp=sharing
Note that this notebook uses getpass for obtaining both the Dgraph cloud api key and the Graphistry account password, so you'll need to hit me up privately if you want to run it.
We'd be delighted to hop on a call with Graphistry to iron this out and maybe explore how we can get a native Dgraph data connector integrated with pygraphistry.
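For reference, one way that {graph, bindings, labels} shape could be converted into dataframes and fed through the regular pygraphistry API. This is a sketch: the binding key names (sourceField, destinationField, idField) are assumptions based on the legacy JSON format, so adjust to whatever the actual document uses:

```python
import pandas as pd

def legacy_json_to_frames(doc):
    """Split a {graph, bindings, labels} document into dataframes.

    Assumes 'graph' holds edge records, 'labels' holds node records,
    and 'bindings' names the source/destination/id fields.
    """
    b = doc['bindings']
    edges_df = pd.DataFrame(doc['graph'])
    nodes_df = pd.DataFrame(doc['labels'])
    return nodes_df, edges_df, b
```

From there, the usual flow applies: `graphistry.edges(edges_df, b['sourceField'], b['destinationField']).nodes(nodes_df, b['idField']).plot()`.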
@matthewmcneely awesome
If you're in pandas, you can directly load JSON in if it's already flat:
import pandas as pd
import graphistry
people = [ {'first_name': 'a', 'age': 20}, {'first_name': 'bb', 'age': 30} ]
nodes_df = pd.DataFrame(people)
# Repeat for edges
links = [ {'user_1_name': 'a', 'user_2_name': 'bb'} ]
edges_df = pd.DataFrame(links)
graphistry.nodes(nodes_df, 'first_name').edges(edges_df, 'user_1_name', 'user_2_name').plot()
There are a bunch of flattening tricks if the data isn't flat yet, e.g., https://towardsdatascience.com/how-to-convert-json-into-a-pandas-dataframe-100b2ae1e0d8
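For nested results like Dgraph's, one concrete flattening trick is pandas' built-in json_normalize (hypothetical sample data):

```python
import pandas as pd

records = [
    {'name': 'alice', 'address': {'city': 'NYC'}, 'pets': [{'kind': 'cat'}]},
    {'name': 'bob', 'address': {'city': 'SF'}, 'pets': []},
]

# Nested objects become dotted columns, e.g. 'address.city'
flat = pd.json_normalize(records)

# Child lists can be exploded into their own table, carrying parent fields along
pets = pd.json_normalize(records, record_path='pets', meta=['name'])
```

The exploded child table is often exactly what you want as an edge list (parent id + child attributes).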
Note: underneath, graphistry will convert the pandas dataframe to Apache Arrow pre-upload, so for bigger graphs this ends up being quite fast on the python -> graphistry side, even on 1M-row files, and I'm guessing there are similar tricks to make dgraph -> python snappy too!
Ping leo@<our site . com> + tcook@, and we can help out?
@lmeyerov
Thanks for your guidance. I was able to get a basic graph working. I had to write a custom python parser to extract nodes and edges from our JSON results. It's nowhere near complete, but a good starting point. BTW, the notebook is updated.
Awesome!
I should also share: when dgraph returns "just" a table, vs. a node table & an edge table, another cool binding here can be .hypergraph():
df = pd.read_csv('logs.csv')
# Extract & connect unique values from columns src_ip, dst_ip, alert col
# and choose whether to make a node for each row or not.
# Remaining table columns appear as attributes
g1 = graphistry.hypergraph(
df,
['src_ip', 'dst_ip', 'alert'],
direct=True
)['graph']
g1.plot()
# Control options like which edges to generate and which IDs live in the same namespace
g2 = graphistry.hypergraph(
df,
['src_ip', 'dst_ip', 'alert'],
direct=True,
opts={
'CATEGORIES': {
'ip': ['src_ip', 'dst_ip']
},
'EDGES': {
'src_ip': ['dst_ip'],
'alert': ['src_ip', 'dst_ip']
}
})['graph']
g2.plot()
This may be more interesting to reexamine as louie.ai goes to more cohorts & reaches GA