spark-dgraph-connector
A connector for Apache Spark and PySpark to Dgraph databases.
Loading the dbpedia schema takes 30s. Investigate and try to improve performance. Related to #45.
The Dgraph Java Client uses gRPC, which has a concept called a deadline for requests. This is a timeout that should be configurable and high enough for long-running queries of...
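In the actual connector this would be set on the gRPC stub (e.g. via `withDeadlineAfter` in the Java client); as an illustration only, here is a minimal Python sketch of the deadline idea, where the deadline is an absolute point in time computed once, rather than a per-attempt timeout:

```python
import time

class DeadlineExceeded(Exception):
    """Raised when a call outlives its deadline (mirrors gRPC's DEADLINE_EXCEEDED)."""

def with_deadline(fn, timeout_s):
    # A deadline is absolute: computed once up front, so all attempts made
    # through the returned wrapper share the same remaining time budget.
    deadline = time.monotonic() + timeout_s
    def call(*args, **kwargs):
        if time.monotonic() >= deadline:
            raise DeadlineExceeded(f"deadline of {timeout_s}s exceeded")
        return fn(*args, **kwargs)
    return call

# Hypothetical usage: wrap a query function with a generous deadline
query = with_deadline(lambda q: f"result for {q}", timeout_s=30.0)
query("schema {}")  # well within the deadline
```

A configurable value matters because a deadline that is right for point lookups is far too tight for full-schema or analytical queries.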
The Dgraph Java Client uses gRPC to communicate with the alpha nodes. It supports TLS and HTTPS. These should be supported by the connector as well. The connector also fetches some...
Similar to the wide node source, a data source that supports edges and list properties could be useful. Edges are predicates that have a list of uids as the value....
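A sketch of what such an edge source might produce, assuming a simple (subject, predicate, object-uid) triple layout: triples are pivoted into one row per subject, with each edge predicate holding its list of target uids. The predicate names here are made up for illustration:

```python
from collections import defaultdict

def to_wide_edge_rows(triples):
    """Pivot (subject, predicate, object-uid) triples into one row per
    subject, where each edge-predicate column holds a list of target uids."""
    rows = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        rows[subject][predicate].append(obj)
    # freeze the nested defaultdicts into plain dicts
    return {s: dict(preds) for s, preds in rows.items()}

triples = [
    (0x1, "knows", 0x2),
    (0x1, "knows", 0x3),   # same predicate again -> list grows
    (0x2, "worksAt", 0x4),
]
rows = to_wide_edge_rows(triples)
# rows[0x1]["knows"] holds both target uids
```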
Check with Dgraph devs if they could add an operation to GraphQL that provides a sample of the uids that match a query. Retrieving only every `N`-th uid could...
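The sampling itself is trivial once such an operation exists; as a sketch, taking every `N`-th uid from an ordered result yields a roughly `1/N` sample that could, for instance, feed uid-range estimation for partitioning:

```python
def every_nth_uid(uids, n):
    """Return every n-th uid from an ordered uid list (a ~1/n sample).
    Slicing with a stride keeps the first uid of each n-sized block."""
    return uids[::n]

# 100 uids sampled with n=10 leaves 10 evenly spaced uids
sample = every_nth_uid(list(range(1, 101)), 10)
```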
Parsing the JSON result may return null values where required JSON members are expected (e.g. see commit e35680d). Guard against this.
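A defensive parsing sketch: the response member names (`result`, `uid`) are assumptions for illustration, but the pattern is the point — treat every required member as possibly `null` or absent instead of dereferencing it directly:

```python
import json

def parse_uids(payload):
    """Parse a JSON query response defensively: members that should be
    present may be null or missing, so guard rather than assume."""
    doc = json.loads(payload)
    result = doc.get("result") or []   # both missing and null become []
    uids = []
    for node in result:
        uid = node.get("uid")
        if uid is None:                # skip nodes without a uid
            continue
        uids.append(uid)
    return uids
```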
The wide node table schema uses predicate names as columns, allowing injection of arbitrary strings into column names. This should be reviewed and guarded against. For instance, a predicate `subject`...
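One possible guard, shown as a sketch: whitelist the characters allowed in a column name and replace everything else. Whether to sanitize, quote, or reject outright is a design choice the review would have to settle; the allowed character set below is an assumption:

```python
import re

# Assumed whitelist: letters, digits, underscore and dot (for dgraph.type etc.)
DISALLOWED = re.compile(r"[^A-Za-z0-9_.]")

def sanitize_column(predicate):
    """Replace characters outside the whitelist so a predicate name cannot
    smuggle quoting or control characters into a column name. A real
    implementation might instead reject or backtick-quote such names."""
    return DISALLOWED.sub("_", predicate)
```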
Selecting a column whose name contains a `.` (dot) confuses Spark: `df.select($"dgraph.type")` throws this exception: cannot resolve '`dgraph.type`' given input columns: [dgraph.graphql.schema, dgraph.type, ...] The reason is that `dgraph`...
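Spark resolves a dotted name as struct-field access unless the whole name is backtick-quoted. A small helper sketch (doubling embedded backticks, which to my understanding follows Spark SQL's quoted-identifier convention):

```python
def quote_column(name):
    """Wrap a column name in backticks so Spark treats 'dgraph.type' as a
    single column rather than field 'type' of struct 'dgraph'; embedded
    backticks are escaped by doubling them."""
    return "`" + name.replace("`", "``") + "`"

# df.select(quote_column("dgraph.type")) would then resolve the column as-is
quoted = quote_column("dgraph.type")
```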
Add a data source that does not read the actual data but provides performance metrics. Each partition sends a query to the Dgraph cluster and, besides the data, also retrieves...
With the performance data source #10 we can measure various metrics on the partition level. Create some benchmarks running the following queries against a Dgraph cluster. The cluster needs to...