How does Morpheus store the graph on HDFS?

Open kant111 opened this issue 4 years ago • 4 comments

How does Morpheus store the graph on HDFS? Does it lay it out the same way Neo4j would on disk?

I am just trying to figure out how to store and retrieve a graph from HDFS.

kant111 avatar Oct 01 '19 00:10 kant111

In Morpheus, all file-system-based data sources (CSV, Parquet, ORC) store graphs in a well-known directory structure:

└── <graphName>
    ├── capsGraphMetaData.json (generated, contains some version info)
    ├── propertyGraphSchema.json (generated, required to read from disk)
    ├── nodes
    │   ├── <Label_A>
    │   │   └── table.[csv|orc|parquet]
    │   └── <Label_B>
    │       └── table.[csv|orc|parquet]
    └── relationships
        └── <RelType_A>
            └── table.[csv|orc|parquet]

For CSV, you can find an example here: https://github.com/opencypher/morpheus/tree/master/morpheus-examples/src/main/resources/fs-graphsource/csv/products
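
A rough Scala sketch of reading such a graph, based on the linked example (the namespace, rootPath and graph name below are placeholders, and exact package/method names may differ between Morpheus versions):

import org.opencypher.morpheus.api.{GraphSources, MorpheusSession}
import org.opencypher.okapi.api.graph.Namespace

// Local session; on a cluster you would build it from an existing SparkSession.
implicit val morpheus: MorpheusSession = MorpheusSession.local()

// Register an FS-based graph source whose rootPath contains the <graphName> directories.
// The rootPath can be a local folder or, e.g., an HDFS URI such as "hdfs:///graphs".
morpheus.registerSource(Namespace("myCsv"), GraphSources.fs(rootPath = "hdfs:///graphs").csv)

// Load the graph stored under <rootPath>/products and run Cypher on it.
val products = morpheus.catalog.graph("myCsv.products")
products.cypher("MATCH (n) RETURN count(*) AS nodeCount").show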

s1ck avatar Oct 01 '19 15:10 s1ck

Thanks a lot! That example is great.

It looks to me like relationships are stored in a table with source and destination columns, while Neo4j stores a node's connected edges as a doubly-linked list for efficient retrieval (avoiding all the joins required for queries like "given a node, get all its children", or more generally "get all connected vertices"). I wonder if something like that is possible with HDFS?

If relationships are stored in a table with source and destination columns, then queries like "given a node, get the entire subgraph up to, say, 10 levels deep" would require a lot of joins, wouldn't they? Sure, that is parallelizable, but distributed joins can incur a lot of cost depending on how much data needs to be moved. At the very least this is not something we can use for OLTP, is it? It looks more suited to OLAP?
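
To make my concern concrete, here is a minimal plain-Spark sketch (not the Morpheus API; the path and the source/target column names are made up) of how a fixed-depth traversal turns into repeated joins:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("khop-sketch").getOrCreate()
import spark.implicits._

// Hypothetical relationship table with "source" and "target" id columns,
// read from a per-type directory like the one shown above.
val rels: DataFrame = spark.read.parquet("hdfs:///graphs/products/relationships/BOUGHT/table.parquet")

// Expand k hops from a set of start node ids: every hop is another join
// (plus a shuffle) over the relationship table.
def expand(start: DataFrame, rels: DataFrame, hops: Int): DataFrame = {
  var frontier = start.select(col("id"))
  for (_ <- 1 to hops) {
    frontier = frontier.alias("f")
      .join(rels.alias("r"), col("f.id") === col("r.source"))
      .select(col("r.target").as("id"))
      .distinct()
  }
  frontier
}

// "10 levels deep" means 10 joins, which is why this feels like OLAP rather than OLTP.
val reachable = expand(Seq(42L).toDF("id"), rels, hops = 10)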

kant111 avatar Oct 01 '19 18:10 kant111

You're right. The storage layout is very different from Neo4j's. With Morpheus, we had a relational abstraction in mind. As you correctly highlighted, this requires two join operations for a 1-hop traversal ((:A)-[:B]->(:C)), which - for deep traversals / complex patterns - might end up producing large intermediate results. Storing adjacency list structures in a schema-fixed tabular representation is a possibility and would potentially save us one join per hop. However, in Spark you cannot just do pointer chasing across DataFrames; each "hop" from one DataFrame (adjacency list / rels) to another (nodes / intermediate result) requires a join.
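
To illustrate that last point, here is a minimal plain-Spark sketch (toy data, made-up column names, not the Morpheus layout): even with neighbour ids stored inline as an array column, expanding a hop is still an explode plus a join back into the node table, not a pointer dereference:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("adjacency-sketch").getOrCreate()
import spark.implicits._

// Toy node table that carries its outgoing neighbour ids inline as an array column.
val nodes = Seq(
  (1L, "A", Seq(2L, 3L)),
  (2L, "B", Seq(3L)),
  (3L, "C", Seq.empty[Long])
).toDF("id", "label", "neighbours")

// One hop: exploding the adjacency array saves the join against a separate
// relationship table, but resolving the neighbours' properties still needs a
// join back into the node table.
val oneHop = nodes.alias("src")
  .select(col("src.id").as("srcId"), explode(col("src.neighbours")).as("dstId"))
  .join(nodes.alias("dst"), col("dstId") === col("dst.id"))
  .select(col("srcId"), col("dstId"), col("dst.label").as("dstLabel"))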

Morpheus - in contrast to Neo4j - has a focus on global queries (OLAP), data integration (Property Graph Data Sources) and handling multiple graphs (Cypher 10 features + graph catalog).
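
For what it's worth, a tiny sketch of the catalog / multiple-graph side (graph name, labels and property keys are made up; the FROM GRAPH syntax follows the Morpheus multiple-graph examples and may differ between versions):

// Assumes the session and graph source registered in the earlier sketch.
val result = morpheus.cypher(
  """|FROM GRAPH myCsv.products
     |MATCH (c:Customer)-[:BOUGHT]->(p:Product)
     |RETURN c.name, p.title
  """.stripMargin)
result.show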

There are some optimizations, like indexed DataFrames and multi-way join algorithms, that would help us a lot; however, they are not part of Spark (yet).

s1ck avatar Oct 02 '19 06:10 s1ck

Joins on a distributed database look like a well-studied problem, so I am not really sure any join algorithm can be efficient for deep traversals. Do you know of any distributed database that can do this efficiently? It sounds counterintuitive to me that a join can be efficient once the data is distributed across nodes, unless everything is predetermined: join keys, collocation of rows with the same join key, and so on.

If we are stuck with this table approach, maybe nested sets or nested intervals look much better, but I don't know of any open-source or commercial database that has successfully implemented these models.

kant111 avatar Oct 02 '19 07:10 kant111