amundsen RFC/Feature: Nebula Graph as Backend Storage


Similar to https://github.com/amundsen-io/amundsen/issues/526, this proposes adding backend support for Nebula Graph, an open-source (Apache 2.0), distributed graph database. It stands out for being linearly scalable and cloud native, and it speaks both openCypher and nGQL.

Background

Based on my observation, Nebula Graph is loved and adopted as graph infrastructure by many teams in the community thanks to its excellent OLTP capability on huge data volumes. Meanwhile, many of those teams have independently reinvented the wheel, building their own metadata services and lineage systems on top of Nebula.

Knowing the entropy-increasing effort they spend on modeling metadata, writing hooks for different data sources to wire everything up, and so on (for example, some maintain their own giant fork of Apache Atlas with Nebula Graph as the backend and are basically unable to upstream it), I would like to help bring those efforts together and enable more Nebula users to start managing their metadata without pain.

Then I found Amundsen, an elegant, community-driven, and beloved project (well done!!!), and I have been working to bring Amundsen to the Nebula Graph community.

Expected Behavior or Use Case

It should be the same as it was for Neo4j, AWS Neptune, and Apache Atlas.

Service or Ingestion ETL

Metadata:

Nebula Proxy

Databuilder:

Nebula Extractor
Nebula Search Data Extractor
Nebula CSV Loader
Nebula CSV Publisher
Nebula Serializer
Nebula Sample Data Loader

       ┌────────────────────────┐ ┌────────────────────────────────────────┐
       │                        │ │                                        │
       │ Frontend :5000         │ │ Metadata Sources                       │
       ├────────────────────────┤ │ ┌────────┐ ┌─────────┐ ┌─────────────┐ │
       │ Metaservice :5001      │ │ │        │ │         │ │             │ │
       │ ┌──────────────┐       │ │ │ Foo DB │ │ Bar App │ │ X Dashboard │ │
  ┌────┼─┤ Nebula Proxy │       │ │ │        │ │         │ │             │ │
  │    │ └──────────────┘       │ │ │        │ │         │ │             │ │
  │    │                        │ │ │        │ │         │ │             │ │
  │    ├────────────────────────┤ │ └────────┘ └─────┬───┘ └─────────────┘ │
┌─┼────┤ Searchservice :5002    │ │                  │                     │
│ │    └────────────────────────┘ └──────────────────┼─────────────────────┘
│ │                                                  │
│ │    ┌─────────────────────────────────────────────┼───────────────────────┐
│ │    │                                             │                       │
│ │    │ Databuilder     ┌───────────────────────────┘                       │
│ │    │                 │                                                   │
│ │    │ ┌───────────────▼────────────────┐ ┌──────────────────────────────┐ │
│ │ ┌──┼─► Extractor of Sources           ├─► nebula_search_data_extractor │ │
│ │ │  │ └───────────────┬────────────────┘ └──────────────┬───────────────┘ │
│ │ │  │                 │                                 │                 │
│ │ │  │ ┌───────────────▼────────────────┐ ┌──────────────▼───────────────┐ │
│ │ │  │ │ Loader filesystem_csv_nebula   │ │ Loader Elastic FS loader     │ │
│ │ │  │ └───────────────┬────────────────┘ └──────────────┬───────────────┘ │
│ │ │  │                 │                                 │                 │
│ │ │  │ ┌───────────────▼────────────────┐ ┌──────────────▼───────────────┐ │
│ │ │  │ │ Publisher nebula_csv_publisher │ │ Publisher Elasticsearch      │ │
│ │ │  │ └───────────────┬────────────────┘ └──────────────┬───────────────┘ │
│ │ │  │                 │                                 │                 │
│ │ │  └─────────────────┼─────────────────────────────────┼─────────────────┘
│ │ │                    │                                 │
│ │ │                    │                                 │
│ │ └────────────────┐   │                                 │
│ │                  │   │                                 │
│ │    ┌─────────────┼───►─────────────────────────┐ ┌─────▼─────┐
│ │    │ Nebula Graph│   │                         │ │           │
│ └────┼─────┬───────┴───┼───────────┐     ┌─────┐ │ │           │
│      │     │           │           │     │MetaD│ │ │           │
│      │ ┌───▼──┐    ┌───▼──┐    ┌───▼──┐  └─────┘ │ │           │
│ ┌────┼─►GraphD│    │GraphD│    │GraphD│          │ │           │
│ │    │ └──────┘    └──────┘    └──────┘  ┌─────┐ │ │           │
│ │    │ :9669                             │MetaD│ │ │  Elastic  │
│ │    │ ┌────────┐ ┌────────┐ ┌────────┐  └─────┘ │ │  Search   │
│ │    │ │        │ │        │ │        │          │ │  Cluster  │
│ │    │ │StorageD│ │StorageD│ │StorageD│  ┌─────┐ │ │  :9200    │
│ │    │ │        │ │        │ │        │  │MetaD│ │ │           │
│ │    │ └────────┘ └────────┘ └────────┘  └─────┘ │ │           │
│ │    │                                           │ │           │
│ │    ├───────────────────────────────────────────┤ │           │
│ └────┤ Nebula Studio :7001                       │ │           │
│      └───────────────────────────────────────────┘ └─────▲─────┘
│                                                          │
└──────────────────────────────────────────────────────────┘

Possible Implementation

Because Nebula Graph uses a directed property graph model and supports openCypher, the implementation simply follows what the community has done with the great Neo4j.

The only difference is that Nebula Graph is schema-ful: inserting data before the graph schema is created is rejected. Thus, to decouple Nebula schema creation from the model, my current proposal is to create/alter the Nebula Graph schema when needed in the Nebula CSV Publisher.
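To make the "create/alter schema when needed" idea concrete, here is a minimal sketch of how a publisher could derive nGQL DDL statements from the properties it is about to write. It is not the code in the PR: the function names, the type mapping, and the example tag are hypothetical, and only the `CREATE TAG IF NOT EXISTS` and `ALTER TAG ... ADD` statement forms are taken from nGQL itself.

```python
from typing import Dict, Iterable, Optional

# Assumed mapping from Python value types to nGQL property types
# (illustrative only; a real publisher may need more types).
NGQL_TYPES = {str: "string", int: "int64", float: "double", bool: "bool"}


def create_tag_ddl(tag: str, props: Dict[str, type]) -> str:
    """Build an idempotent CREATE TAG statement for the given properties."""
    cols = ", ".join(f"`{name}` {NGQL_TYPES[t]}" for name, t in props.items())
    return f"CREATE TAG IF NOT EXISTS `{tag}`({cols});"


def alter_tag_ddl(
    tag: str, existing: Iterable[str], props: Dict[str, type]
) -> Optional[str]:
    """Build an ALTER TAG statement adding only the properties that are missing.

    Returns None when the existing schema already covers every property.
    """
    known = set(existing)
    missing = {name: t for name, t in props.items() if name not in known}
    if not missing:
        return None
    cols = ", ".join(f"`{name}` {NGQL_TYPES[t]}" for name, t in missing.items())
    return f"ALTER TAG `{tag}` ADD ({cols});"


# Hypothetical usage: a node CSV for tag "Table" with two properties.
print(create_tag_ddl("Table", {"name": str, "is_view": bool}))
print(alter_tag_ddl("Table", ["name"], {"name": str, "is_view": bool}))
```

Edge types would follow the same pattern with `CREATE EDGE IF NOT EXISTS`. Since schema changes in Nebula Graph take effect asynchronously, a publisher would presumably also need to wait briefly after running the DDL before inserting data.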

My draft PR is here: #1817. It has been tested with Docker Compose and works from the Frontend for all the functions that Neo4j already supports.

My branch 👉🏻: https://github.com/wey-gu/amundsen/tree/amundsen_nebula_graph

docker-compose -f docker-amundsen-nebula.yml build
docker-compose -f docker-amundsen-nebula.yml up -d

# wait for 90 seconds after all containers are up
cd databuilder
python3 -m venv venv
source venv/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt
python3 setup.py install
python3 example/scripts/sample_data_loader_nebula.py

# try to visit this from your browser!
http://localhost:5000/table_detail/gold/hive/test_schema/test_table1
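Instead of waiting a fixed 90 seconds, the sample loader could be gated on a readiness check. The sketch below polls a TCP port until it accepts connections. The ports in the usage comment (9669 for GraphD, 9200 for Elasticsearch) are taken from the diagram above, while the helper name and defaults are hypothetical. An open port only shows that the process is listening, not that the service is fully initialized, so treat this as a rough gate.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout: float = 90.0, interval: float = 1.0
) -> bool:
    """Poll host:port until a TCP connection succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: wait a little and try again.
            time.sleep(interval)
    return False


# Hypothetical usage before running sample_data_loader_nebula.py:
#   wait_for_port("localhost", 9669)  # GraphD
#   wait_for_port("localhost", 9200)  # Elasticsearch
```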

For now, I assume example/scripts/sample_data_loader_nebula.py is what bootstraps the schema for any newly brought-up cluster. Please advise if there is a better solution.
I learned from the documentation and the codebase in order to contribute, so I may have misunderstood some things. Kindly correct/teach me if possible :)

In the coming days I will prepare some articles and videos and explore some real data source pipelines to help people in the Nebula Graph community (for now, most of them are Chinese, as am I, and I am locked down in Shanghai these days T__T).

Could you kindly help with advice/review?

Thanks so much!

Why yet another graph database that speaks Cypher for Amundsen?

Wey: I love Neo4j, too! I just hope that Nebula Graph lovers (who, as far as I know, love Neo4j as well) will have a chance to enjoy Amundsen's amazing offerings on their Nebula Graph clusters.
