Support Analytics Pipelines

Open zprobst opened this issue 1 year ago • 7 comments

Background

Analytics and ML workloads are among the primary use cases for graph databases.

Requirements / Principles

If nodestream supports analytics jobs, that support should follow the same core principles as the rest of nodestream:

Users should be decoupled from the database, allowing them to pick the best database for the job. To that end, the database connector implements the hooks required to run the analysis algorithms.
Users should be able to build jobs declaratively.

Implementation Details

The implementation could follow a design similar to the one migrations is taking. The core framework handles as much as is prudent and defers to the database connector (which can optionally support the feature) to perform the actual work of data analysis. Steps like copy and export, mentioned below, can be implemented using nodestream's existing copy and pipelines features to retrieve and map data.
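
The split described above, where the core framework checks for support and the connector does the work, could be sketched roughly as follows. This is only an illustration of the optional-hook pattern: `AnalyticsHooks`, `ConnectorSketch`, `make_analytics_hooks`, and `run_algorithm` are hypothetical names invented for this sketch, not existing nodestream APIs.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class AnalyticsHooks(ABC):
    """Hypothetical interface a connector implements to opt in to analytics."""

    @abstractmethod
    def run_algorithm(self, name: str, parameters: Dict[str, Any]) -> None:
        """Run one named algorithm inside the database."""


class ConnectorSketch:
    """Hypothetical connector base class. Analytics support is optional."""

    def make_analytics_hooks(self) -> Optional[AnalyticsHooks]:
        # Default: this connector does not support analytics.
        return None


class RecordingHooks(AnalyticsHooks):
    """Stand-in hooks that record calls instead of talking to a database."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def run_algorithm(self, name: str, parameters: Dict[str, Any]) -> None:
        self.calls.append(f"{name} -> {parameters.get('writeProperty')}")


class AnalyticsCapableConnector(ConnectorSketch):
    """A connector that opts in by returning its hooks."""

    def __init__(self) -> None:
        self.hooks = RecordingHooks()

    def make_analytics_hooks(self) -> Optional[AnalyticsHooks]:
        return self.hooks


def run_algorithm(
    connector: ConnectorSketch, name: str, parameters: Dict[str, Any]
) -> None:
    """Core-framework side: check for support, then defer to the connector."""
    hooks = connector.make_analytics_hooks()
    if hooks is None:
        raise NotImplementedError(
            f"{type(connector).__name__} does not support analytics"
        )
    hooks.run_algorithm(name, parameters)
```

Calling `run_algorithm(AnalyticsCapableConnector(), "degreeCentrality", {"writeProperty": "degreeCentrality"})` forwards the call to the connector's hooks, while passing a plain `ConnectorSketch()` fails early with a clear error instead of partway through an analysis.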

Example Project File

# nodestream.yaml
scopes:
  # ... for data pipelines

analyses:
  - analyses/example.yaml

targets:
  anaylitics-graph:
    # ...
  persistent-graph:
    # ...

Example Analysis File

This example analysis copies data from persistent-graph to anaylitics-graph, runs some topological analysis algorithms there, and persists the results back to persistent-graph.

# analyses/example.yaml
phases:
  # Before running the analysis, copy the data into the graph being analyzed.
  # This step copies data from the source target (defined in nodestream.yaml) into that graph.
  # If you run the analysis directly on a persistent graph, you may not need this step.
  - name: Copy Data
    step: copy
    source: persistent-graph
    nodes:
      - Person
    relationships:
      - KNOWS
  
  # Project tells the connector which nodes and relationships to include in the analysis. 
  # For instance, in the case of GDS, this will run a projection. 
  - name: Project Graph
    step: project
    projection:
      nodes:
        - Person
      relationships:
        - KNOWS

  # Next are some example algorithms to run.
  - name: Run Weakly Connected Components
    step: algorithm
    algorithm: weaklyConnectedComponents
    parameters:
      writeProperty: community

  - name: Run Degree Centrality
    step: algorithm
    algorithm: degreeCentrality
    parameters:
      node_types:
        - Person
      relationship_types:
        - KNOWS
      # weightProperty: weight  # optional
      writeProperty: degreeCentrality

  # The export step will export the results of the analysis to the specified target.
  # The target must be specified in nodestream.yaml.
  # Internally, this will build a nodestream pipeline to extract the data from the graph and write it to the target.
  - name: Export Results
    step: export
    target: persistent-graph
    nodes:
      - type: Person
        properties:
          - degreeCentrality
          - community

This analysis can be run with: nodestream analytics run example --target anaylitics-graph
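
To make the phase structure concrete, here is a rough sketch of how a runner might walk the analysis file in order, check that every referenced `source` and `target` is declared in nodestream.yaml, and decide which side handles each step. The `STEP_OWNERS` mapping and `plan_analysis` function are hypothetical; the split of copy/export to the core framework and project/algorithm to the connector is an assumption drawn from the Implementation Details above. The analysis is written as a plain Python dict mirroring the YAML so the sketch has no third-party dependencies.

```python
from typing import Any, Dict, List, Set, Tuple

# Assumed ownership of each step type: the core framework reuses copy and
# pipelines, while the database connector performs the analysis itself.
STEP_OWNERS = {
    "copy": "core",
    "export": "core",
    "project": "connector",
    "algorithm": "connector",
}


def plan_analysis(
    analysis: Dict[str, Any], declared_targets: Set[str]
) -> List[Tuple[str, str, str]]:
    """Return (phase name, step type, owner) in file order, validating each phase."""
    plan = []
    for phase in analysis["phases"]:
        step = phase["step"]
        if step not in STEP_OWNERS:
            raise ValueError(f"Unknown step {step!r} in phase {phase['name']!r}")
        # copy reads from `source` and export writes to `target`;
        # both names must be declared under `targets` in nodestream.yaml.
        for key in ("source", "target"):
            if key in phase and phase[key] not in declared_targets:
                raise ValueError(
                    f"Phase {phase['name']!r} references undeclared target {phase[key]!r}"
                )
        plan.append((phase["name"], step, STEP_OWNERS[step]))
    return plan


# The example analysis file above, expressed as a dict.
EXAMPLE_ANALYSIS = {
    "phases": [
        {"name": "Copy Data", "step": "copy", "source": "persistent-graph",
         "nodes": ["Person"], "relationships": ["KNOWS"]},
        {"name": "Project Graph", "step": "project",
         "projection": {"nodes": ["Person"], "relationships": ["KNOWS"]}},
        {"name": "Run Weakly Connected Components", "step": "algorithm",
         "algorithm": "weaklyConnectedComponents",
         "parameters": {"writeProperty": "community"}},
        {"name": "Run Degree Centrality", "step": "algorithm",
         "algorithm": "degreeCentrality",
         "parameters": {"writeProperty": "degreeCentrality"}},
        {"name": "Export Results", "step": "export", "target": "persistent-graph",
         "nodes": [{"type": "Person",
                    "properties": ["degreeCentrality", "community"]}]},
    ]
}
```

Calling `plan_analysis(EXAMPLE_ANALYSIS, {"anaylitics-graph", "persistent-graph"})` yields the five phases in order. Dropping `persistent-graph` from the declared set raises a `ValueError` on the first phase, which is the kind of early failure a declarative format makes possible.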

Feb 02 '24 16:02 zprobst
