
Add Apache Arrow read/write support

Open · jexp opened this issue 5 years ago · 7 comments

To better interoperate with other frameworks like Spark, Kafka, etc., align the format with Graphistry's.

Consider also streaming the raw format back.
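A rough sketch of the interop win, in Python with pyarrow (the table contents and column names below are made up for illustration): once results are in Arrow's IPC streaming format, any Arrow-aware consumer can pick them up without re-serialization.

```python
import pyarrow as pa

# Build a small table standing in for query results (illustrative data only).
nodes = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "label": pa.array(["Person", "Person", "Movie"]),
    "name": pa.array(["Alice", "Bob", "The Matrix"]),
})

# Serialize to the Arrow IPC streaming format...
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, nodes.schema) as writer:
    writer.write_table(nodes)
buf = sink.getvalue()

# ...and any Arrow-aware framework (pandas, Spark, RAPIDS, Graphistry)
# can read the same buffer back without a custom parser.
reader = pa.ipc.open_stream(buf)
print(reader.read_all().to_pandas())
```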

https://www.kdnuggets.com/2018/01/supercharging-visualization-apache-arrow.html

jexp · Jun 18 '20 06:06

https://labs.graphistry.com/docs/docs_api.html#link

conker84 · Jun 22 '20 09:06

@lmeyerov, can you point @conker84 to the graph-arrow format that you support in Graphistry?

jexp · Jun 22 '20 09:06

Hi @conker84 @jexp, sure thing:

-- Raw Apache Arrow over the REST API: https://hub.graphistry.com/docs/api/#upload2. Conceptually: a JSON bindings file (src: colX, dst: colY, color: colZ, ...), an edges.arrow typed columnar properties table, and an optional nodes.arrow typed columnar properties table (a small sketch of this shape follows below).

-- The PyGraphistry client, which implements graphistry.edges(pd.read_csv('...')).bind(source='colX', destination='colY').plot(): https://github.com/graphistry/pygraphistry/blob/master/graphistry/arrow_uploader.py

FWIW, we are able to move ~100M rows in ~1s via disk -> notebook -> graphistry this way. Being able to do that with Neo4j would be a big enabler! https://www.graphistry.com/blog/graphistry-2-29-5-upload-100x-more-rapids-0-13-learnrapids-com-and-more
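Here's a minimal sketch of that shape in Python with pyarrow + PyGraphistry (the column names src/dst/weight and the register() credentials are placeholders, not a fixed schema):

```python
import pyarrow as pa
import graphistry

# Illustrative edge list; the column names are arbitrary, and the bindings
# below tell Graphistry which ones are structural.
edges = pa.table({
    "src": pa.array([0, 0, 1], type=pa.int64()),
    "dst": pa.array([1, 2, 2], type=pa.int64()),
    "weight": pa.array([0.5, 1.0, 0.25], type=pa.float64()),
})

graphistry.register(api=3, username="...", password="...")  # credentials elided

# Everything not named in the bindings rides along as typed columnar properties.
g = graphistry.edges(edges.to_pandas()).bind(source="src", destination="dst")
g.plot()
```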

lmeyerov · Jun 26 '20 18:06

Also, I believe Mark Quinsland was looking at something similar via Parquet. Arrow is designed more for in-memory + streaming use, so I'd recommend it, but Parquet achieves similar efficiency & standardization objectives, and while worse for in-memory + streaming, it may enable bulk Neo4j import/export (for TB/PB-scale datasets instead of GB/TB-scale). In both cases, our guess was that the Neo4j<>Spark work might provide a bit of a start, with Arrow being a more standard typed columnar format than Spark's internal RDD one.
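To make the trade-off concrete, here's a small pyarrow sketch writing the same table both ways (file names and data are placeholders): Arrow IPC keeps the in-memory layout, so it streams and memory-maps cheaply, while Parquet spends CPU on encoding/compression in exchange for compact at-rest storage.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})

# Arrow IPC file: wire/disk layout matches the in-memory layout,
# which suits streaming and memory-mapping.
with pa.OSFile("export.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Parquet: columnar encoding + compression, better suited to bulk at-rest storage.
pq.write_table(table, "export.parquet", compression="snappy")

# Reading both back.
with pa.OSFile("export.arrow", "rb") as f:
    arrow_table = pa.ipc.open_file(f).read_all()
parquet_table = pq.read_table("export.parquet")
```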

lmeyerov · Jun 26 '20 18:06

@jexp pointed me in this direction... I recently started a project dedicated to experimenting with Arrow Flight support as a plugin in Neo4j.

I'm focused mostly on integration with the GDS in-memory graph, but @conker84, if you're defining a practical way of representing Cypher types like nodes and relationships, maybe I can adopt your work? I haven't put much effort into identifying a generic way to support the heterogeneous Cypher types. (My goal is speed, so I'm focused mostly on the homogeneous structures from GDS.)
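For context, a client against such a Flight endpoint could be as small as the sketch below (the URI and ticket contents are hypothetical; a real server defines its own ticket scheme, e.g. a GDS graph name or a serialized request):

```python
import pyarrow.flight as flight

# Hypothetical Flight endpoint exposed by a Neo4j plugin.
client = flight.FlightClient("grpc://localhost:8491")

# Made-up ticket; the server decides what a ticket encodes.
ticket = flight.Ticket(b"my-gds-graph-projection")

# Stream record batches back and materialize them as one Arrow table.
reader = client.do_get(ticket)
table = reader.read_all()
print(table.schema)
```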

Happy to take the discussion off the issue/ticket.

voutilad · Aug 29 '21 08:08

@voutilad hi, I just got back to work. Please let me know how you want to proceed on this; I'm happy to set up a call! I'll ping you on Slack.

conker84 · Aug 30 '21 09:08

I’m off until next week (back 7 September). Let’s connect next week.

voutilad · Aug 30 '21 09:08

Fixed by https://github.com/neo4j-contrib/neo4j-apoc-procedures/pull/1859
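For anyone landing here later, here's a rough sketch of calling the export from Python via the official Neo4j driver; the procedure name, signature, and YIELD columns shown (apoc.export.arrow.query yielding file and rows) may differ from what the PR actually adds, so check the APOC docs for the current form:

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        "CALL apoc.export.arrow.query($file, $query, {}) "
        "YIELD file, rows RETURN file, rows",
        file="movies.arrow",
        query="MATCH (m:Movie) RETURN m.title AS title, m.released AS released",
    )
    print(result.single())

driver.close()
```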

vga91 · Oct 05 '22 16:10