batch-import
Configurable relationship discovery
In the case of importing several node CSVs (with e.g. 1 CSV <-> 1 table "dump"), it would be extra nice to have batch-import figure out what the relationships are.
Manually specifying ALL relationships between nodes created from different tables is quite error-prone.
What about a new parameter, like the last one in this configuration example:
batch_import.nodes_files=file1.csv,file2.csv,file3.csv
batch_import.discoverable_links=file1.csv ref_column file2.csv id,file2.csv ref_column file3.csv id
That means: automatically create relationships between nodes where:
- (from file1.csv) ref_column equals the id of file2.csv nodes
- (from file2.csv) ref_column equals the id of file3.csv nodes
I guess these attributes could be removed from the nodes, and the name of the ref_column would become the name of the relationship.
If one of the files or columns does not exist, an error would be thrown.
If batch_import.rels_files is set, then it takes precedence over this new property.
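To make the proposal above concrete, here is a minimal Python sketch of how the value of the proposed batch_import.discoverable_links property could be parsed and applied. It is not batch-import code: the names LinkSpec, parse_discoverable_links and discover_relationships are hypothetical, the tables are plain in-memory lists of rows, and skipping rows whose reference matches no id is an assumption, since the proposal does not say what should happen in that case.

```python
from typing import NamedTuple


class LinkSpec(NamedTuple):
    """One entry of the proposed discoverable_links property (hypothetical type)."""
    source_file: str
    ref_column: str
    target_file: str
    id_column: str


def parse_discoverable_links(value):
    """Split 'fileA colA fileB colB,fileB colB fileC colC' into LinkSpec entries."""
    specs = []
    for entry in value.split(","):
        parts = entry.split()
        if len(parts) != 4:
            raise ValueError(f"expected 4 tokens per link, got: {entry!r}")
        specs.append(LinkSpec(*parts))
    return specs


def discover_relationships(tables, specs):
    """Return (source_file, source_row, rel_type, target_file, target_row) tuples.

    `tables` maps a file name to a list of dict rows, standing in for parsed CSVs.
    """
    rels = []
    for spec in specs:
        # An unknown file or column is an error, as the proposal requires.
        for name, column in ((spec.source_file, spec.ref_column),
                             (spec.target_file, spec.id_column)):
            if name not in tables:
                raise ValueError(f"unknown file: {name}")
            rows = tables[name]
            if rows and column not in rows[0]:
                raise ValueError(f"unknown column {column!r} in {name}")

        # Index the target rows by id so each lookup is a dictionary access.
        target_index = {row[spec.id_column]: i
                        for i, row in enumerate(tables[spec.target_file])}
        for i, row in enumerate(tables[spec.source_file]):
            j = target_index.get(row[spec.ref_column])
            if j is not None:  # assumption: unmatched references are skipped
                rels.append((spec.source_file, i, spec.ref_column,
                             spec.target_file, j))
    return rels
```

The relationship type is taken from the name of the ref_column, as suggested above. Removing the reference attributes from the nodes afterwards is left out of the sketch.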
What do you think about it?
This will be backwards compatible and I can develop it soon (as I actually need it).
I need this too! Does it exist?
Could you provide a concrete example?
The problem I see with that is that people just recreate the relational database structure in the graph and don't think about creating a graph model that is actually better suited for their use cases.
I do have something in the works that delivers something like a "relationship discovery".
I am currently not sure whether there is a broader interest in this topic, so I am not sure whether to make it open source.
I would like to discuss that.
In my opinion the batch-importer is designed for initial load. The design is for massive bulk load, optimized for very high performance and high data volumes.
My design is comparable to an ETL tool - simple - but specialized for semantic structures, data blending and so on. The current design is suitable for <20k nodes with 50 attributes each.
Any suggestions? I would be quite interested to hear about your needs, wishes, requirements, ideas etc.
On 11.04.2014, at 22:02, Michael Hunger [email protected] wrote:
Could you provide a concrete example?
The problem I see with that, is that people just recreate the relational database structure in the graph and don't think about creating a graph model that is actually better suited for their use cases.
I would use that.
For example, I have a case study where I have a set of identifiers for people in one column, and another set for recommenders in another column. Most of the people in the first column have at least one recommender in the second column. Some of the recommenders also recommended other people, and some of the people also recommended other people.
I would like to know how far these connections go, i.e. someone -> [:recommends] -> someone else -> [:recommends] -> someone else etc.
My data set is a small (~2100 rows) CSV file, and I only know how to do part of this with a relational database, but it seems like the perfect problem for a graph database, no?
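As a rough illustration of the recommender case, the following Python sketch reads a two-column CSV and measures how far a chain of recommendations reaches from a given person. The column names person and recommender, the sample rows, and the function names are all hypothetical. It works in memory rather than in a graph database, which is reasonable only because the file is small.

```python
import csv
import io
from collections import defaultdict, deque

# Illustrative sample only: each row says `recommender` recommended `person`.
SAMPLE_CSV = """person,recommender
bob,alice
carol,bob
dave,carol
erin,
"""


def build_recommends(csv_text):
    """Map each recommender to the set of people they recommended."""
    graph = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["recommender"]:  # people without a recommender add no edge
            graph[row["recommender"]].add(row["person"])
    return graph


def chain_depth(graph, start):
    """Hops from `start` to the farthest person reachable via recommendations.

    Breadth-first search; the `seen` set guards against recommendation cycles.
    """
    seen = {start}
    queue = deque([(start, 0)])
    deepest = 0
    while queue:
        node, depth = queue.popleft()
        deepest = max(deepest, depth)
        for recommended in graph.get(node, ()):
            if recommended not in seen:
                seen.add(recommended)
                queue.append((recommended, depth + 1))
    return deepest


graph = build_recommends(SAMPLE_CSV)
alice_depth = chain_depth(graph, "alice")  # alice -> bob -> carol -> dave
```

In a graph database the same question is a variable-length path query over the recommends relationship, so the sketch mainly shows that the data fits a graph model with one node per identifier and one relationship per row.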
In a different kind of example, I would like to improve safe bicycle route recommendations in my city. So if I could input intersections (nodes) and descriptions of each of the streets that join them (which would become the relationships), it would be something like location -> [:direction] -> location -> [:direction] -> location. The attributes would be things like starting/ending latitude & longitude, how steep the street is, whether there is a dedicated bike lane or not, etc.
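The bicycle case can be sketched in the same way: intersections are nodes, and each street is a directed relationship carrying attributes. The Python below is only an illustration under stated assumptions. The Street fields, the cost function and its weights (a steeper grade and a missing bike lane both make a street more expensive) are invented for the example, and the search is a plain Dijkstra shortest-path search over that cost, not anything provided by batch-import.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Street:
    """A directed relationship between two intersections (hypothetical fields)."""
    start: str
    end: str
    length_m: float
    grade_percent: float
    bike_lane: bool


def cost(street):
    """Hypothetical weighting: penalise uphill grades and missing bike lanes."""
    penalty = 1.0 + max(street.grade_percent, 0.0) / 10.0
    if not street.bike_lane:
        penalty *= 2.0
    return street.length_m * penalty


def cheapest_route(streets, source, target):
    """Dijkstra over `cost`; returns a list of intersections, or None."""
    adjacency = {}
    for street in streets:
        adjacency.setdefault(street.start, []).append(street)

    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        for street in adjacency.get(node, []):
            candidate = dist + cost(street)
            if candidate < best.get(street.end, float("inf")):
                best[street.end] = candidate
                previous[street.end] = node
                heapq.heappush(heap, (candidate, street.end))

    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return list(reversed(path))
```

For import purposes, each Street would correspond to one relationship row and each distinct start or end value to one node row.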
Can your tool handle either or both of those types of situations?