amazon-neptune-tools
amazon-neptune-tools copied to clipboard
[csv-to-neptune-bulk-format] : Multiple relational tables as source CSV files
Hi,
How will this work with multiple source CSV files with foreign key columns to join on? Taking the Spotify dataset as e.g., suppose you have source_track.csv that contains [track_ID, track_name, track_artist, album_ID] and another CSV file source_album.csv that contains [album_ID, album_name, album_release_date]. Now if we want to create track node CSV file, album node CSV file and track->album edge CSV file, including the given properties on both track nodes and album nodes, how will that be possible?
In other words, can we do what is described here?
Thanks in advance.
This can be accomplished by adding multiple files into the fileNames
portion of the JSON. For example here is what this might look like for the example you provided:
{
"source_folder": "source",
"data_folder": "data",
"fileNames": [
"track.csv",
"album.csv"
],
"nodes": [
{
"csvFileName": "Track.csv",
"select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != ''",
"id": "uuid()",
"label": "'Track'",
"uniqueKey": "row['track_artist'] + '-' + row['track_name']",
"properties": [
{
"property": "track_name",
"key": "track_name"
},
{
"property": "track_artist",
"key": "track_artist"
},
{
"property": "track_popularity",
"key": "track_popularity"
},
{
"property": "danceability",
"key": "danceability"
},
{
"property": "energy",
"key": "energy"
},
{
"property": "key",
"key": "key"
},
{
"property": "loudness",
"key": "loudness"
},
{
"property": "mode",
"key": "mode"
},
{
"property": "speechiness",
"key": "speechiness"
},
{
"property": "acousticness",
"key": "acousticness"
},
{
"property": "instrumentalness",
"key": "instrumentalness"
},
{
"property": "liveness",
"key": "liveness"
},
{
"property": "valence",
"key": "valence"
},
{
"property": "tempo",
"key": "tempo"
},
{
"property": "duration_ms",
"key": "duration_ms"
}
]
},
{
"csvFileName": "Album.csv",
"select": "'track_album_name' in row and row['track_album_name'] != '' and 'track_album_id' in row and row['track_album_id'] != ''",
"id": "uuid()",
"label": "'Album'",
"uniqueKey": "row['track_album_id']",
"properties": [
{
"property": "name",
"value": "row['track_album_name']"
},
{
"property": "album_release_date",
"value": "row['track_album_release_date']"
}
]
}
],
"edges": [
{
"csvFileName": "Track_Album_Edges.csv",
"select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != '' and 'track_album_id' in row and row['track_album_id'] != ''",
"id": "uuid()",
"label": "'IN_ALBUM'",
"from": "row['track_artist'] + '-' + row['track_name']",
"to": "row['track_album_id']",
"fromLabel": "'Track'",
"toLabel": "'Album'",
"properties": []
}
]
}
Thanks @bechbd.
However, when you do:
{ "csvFileName": "Album.csv", "select": "'track_album_name' in row and row['track_album_name'] != '' and 'track_album_id' in row and row['track_album_id'] != ''", "id": "uuid()", "label": "'Album'", "uniqueKey": "row['track_album_id']", "properties": [ { "property": "name", "value": "row['track_album_name']" }, { "property": "album_release_date", "value": "row['track_album_release_date']" } ] }
the alum_id, album_name, album_release_date come from the original album.csv. Now when you go ahead and try to make edges, i.e.,
{ "csvFileName": "Track_Album_Edges.csv", "select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != '' and 'track_album_id' in row and row['track_album_id'] != ''", "id": "uuid()", "label": "'IN_ALBUM'", "from": "row['track_artist'] + '-' + row['track_name']", "to": "row['track_album_id']", "fromLabel": "'Track'", "toLabel": "'Album'", "properties": [] }
it will produce an empty edge file because there has been no mapping between tracks and albums.
I think that this might be due to the configuration of your input files. When I ran this using a track CSV with the following header the edges were created correctly.