amazon-neptune-tools icon indicating copy to clipboard operation
amazon-neptune-tools copied to clipboard

[csv-to-neptune-bulk-format] : Multiple relational tables as source CSV files

Open Anita-Lavania opened this issue 2 years ago • 3 comments

Hi,

How will this work with multiple source CSV files with foreign key columns to join on? Taking the Spotify dataset as e.g., suppose you have source_track.csv that contains [track_ID, track_name, track_artist, album_ID] and another CSV file source_album.csv that contains [album_ID, album_name, album_release_date]. Now if we want to create track node CSV file, album node CSV file and track->album edge CSV file, including the given properties on both track nodes and album nodes, how will that be possible?

In other words, can we do what is described here?

Thanks in advance.

Anita-Lavania avatar Dec 29 '22 13:12 Anita-Lavania

This can be accomplished by adding multiple files into the fileNames portion of the JSON. For example here is what this might look like for the example you provided:

{
    "source_folder": "source",
    "data_folder": "data",
    "fileNames": [
        "track.csv",        
        "album.csv"
    ],
    "nodes": [
        {
            "csvFileName": "Track.csv",
            "select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != ''",
            "id": "uuid()",
            "label": "'Track'",
            "uniqueKey": "row['track_artist'] + '-' + row['track_name']",
            "properties": [
                {
                    "property": "track_name",
                    "key": "track_name"
                },
                {
                    "property": "track_artist",
                    "key": "track_artist"
                },
                {
                    "property": "track_popularity",
                    "key": "track_popularity"
                },
                {
                    "property": "danceability",
                    "key": "danceability"
                },
                {
                    "property": "energy",
                    "key": "energy"
                },
                {
                    "property": "key",
                    "key": "key"
                },
                {
                    "property": "loudness",
                    "key": "loudness"
                },
                
                {
                    "property": "mode",
                    "key": "mode"
                },
                {
                    "property": "speechiness",
                    "key": "speechiness"
                },
                {
                    "property": "acousticness",
                    "key": "acousticness"
                },
                {
                    "property": "instrumentalness",
                    "key": "instrumentalness"
                },
                {
                    "property": "liveness",
                    "key": "liveness"
                },
                {
                    "property": "valence",
                    "key": "valence"
                },
                {
                    "property": "tempo",
                    "key": "tempo"
                },
                {
                    "property": "duration_ms",
                    "key": "duration_ms"
                }
            ]
        },
        {
            "csvFileName": "Album.csv",
            "select": "'track_album_name' in row and row['track_album_name'] != '' and 'track_album_id' in row and row['track_album_id'] != ''",
            "id": "uuid()",
            "label": "'Album'",
            "uniqueKey": "row['track_album_id']",
            "properties": [
                {
                    "property": "name",
                    "value": "row['track_album_name']"
                },
                {
                    "property": "album_release_date",
                    "value": "row['track_album_release_date']"
                }
            ]
        }
    ],
    "edges": [
        {
            "csvFileName": "Track_Album_Edges.csv",
            "select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != '' and 'track_album_id' in row and row['track_album_id'] != ''",
            "id": "uuid()",
            "label": "'IN_ALBUM'",
            "from": "row['track_artist'] + '-' + row['track_name']",
            "to": "row['track_album_id']",
            "fromLabel": "'Track'",
            "toLabel": "'Album'",
            "properties": []
        }
    ]
}

bechbd avatar Jan 03 '23 23:01 bechbd

Thanks @bechbd. However, when you do: { "csvFileName": "Album.csv", "select": "'track_album_name' in row and row['track_album_name'] != '' and 'track_album_id' in row and row['track_album_id'] != ''", "id": "uuid()", "label": "'Album'", "uniqueKey": "row['track_album_id']", "properties": [ { "property": "name", "value": "row['track_album_name']" }, { "property": "album_release_date", "value": "row['track_album_release_date']" } ] }

the alum_id, album_name, album_release_date come from the original album.csv. Now when you go ahead and try to make edges, i.e.,

{ "csvFileName": "Track_Album_Edges.csv", "select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != '' and 'track_album_id' in row and row['track_album_id'] != ''", "id": "uuid()", "label": "'IN_ALBUM'", "from": "row['track_artist'] + '-' + row['track_name']", "to": "row['track_album_id']", "fromLabel": "'Track'", "toLabel": "'Album'", "properties": [] }

it will produce an empty edge file because there has been no mapping between tracks and albums.

Anita-Lavania avatar Jan 04 '23 04:01 Anita-Lavania

I think that this might be due to the configuration of your input files. When I ran this using a track CSV with the following header the edges were created correctly.

image

bechbd avatar Jan 04 '23 18:01 bechbd