aws-sdk-pandas

Add ability to write DF as a bulk load job in Amazon Neptune

Open bechbd opened this issue 2 years ago • 5 comments

Is your idea related to a problem? Please describe.
Add the ability to load or update data from a DataFrame via the bulk loader.

Describe the solution you'd like
The fastest way to load or update data in Neptune is the Bulk Loader. I would like a method in the Neptune integration that takes a DataFrame, writes it out in a file format supported for the data model (CSV for LPG; N-Quads, N-Triples, or Turtle for RDF), and then triggers and monitors a bulk load of those files.
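To make the "write it out to the supported file type" step concrete for the LPG case, here is a minimal sketch of serializing vertex rows into Neptune's Gremlin CSV load format, which uses `~id` and `~label` system columns followed by `name:Type` property columns. It works on plain records such as those returned by `df.to_dict("records")`. The function name, the `id`/`label` column names, and the type inference from the first row are all assumptions made for illustration, not part of awswrangler.

```python
import csv
import io

# Assumed mapping from Python types to Gremlin CSV type suffixes (illustrative subset).
_TYPE_SUFFIX = {int: "Long", float: "Double", str: "String"}


def records_to_gremlin_vertex_csv(records, id_key="id", label_key="label"):
    """Serialize vertex records into Neptune's Gremlin CSV load format.

    Hypothetical helper: the header holds the system columns ~id and ~label,
    followed by one name:Type column per remaining property.
    """
    if not records:
        raise ValueError("records must not be empty")
    first = records[0]
    prop_keys = [k for k in first if k not in (id_key, label_key)]
    # Infer each property's type from the first record only (a simplification).
    header = ["~id", "~label"] + [
        f"{k}:{_TYPE_SUFFIX.get(type(first[k]), 'String')}" for k in prop_keys
    ]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in records:
        writer.writerow([row[id_key], row[label_key]] + [row[k] for k in prop_keys])
    return buffer.getvalue()


# Example rows, shaped like the output of df.to_dict("records").
records = [
    {"id": "v1", "label": "person", "name": "Alice", "age": 30},
    {"id": "v2", "label": "person", "name": "Bob", "age": 25},
]
csv_text = records_to_gremlin_vertex_csv(records)
```

The resulting text would still need to be uploaded to a location the loader can read before a load is requested. Edge files follow the same pattern with `~from` and `~to` columns added.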

bechbd • Apr 11 '22 17:04

Great idea! Where are files usually staged for such an operation? Is it S3 like for Redshift COPY?

jaidisido • Apr 12 '22 09:04

This is something my team would be interested in as well. The Neptune Bulk Loader reads its input from S3: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html

There are basically two APIs:

  1. One to request the load. This returns a load id.
  2. One to check the load status, given a load id.

Based on https://github.com/awslabs/aws-data-wrangler/blob/main/awswrangler/neptune/client.py, I can envision implementing the methods there.

Here's my attempt to test whether this works with an already initialized client, and it does.

Create a load request (a potential implementation for a load method):

data = {
  "source" : "<s3 path>",
  "format" : "nquads",
  "iamRoleArn" : "<role arn>",
  "mode": "AUTO",
  "region" : "us-west-2",
  "failOnError" : "TRUE",
  "parallelism" : "MEDIUM"
}

url = f"https://{client.host}:{client.port}/loader"
req = client._prepare_request("POST", url, data=data)
res = client._http_session.send(req)

Query the load status (a potential implementation for a load_status method):

load_id = res.json()["payload"]["loadId"]
urlStatus = f"https://{client.host}:{client.port}/loader/{load_id}"
reqStatus = client._prepare_request("GET", urlStatus, data="")
resStatus = client._http_session.send(reqStatus)
resStatus.json()
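Since the original request also asks to monitor the load, the status call above could be wrapped in a polling loop. The following is only a sketch: it assumes the status response exposes the state at `payload.overallStatus.status` and that `LOAD_NOT_STARTED`, `LOAD_IN_QUEUE`, and `LOAD_IN_PROGRESS` are the only non-terminal values. The function name `wait_for_load` and its parameters are hypothetical, and the HTTP call is replaced by a stand-in so the example runs without a cluster.

```python
import time

# Statuses treated as "still running"; an assumption based on the loader docs.
PENDING_STATUSES = {"LOAD_NOT_STARTED", "LOAD_IN_QUEUE", "LOAD_IN_PROGRESS"}


def wait_for_load(fetch_status, poll_interval=5.0, max_polls=120, sleep=time.sleep):
    """Poll until the load leaves the pending states, then return the final status.

    fetch_status is any callable returning the parsed JSON of the status request.
    """
    for _ in range(max_polls):
        body = fetch_status()
        status = body["payload"]["overallStatus"]["status"]
        if status not in PENDING_STATUSES:
            return status
        sleep(poll_interval)
    raise TimeoutError(f"load still pending after {max_polls} polls")


# Stand-in for the GET request: replays canned statuses instead of calling Neptune.
_statuses = iter(["LOAD_IN_PROGRESS", "LOAD_IN_PROGRESS", "LOAD_COMPLETED"])


def fake_fetch_status():
    return {"payload": {"overallStatus": {"status": next(_statuses)}}}


final_status = wait_for_load(fake_fetch_status, sleep=lambda _seconds: None)
```

With a real client, `fetch_status` would issue the same GET request as in the snippet above and return its `.json()` result. A caller would likely treat anything other than `LOAD_COMPLETED` as a failure.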

humanzz • May 05 '22 16:05

That makes sense to me 👍

kukushking • May 09 '22 10:05

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] • Jul 08 '22 12:07

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] • Sep 12 '22 12:09