aws-sdk-pandas
Documentation missing information on required schema for dataframe storage in DynamoDB
Is your idea related to a problem? Please describe.
The examples provided for DynamoDB are far too brief to be of actual use. It would be easiest if there were a function such as create_table that could take a dataframe as input and create an appropriate storage table. At minimum, please provide a few examples of dataframes and their corresponding table setups. I get large stack traces that don't help me find the root cause in my data modeling:
Traceback (most recent call last):
  File "/Users/fms/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/fms/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/fms/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/fms/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/Users/fms/build/fatpitch/app.py", line 113, in puts
    wr.dynamodb.put_df(df=df, table_name=USERS_TABLE, boto3_session=boto3_session)
  File "/Users/fms/.pyenv/versions/3.9.6/lib/python3.9/site-packages/awswrangler/dynamodb/_write.py", line 146, in put_df
    put_items(items=items, table_name=table_name, boto3_session=boto3_session)
  File "/Users/fms/.pyenv/versions/3.9.6/lib/python3.9/site-packages/awswrangler/dynamodb/_write.py", line 183, in put_items
    _validate_items(items=items, dynamodb_table=dynamodb_table)
  File "/Users/fms/.pyenv/versions/3.9.6/lib/python3.9/site-packages/awswrangler/dynamodb/_utils.py", line 54, in _validate_items
    raise exceptions.InvalidArgumentValue("All items need to contain the required keys for the table.")
awswrangler.exceptions.InvalidArgumentValue: All items need to contain the required keys for the table.
Describe the solution you'd like
Ideally, create_table and create_table_json functions which would take a dataframe as input and either create the table or give the necessary schema information. Documentation of how dataframes should be laid out, and how the tables can be laid out to correspond with this. A few examples. Useful error messages such as "I didn't find a 'PK' entry in your dataframe, so I can't index it against this DynamoDB table."
The error message means that your df is missing a column that you defined as a key when creating the table. Here's an example including creating the table:
import boto3
import awswrangler as wr
import pandas as pd

# Define df; the "key" column matches the table's hash key below
df = pd.DataFrame({
    "key": [1, 2],
    "value": ["foo", "boo"],
})

# Create table with a numeric hash key named "key"
dynamo = boto3.client("dynamodb")
dynamo.create_table(
    TableName="test",
    KeySchema=[{"AttributeName": "key", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "key", "AttributeType": "N"}],
    ProvisionedThroughput={
        "ReadCapacityUnits": 10,
        "WriteCapacityUnits": 10,
    },
)

# Table creation is asynchronous; wait until the table is ACTIVE before writing
dynamo.get_waiter("table_exists").wait(TableName="test")

# Insert
wr.dynamodb.put_df(df=df, table_name="test")
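To find out which key column is missing before calling put_df, one option is to compare the table's key schema with the DataFrame's columns. The sketch below is not part of awswrangler: `missing_key_columns` and `table_key_schema` are hypothetical helper names, and the only AWS call it relies on is boto3's `describe_table`.

```python
from typing import Dict, Iterable, List


def missing_key_columns(
    columns: Iterable[str], key_schema: List[Dict[str, str]]
) -> List[str]:
    """Return the key attribute names in key_schema that are absent from columns."""
    present = set(columns)
    return [k["AttributeName"] for k in key_schema if k["AttributeName"] not in present]


def table_key_schema(table_name: str) -> List[Dict[str, str]]:
    """Fetch the KeySchema of an existing table (needs AWS credentials)."""
    # Imported here so the pure helper above can be used without boto3 installed.
    import boto3

    client = boto3.client("dynamodb")
    return client.describe_table(TableName=table_name)["Table"]["KeySchema"]


# Offline example using the key schema of the "test" table from this thread.
key_schema = [{"AttributeName": "key", "KeyType": "HASH"}]
print(missing_key_columns(["key", "value"], key_schema))  # []
print(missing_key_columns(["id", "value"], key_schema))   # ['key']
```

Against a real table this would be called as `missing_key_columns(df.columns, table_key_schema("test"))`; a non-empty result names the columns the DataFrame still needs.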
I agree that a create_table function would be convenient, and perhaps wr.dynamodb.put_df and others could even create the table if it doesn't exist. Note that you would still have to define the keys, so there's no way around the slightly painful data modelling stage. Something along the lines of:
wr.dynamodb.create_table(
    table_name="test",
    keys=[{"AttributeName": "key", "KeyType": "HASH"}],
)
wr.dynamodb.put_df(
    df=df,
    table_name="test",
    keys=[{"AttributeName": "key", "KeyType": "HASH"}],  # Create the table with these keys if it doesn't exist
)
wr.dynamodb.put_json
...
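Until something like that exists, the arguments for boto3's `create_table` can be derived from the data once the key columns are chosen. The following is only a rough sketch under stated assumptions: `dynamodb_scalar_type` and `create_table_kwargs` are hypothetical names, not awswrangler functions, the type inference only covers the three types DynamoDB allows for keys (S, N, B), and the throughput values are simply copied from the example earlier in this thread.

```python
from decimal import Decimal
from typing import Any, Dict, Optional, Sequence


def dynamodb_scalar_type(values: Sequence[Any]) -> str:
    """Map sample values to a DynamoDB key attribute type: 'S', 'N' or 'B'."""
    if not values:
        raise ValueError("cannot infer a type from an empty column")
    if all(isinstance(v, str) for v in values):
        return "S"
    if all(isinstance(v, (bytes, bytearray)) for v in values):
        return "B"
    # bool is a subclass of int, but booleans are not valid key types.
    if all(isinstance(v, (int, float, Decimal)) and not isinstance(v, bool) for v in values):
        return "N"
    raise ValueError("key columns must be all strings, all numbers or all bytes")


def create_table_kwargs(
    table_name: str,
    columns: Dict[str, Sequence[Any]],
    hash_key: str,
    range_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's create_table from column data."""
    keys = [(hash_key, "HASH")]
    if range_key is not None:
        keys.append((range_key, "RANGE"))
    for name, _ in keys:
        if name not in columns:
            raise KeyError(f"key column {name!r} not found in the data")
    return {
        "TableName": table_name,
        "KeySchema": [{"AttributeName": n, "KeyType": t} for n, t in keys],
        # Only key attributes are declared; other columns need no schema.
        "AttributeDefinitions": [
            {"AttributeName": n, "AttributeType": dynamodb_scalar_type(columns[n])}
            for n, _ in keys
        ],
        # Assumption: same provisioned capacity as the earlier example.
        "ProvisionedThroughput": {"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
    }


# Same data as the DataFrame above, in the shape returned by df.to_dict("list").
columns = {"key": [1, 2], "value": ["foo", "boo"]}
kwargs = create_table_kwargs("test", columns, hash_key="key")
print(kwargs["AttributeDefinitions"])  # [{'AttributeName': 'key', 'AttributeType': 'N'}]
```

The result can be passed on as `boto3.client("dynamodb").create_table(**kwargs)`. Choosing which column is the hash key (and whether there is a range key) still has to be done by hand, which is the data modelling step mentioned above.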
This issue requires triage and should be assigned.
Is help still needed for that?
@snikolakis New methods for dynamo mentioned by @kukushking above are not implemented or on the roadmap at this time, so any contributions would be welcomed!