dynamo-pandas
Tables with GSI & LSI?
Hi, firstly this package looks like it could really make my life easier, so thanks for putting the time in! I'm not a DynamoDB expert, so sorry if this is a stupid error on my part. I'm receiving a client error when working with `get_df` on DynamoDB tables that have either a GSI or LSI: "An error occurred (ValidationException) when calling the BatchGetItem operation: The provided key element does not match the schema"
Following your examples, it's working for all tables that don't have a GSI or LSI. Should I be using a different "keys" / query structure for those tables?
Hi @SecretSquirrel-origami, Thanks for the feedback.
I haven't tried using tables with GSI or LSI yet. I'll try to reproduce the issue when I get a minute.
After review of the boto3 DynamoDB docs, it appears that the DynamoDB resource.batch_get_item function used by the get_df function does not support GSI or LSI.
The client.query, client.scan, table.query and table.scan functions support the IndexName parameter required to specify an alternative index.
If that helps, you can use the boto3 functions directly to query your GSI or LSI and still benefit from the data type conversion classes in the dynamo_pandas.serde module, as in the example below.
First, generate a table with a Global Secondary Index, using data from the elections dataset included in the plotly library:
```python
import boto3
import botocore
import pandas as pd
import plotly

import dynamo_pandas as dp

# Load the dataset
df = plotly.data.election()

# Create the table
table_name = "elections"

ddb_client = boto3.client("dynamodb")

ddb_client.create_table(
    TableName=table_name,
    AttributeDefinitions=[
        {"AttributeName": "district", "AttributeType": "S"},
        {"AttributeName": "winner", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "district", "KeyType": "HASH"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "winner",
            "KeySchema": [
                {"AttributeName": "winner", "KeyType": "HASH"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 5,
                "WriteCapacityUnits": 5,
            },
        },
    ],
    ProvisionedThroughput={
        "ReadCapacityUnits": 5,
        "WriteCapacityUnits": 5,
    },
)

# Put the data in the table
dp.put_df(df, table=table_name)
```
Next, query the Global Secondary Index:
```python
from boto3.dynamodb.conditions import Key

ddb_resource = boto3.resource("dynamodb")
table = ddb_resource.Table(table_name)

items = table.query(
    IndexName="winner",
    KeyConditionExpression=Key("winner").eq("Joly"),
)["Items"]
```

This will return items with numerical values as `Decimal` type which, when loaded into a pandas dataframe, have `object` dtype instead of integers / floats:

```python
[{'winner': 'Joly',
  'Joly': Decimal('1908'),
  'district_id': Decimal('43'),
  'district': '43-Fort-Rolland',
  'total': Decimal('4438'),
  'result': 'plurality',
  'Coderre': Decimal('1325'),
  'Bergeron': Decimal('1205')}]
```
Finally, convert the data types using the `dynamo_pandas.serde` serialization/deserialization classes and load the result into a pandas dataframe:

```python
from dynamo_pandas.serde import TypeSerializer, TypeDeserializer

ts = TypeSerializer()
td = TypeDeserializer()

items = td.deserialize(ts.serialize(items))

new_df = pd.DataFrame(items)
```

`new_df` now has `int64` dtype for the numerical columns:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   winner       14 non-null     object
 1   Joly         14 non-null     int64
 2   district_id  14 non-null     int64
 3   district     14 non-null     object
 4   total        14 non-null     int64
 5   result       14 non-null     object
 6   Coderre      14 non-null     int64
 7   Bergeron     14 non-null     int64
dtypes: int64(5), object(3)
memory usage: 1.0+ KB
```
I'll think about how I could implement a new query_df function that would leverage the boto3 table.query function and allow users to specify the index to use (IndexName parameter) and the query parameters (KeyConditionExpression parameter).
This function would have the advantage (over plain boto3 calls) of automatically handling paging and data type conversion of results.
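To illustrate the idea, here is a minimal sketch of what such a `query_df` function could look like (the name, signature and parameters are hypothetical at this point, not part of the library's current API), reusing the `dynamo_pandas.serde` classes from the example above:

```python
import boto3
import pandas as pd

from dynamo_pandas.serde import TypeSerializer, TypeDeserializer


def query_df(*, table, key_condition, index_name=None, boto3_session=None):
    """Hypothetical sketch: query a table (or one of its indexes) and return
    the results as a pandas dataframe, handling paging and type conversion."""
    session = boto3_session if boto3_session is not None else boto3.Session()
    ddb_table = session.resource("dynamodb").Table(table)

    kwargs = {"KeyConditionExpression": key_condition}
    if index_name is not None:
        kwargs["IndexName"] = index_name

    items = []
    while True:
        response = ddb_table.query(**kwargs)
        items.extend(response["Items"])
        # DynamoDB returns LastEvaluatedKey when more results remain (paging).
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

    # Convert Decimal values back to int/float via the serde round-trip shown above.
    ts, td = TypeSerializer(), TypeDeserializer()
    return pd.DataFrame(td.deserialize(ts.serialize(items)))
```

Usage against the example table above could then look like `query_df(table="elections", index_name="winner", key_condition=Key("winner").eq("Joly"))`.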
@SecretSquirrel-origami, let me know if you think this would be useful.
Thanks for taking the time to investigate this further. Your suggested workaround using the boto3 functions and `table.query` is what I had to do, although your use of `dynamo_pandas.serde` is a useful snippet which would also help in the short term.
Regarding a new "query_df" function, yes in my opinion that would be super useful. Thanks again!
I use a DynamoDB sort key for working with time series data, and being able to transparently transfer time series data between a pandas dataframe and DynamoDB would be really useful (and amazing). Is there any way you could implement features that allow get/put using partition + sort key, so that working with DynamoDB time series data would be easier? i.e. query key conditions such as eq | le | lt | ge | gt | begins_with | between.
@chrismrutherford, thanks for the feedback and suggestion. I'll consider including a key_condition parameter in the new query_df function I proposed above.
In the meantime, as I suggested to @SecretSquirrel-origami above, you can still leverage some of the library's functionalities to perform the data types conversions for you.
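For example, a partition key plus sort key range query on a time series table can be expressed with boto3's `Key` conditions and the results converted with the `serde` classes. The table and attribute names below (`sensor_readings`, `sensor_id`, `timestamp`) are made up for illustration:

```python
import boto3
import pandas as pd
from boto3.dynamodb.conditions import Key

from dynamo_pandas.serde import TypeSerializer, TypeDeserializer

# Hypothetical time series table with partition key "sensor_id" and
# sort key "timestamp" (ISO 8601 strings).
table = boto3.resource("dynamodb").Table("sensor_readings")

items = table.query(
    KeyConditionExpression=(
        Key("sensor_id").eq("sensor-001")
        & Key("timestamp").between("2021-06-01T00:00:00", "2021-06-30T23:59:59")
    ),
)["Items"]

# Convert the Decimal values before building the dataframe.
ts, td = TypeSerializer(), TypeDeserializer()
df = pd.DataFrame(td.deserialize(ts.serialize(items)))
```

The other sort key conditions (`lt`, `lte`, `gt`, `gte`, `begins_with`) work the same way.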
In reference to the ValidationException reported by @SecretSquirrel-origami:
I'm receiving a client error when working with 'get_df' on dynamo tables that have either GSI or LSI: "An error occurred (ValidationException) when calling the BatchGetItem operation: The provided key element does not match the schema"
Beyond the addition of a query_df function as discussed above, a few improvements could be made:
- Clarify the documentation of the `get_df`, `transactions.get_item` and `transactions.get_items` functions to indicate that the `keys` or `key` parameters apply only to the primary key.
- Potentially intercept the `ValidationException` and add the information that the `keys` or `key` parameter only works with the primary key (see the sketch below). Also refer to the query function(s) when available.
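As a sketch of the second point, the `ClientError` could be caught around the `batch_get_item` call and re-raised with a more explicit message. The wrapper below is illustrative only, not the library's current implementation:

```python
import botocore.exceptions


def batch_get(client, **kwargs):
    # Illustrative wrapper: catch the ValidationException raised when a key
    # attribute that is not part of the table's primary key is used, and add
    # a hint pointing users to the query function(s).
    try:
        return client.batch_get_item(**kwargs)
    except botocore.exceptions.ClientError as error:
        if error.response["Error"]["Code"] == "ValidationException":
            raise ValueError(
                "The 'keys'/'key' parameter only works with the table's primary "
                "key. To query a GSI or LSI, use a query function instead."
            ) from error
        raise
```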