duckdb_aws
Unnest DynamoDB JSON format
Hello DuckDB team!
I'm not sure if this repo is the best place to discuss this, since it doesn't strictly depend on the AWS SDK. AFAIK, DynamoDB unmarshalling is not available in the C++ SDK at least^1. However, it is quite related to AWS usage in general...
I'm trying to run an ETL job on a DynamoDB table previously exported to S3^2 as NDJSON files. DuckDB will happily read them, but then I'm stuck with the fairly impractical schema generated by DynamoDB.
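For context, the loading step looks roughly like this (just a sketch; the bucket path is a placeholder and the credential setup may differ in your environment):

INSTALL httpfs; LOAD httpfs;
INSTALL aws; LOAD aws;
CALL load_aws_credentials();  -- pick up credentials from the default AWS chain

-- each line of the export is one item in DynamoDB JSON format
CREATE TABLE main AS
SELECT * FROM read_json_auto('s3://my-bucket/my-export/*.json.gz');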
D select PK, SK, __typename, isMarketOpen from main limit 5;
┌────────────────────────────────────────────────────┬──────────────────────────────────────────────┬──────────────────────────────┬──────────────────────┐
│                         PK                         │                      SK                      │          __typename          │     isMarketOpen     │
│                 struct(s varchar)                  │              struct(s varchar)               │      struct(s varchar)       │ struct(bool boolean) │
├────────────────────────────────────────────────────┼──────────────────────────────────────────────┼──────────────────────────────┼──────────────────────┤
│ {'S': OBJECT#e036c748-61bb-4431-96da-eeb340c96ec7} │ {'S': QUOTE#NA9#XETR}                        │ {'S': Quote}                 │ {'BOOL': true}       │
│ {'S': USER#Google_101534313370849411321}           │ {'S': #PROFILE#Google_101534313370849411321} │ {'S': User}                  │                      │
│ {'S': OBJECT#ao_01H9QN66R50VDVAMZ79CXRHNE9}        │ {'S': OBJECT}                                │ {'S': ActivityStreamsObject} │                      │
│ {'S': OBJECT#457acd68-8900-4083-92c7-e3891c39648d} │ {'S': OBJECT}                                │ {'S': Stock}                 │                      │
│ {'S': OBJECT#457acd68-8900-4083-92c7-e3891c39648d} │ {'S': QUOTE#DEFAULT}                         │ {'S': Quote}                 │ {'BOOL': true}       │
└────────────────────────────────────────────────────┴──────────────────────────────────────────────┴──────────────────────────────┴──────────────────────┘
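For top-level scalar attributes, the type wrappers can be unwrapped by hand, e.g. with something like this (untested) sketch:

SELECT
    PK['S']              AS PK,
    SK['S']              AS SK,
    __typename['S']      AS __typename,
    isMarketOpen['BOOL'] AS isMarketOpen
FROM main
LIMIT 5;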
But the export also includes maps, lists, and sets, which would need to be recursively unnested into DuckDB structs and lists to be processed effectively.
AWS Glue supposedly offers a DynamicFrame transformation to do the same^3 (it doesn't actually work when it comes to nested objects, which is where it's needed the most... but that's a different story).
I think this could be an extension function in DuckDB. What do you think? I'd be happy to work on it if you agree on where it should be implemented.
Thanks! Giorgio
Hey @borgoat!
This seems like a useful utility function to me, and it should live in the AWS extension! If you would like to add it, I can definitely review and merge.
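Something along these lines, perhaps (the function name and signature below are entirely made up, just to sketch the shape of it):

-- hypothetical: recursively strip the DynamoDB type wrappers
-- ({'S': ...}, {'N': ...}, {'M': ...}, {'L': ...}, ...) and return plain DuckDB values
SELECT
    from_dynamodb_json(PK)           AS PK,
    from_dynamodb_json(isMarketOpen) AS isMarketOpen
FROM main;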
Hello, any update on this feature? I'm looking to replace PySpark with DuckDB, and one use case would require DuckDB to be able to read the DynamoDB format.
I'm currently working on a DynamoDB extension that will support unnesting of DDB JSON, among other things.
That said, I'd like to gather some requirements. 😎
> AWS Glue supposedly offers a DynamicFrame transformation to do the same^3 (it doesn't actually work when it comes to nested objects, which is where it's needed the most... but that's a different story).
@borgoat Could you please elaborate on Glue's shortcomings? Did you also try simplifyDDBJson?
> Hello, any update on this feature? I'm looking to replace PySpark with DuckDB, and one use case would require DuckDB to be able to read the DynamoDB format.
@NicoBrg-WDG Could you please share your current workflow? In particular, how does PySpark access DynamoDB data in your setup? Via S3 exports?
@mlafeldt that's awesome! Thanks for doing this.
First of all, I didn't know about simplifyDDBJson; I will definitely look into it, thank you for that as well.
Second, our current workflow exports DynamoDB data to S3 using boto3's export_table_to_point_in_time in a Glue job; PySpark then reads the export and applies some transformations that look like boilerplate to me, and I'd love to get rid of that code.
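For reference, the DuckDB side we're hoping for would be roughly the following (a sketch only; paths are placeholders, and I'm assuming the export files wrap each record in an "Item" key as in the DYNAMODB_JSON export format):

-- after INSTALL/LOAD of httpfs + aws and loading credentials, as above
CREATE TABLE items AS
SELECT
    Item['PK']['S'] AS PK,  -- this per-attribute unwrapping is the boilerplate we'd like to drop
    Item['SK']['S'] AS SK
FROM read_json_auto('s3://my-bucket/exports/AWSDynamoDB/*/data/*.json.gz');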