Normaliser mode: more ways to unnest data where max table nesting does not suffice

Open adrianbr opened this issue 1 year ago • 0 comments

Feature description

A user would like some kind of normaliser mode that keeps every list as complex. https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1714461924910179

There have been different requests for other modes (infer type from string), perhaps we can do something general here.

Are you a dlt user?

None

Use case

https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1714461924910179

  2 hours ago
I'm struggling with unnesting some arbitrary JSON-like data in a way that makes sense for us. What I want to avoid is having tables created for nested arrays / lists, but I want to expand all nested objects into columns otherwise. Hence using the max_table_nesting doesn't work.
Is there a simple way to do this that I'm missing?




9 replies

dltHub bot
APP  [2 hours ago](https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1714461924910179)
Thank you for your question! It helps us improve our docs and product.
@adrian
 will respond to you during GMT+1 working hours :raised_hands:


Aksel Stokseth
  2 hours ago
For some added background context:
We're working on a stream of objects with unknown incoming schema and new schemas can arrive at any time. We split these objects into separate tables based on some properties og the objects, but where these nested arrays might arrive is unknown.


adrian
  [2 hours ago](https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1714461929379589?thread_ts=1714461924.910179&cid=C04DQA7JJN6)
maybe something like this before yielding? https://www.geeksforgeeks.org/python-pandas-flatten-nested-json/

GeeksforGeeksGeeksforGeeks
Python Pandas - Flatten nested JSON - GeeksforGeeks
A Computer Science portal for geeks. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions.
Mar 16th, 2022
:eyes:
1



Aksel Stokseth
  1 hour ago
Thanks for the answer! I'm testing out this approach, but it breaks all of my other assumptions about the data since these transformations happen pre-normalization.


adrian
  1 hour ago
that makes sense. unfortunately dlt does not support anything for this case directly so some kind of pre-processing is needed. Is the problem now with just pre processing in general or particularly with the way pandas handles it? for the former i guess you would need to adjust the assumptions, for the latter we could look into some other way to preprocess depending on what the exact issue is?


Aksel Stokseth
  [1 hour ago](https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1714465622706489?thread_ts=1714461924.910179&cid=C04DQA7JJN6)
I wrote my own simple unnester that flattens the JSON, so now I'm just fighting the framework to figure out what I'm doing wrong :sweat_smile:
In general though, having a simple setting for always treating lists as complex would also have solved my use case.
:bulb:
1



Aksel Stokseth
  [1 hour ago](https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1714465822984669?thread_ts=1714461924.910179&cid=C04DQA7JJN6)
Initially thought I'd be able to use preferred_types , but this only matches on exact column names / regexes. (edited) 


Aksel Stokseth
  1 hour ago
I've gotten it to work "as expected" now, but of course it's a bit cumbersome to now have essentially two "normalization engines" in play at the same time.
First normalize the JSON with my homegrown normalizer, then dlt goes to work on the objects again afterwards.


Aksel Stokseth
  45 minutes ago
I guess I can opt out of the dlt one, but ideally I would like dlt to handle all of the normalization :blush:

Proposed solution

No response

Related issues

No response

Apr 30 '24 09:04 adrianbr