transformations tutorial
Description
A first tutorial for our dlt+ Python-based transformations.
Deploy Preview for dlt-hub-docs ready!
| Name | Link |
|---|---|
| Latest commit | ed5cb92e7bbb8b6a4306c4cedb4609001f15e7ce |
| Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/67d813def617900008d166c6 |
| Deploy Preview | https://deploy-preview-2401--dlt-hub-docs.netlify.app |
Wondering why the Python version here is 3.9, although the project-level Python is >= 3.10 (in general, not with regards to this PR 👀):
```toml
[tool.mypy]
python_version = "3.9"
```
general feedback
+ I do like the feature a lot. It's powerful and succinct. `sync_source()` feels like magic! I never want to live without it again.
+ The tutorial arrived swiftly and in neat packaging (just kidding).
+ It was easy to do though, and the transformations interface is pretty clear.
+ It's great that there are many hints about what will be supported or come out in the future (although sometimes it's in a tip, sometimes in the text, sometimes in a bulleted list). Especially the star-schema template sounds like a really cool idea!
+ The SQL vs. Python section is pretty clear to me!
+ Doing the homework was fun! It's nice to accomplish that much with so few lines of code.
- Tying into that: I always think that the better the code snippets illustrating individual concepts tie into one another, the better, kind of like hooking into the power of storytelling (if you get what I mean). (A counterexample is when syncing just a few tables makes no visible change, because the table is already there.) It's not obvious how to do that, but I'd also volunteer to do it.
- I am missing a general introductory definition of what transformations are (my attempt: operations that modify, enrich, or restructure raw data using SQL or Python expressions, acting as close as possible to your data and therefore very fast; wrapped in `dlt_plus.transform` they do ...). Quite possibly it's obvious to dlt users, but I still think that being more self-contained knowledge-wise (vs. requiring much implicit knowledge) is a good value, potentially for newcomers (if we aim to be the de facto standard) and possibly for LLMs as well. One easy sentence would be to mention where they go beyond what `@dlt.transformer` can define (see the sketch after this list for the plain-dlt side of that contrast).
- I found no bugs; everything just worked, except when I tried to pipe a `dlt_plus.transform` into a `dlt.transformer` (when I read that a `dlt_plus.transform` with Python is a resource under the hood, I just had to try it). I failed though... maybe I did something wrong; here is a gist with it: https://gist.github.com/djudjuu/428fb5567d3846d5692213ad8ff665a4
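To make the contrast concrete, here is a minimal sketch of the plain open-source `@dlt.transformer` side of it; the `employees` resource and the uppercase step are made up, not code from the tutorial:

```python
import dlt

# Hypothetical resource standing in for raw data (not from the tutorial).
@dlt.resource
def employees():
    yield [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# A plain dlt.transformer: it is bound to the resource it consumes and reshapes
# rows during extraction, before anything is loaded to the destination.
@dlt.transformer(data_from=employees)
def uppercase_names(items):
    for item in items:
        item["name"] = item["name"].upper()
        yield item

if __name__ == "__main__":
    pipeline = dlt.pipeline("transformer_demo", destination="duckdb", dataset_name="demo")
    print(pipeline.run(uppercase_names))
```

If I read the tutorial right, a Python `dlt_plus.transform` is also a resource under the hood, but it runs on data that is already loaded (e.g. on the cache) and can carry hints like write dispositions; that one-sentence contrast is what I'm missing in the intro.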
- Is the way the transformations are defined clear?
  - Yes, it is super clear.
- Is it clear what lineage does?
  - I thought this was about where a specific column stems from, not which hints it inherits. Therefore, I was a bit confused.
- Is it clear how SQL and Python transformations differ, also from a technical point of view?
  - Yes, it is clear as well.
- Is it clear what the sync source does?
  - Yes.
- Are there any obvious features missing?
  - I don't really have a meaningful input on this.
On homework:
I started doing the homework by defining two pipelines for secure and public data in dlt.yml. Then I was confused as to how datasets are defined, so I went to the source code and was able to do it. Then, for some reason, the cache was created as a duckdb database even though I seemed to have set the destination to filesystem - but I'm pretty sure it was because of some leftover metadata from previous tries. However, I wasn't 100% sure how to manually clean everything up and it just kept persisting. Then I couldn't continue with the sync step, because I realized the idea (presumably) was just to use Python scripts, not necessarily the project yml combined with CI. After this realization, the homework seemed very clear.
Might be a good idea to add very explicit examples for the whole dlt+ docs - but then again, I'm not used to yml projects, which explains my obliviousness.
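Re the homework setup above, a rough plain-Python sketch of the secure/public split I was going for (resource contents, pipeline names, and the local bucket path are all made up, and this skips the dlt.yml/cache side entirely):

```python
import os
import dlt

# Hypothetical split of the homework data into a public and a secure part.
@dlt.resource
def public_employees():
    yield [{"id": 1, "department": "sales"}]

@dlt.resource
def secure_employees():
    yield [{"id": 1, "salary": 100_000}]

# Two separate pipelines so each part lands in its own destination and dataset.
public_pipeline = dlt.pipeline("public_data", destination="duckdb", dataset_name="public")
secure_destination = dlt.destinations.filesystem("file://" + os.path.abspath("_secure_files"))
secure_pipeline = dlt.pipeline("secure_data", destination=secure_destination, dataset_name="secure")

public_pipeline.run(public_employees())
secure_pipeline.run(secure_employees())
```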
Thought: in the docs, we should avoid using the same name for a key and its value. Say in dlt.yml we have `duckdb: duckdb`; it's a bit confusing.
I also need to disclose that I've only done the tutorial and didn't have time to do the homework 😅
What I liked:
- I think the employees example is a very good choice for this tutorial, especially the star schema example and the lineage bit.
- The sync source is really cool! Very often I first load my data locally into duckdb, and then to load it elsewhere I need to re-run the same code again (roughly the pattern sketched below), so the `sync_source` way of doing things would be pretty handy. Is there a reason why it's under `dlt_plus.transform` and not `dlt_plus.sources`?
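To illustrate the re-run pattern I mean, a sketch in plain dlt (resource, pipeline names, and the local bucket path are made up):

```python
import os
import dlt

# Hypothetical resource standing in for whatever I am exploring locally.
@dlt.resource
def events():
    yield [{"id": 1, "kind": "click"}, {"id": 2, "kind": "view"}]

# First pass: load locally into duckdb to poke around.
dlt.pipeline("events_local", destination="duckdb", dataset_name="events").run(events())

# Later: to get the same data elsewhere I re-run the same extraction against a
# second destination, instead of just syncing what was already loaded locally.
remote_like = dlt.destinations.filesystem("file://" + os.path.abspath("_events_files"))
dlt.pipeline("events_remote", destination=remote_like, dataset_name="events").run(events())
```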
What could be better:
- I was able to follow the tutorial without problems, but I did find myself constantly asking what the advantage of doing this is over any of the other ways to transform data in dlt (dbt, `sql_client`; see the sketch after this list). Is it because using `@dlt_plus.transform` lets you do things like specify write dispositions and have your transformations run on the cache? I would re-phrase the tutorial so that it doesn't just focus on how to do the transformation but also on the advantage of doing it this way.
- +1 on Akela's point of adding the output of the code snippets to the documentation to make it more readable. Especially in the star schema example, a visualization of what the output looks like would make it a lot more understandable.
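For context, the `sql_client` route I am comparing against looks roughly like this (pipeline, dataset, and table names are made up, and it assumes an `employees` table was loaded into the dataset earlier):

```python
import dlt

pipeline = dlt.pipeline("employees_demo", destination="duckdb", dataset_name="staging")

# Hand-written SQL via sql_client: it transforms the loaded data, but the result
# gets no write disposition, schema hints, or lineage handling from dlt.
with pipeline.sql_client() as client:
    client.execute_sql(
        "CREATE OR REPLACE TABLE dim_employee AS "
        "SELECT id, name, department FROM employees"
    )
```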
Closing this hackathon; will keep the link to these comments in the transformations ticket for future improvement of the transformation docs!