docs: add advanced project tutorial
Description
This PR adds an advanced documentation page covering dlt+ projects and packaging features.
Deploy Preview for dlt-hub-docs ready!
| Name | Link |
|---|---|
| Latest commit | 2f92c0a0cee7a9f6e65062c2faf005395e11528a |
| Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/682223a873140600088a50e0 |
| Deploy Preview | https://deploy-preview-2338--dlt-hub-docs.netlify.app |
Overall I loved the experience.

What stood out for me:
- Feeling of control: when I work in YAML it feels like I am out of the code world, like I am handing control to a stone tablet, because the YAML is tightly coupled to the app underneath. But by having Python access to the project contents, the YAML acts more like a manifest that some people on the team can maintain, while the rest of us pick up the orchestration from there.

Next steps in my head: figure out how I can get a rudimentary DAG from this so I can turn it into a deployment.
Questions I have that were not answered during the experience:
- What is packaged with respect to data? Any data? Just code?
- Can I get some kind of dependency tree from datasets/pipelines?
This was super easy to use. The experience of using the `current` object is the same as accessing different resources from within a dlt source and making changes to them, so it is quite self-explanatory if one already knows how dlt works. And since it is in Python, it is easier to use than YAML, at least for me; it might be easier for people who are used to editing and creating file structures in YAML. Adding a pipeline/destination was super nice and easy (see the sketch below).
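To make it concrete, here is roughly what that looked like on my end. This is only a sketch: the `from dlt_plus import current` import path and the pipeline/destination names are assumptions on my side and need to match what is declared in the project's `dlt.yml`.

```py
from dlt_plus import current

# instantiate a pipeline declared in dlt.yml by name
pipeline = current.entities().create_pipeline("my_pipeline")

# instantiate one of the project's destinations by name
destination = current.entities().create_destination("duckdb")

print(pipeline.pipeline_name, destination)
```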
I was unsure what I was doing wrong with the catalog commands: I could access the data objects (iterables) but couldn't quite load dataframes from them, so I had to go check whether the runner worked. It did! The data was loaded correctly, I just had trouble loading the dataframes themselves (see the snippet below). I think the storage directory from last time was renamed to `_data`? Thankfully I was able to navigate that bit too, with all the folders that contain info on the pipeline runs. I liked being able to view it right there, as opposed to going to the root `.dlt` folder or simply running the CLI commands.
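For reference, this is roughly the catalog access I was attempting, following the tutorial snippet. The dataset and table names are placeholders, the dataset must already exist physically, and pandas needs to be installed for `.df()` to return dataframes.

```py
from dlt_plus import current

# open a dataset from the project catalog (defaults to the first destination)
dataset = current.catalog().dataset("my_pipeline_dataset")

# row counts of all tables, returned as a dataframe
print(dataset.row_counts().df())

# read a single table into a dataframe (table name is a placeholder)
print(dataset["my_table"].df())
```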
I was unsure how to run a particular source (in the case of multiple sources) when using the runner, because I think it runs all sources mentioned in the YAML file. So I had to make those changes in the YAML file, declaring only the source I wanted to run, and then use the runner. It would be easier if the `run_pipeline()` function could also take a source name alongside a pipeline name, as sketched below.
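To make the suggestion concrete: the first call below is what the tutorial shows today, while the keyword argument in the commented-out line is the hypothetical addition I am asking for and does not exist. How the runner is obtained here is also my assumption.

```py
from dlt_plus import current

runner = current.runner()

# today: runs the pipeline with every source attached to it in dlt.yml
runner.run_pipeline("my_pipeline")

# wished for (hypothetical, not an existing API): run only one named source
# runner.run_pipeline("my_pipeline", source="my_source")
```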
All in all, a super easy experience, especially if we're already used to dlt 🤩
Good overall structure and tutorial. `dataset.write` for some reason didn't work for me. I would also extend the packaged project section with examples of accessing data; this could be useful in notebook contexts, for example something like the sketch below.
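A sketch of what such an example could look like, assuming the packaged project is importable as `my_dlt_project` (as later in this thread) and exposes the same `current` catalog API as the project context; the dataset and table names are placeholders:

```py
# in a notebook cell, with the packaged project installed in the environment
from my_dlt_project import current

# open a dataset from the package's catalog and pull one table as a dataframe
dataset = current.catalog().dataset("my_pipeline_dataset")
df = dataset["my_table"].df()
df.head()
```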
Great experience! Easy to use, everything went smoothly. I love that we can work with Python. At the same time, it feels like we initially tried to avoid Python with YAML, and now we're coming back to it. I personally don't mind, I just found it funny :)
Also, the tutorial is clear and very well written. Every time I had a question, I found the answer in the next section.
Some notes and problems described below:
- Add a CLI example here, because we didn't use datasets with a destination in the basic tutorial and this might be confusing. The tip in question:

  > tip: If you run dataset CLI commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in Python, more on that below.
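  Something along these lines could serve as that example; the dataset name is a placeholder and the CLI command mirrors the one used later in this thread:

  ```py
  # CLI equivalent (no destination given, so the first destination
  # in the dataset's destination list is selected):
  #   dlt dataset my_pipeline_dataset row-counts

  from dlt_plus import current

  # the same default applies in Python: the catalog instantiates the dataset
  # on the first destination in its list
  dataset = current.catalog().dataset("my_pipeline_dataset")
  print(dataset.row_counts().df())
  ```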
- Mention here that the user should run this Python script from the same folder where `dlt.yml` is located: "Run the above with `python do_something.py` and see the output."
- I misunderstood this line: `pipeline = current.entities().create_pipeline("my_pipeline")`. I expected the new pipeline to be created in `dlt.yml`. I'm probably not the only one who will think this way in the future, hehe.
- When I run the pipeline using `runner.run_pipeline("my_pipeline")`, it executes with all destinations in the list: first DuckDB, then BigQuery. Is this the expected behavior? Can we add this to the documentation? From the quote below, it feels like only the first destination should be used:

  > If you run dataset CLI commands without providing a destination name, dlt+ will always select the first destination in the list by default.
- I'm getting an error when using `dataset.write`:

  ```
  File "/Users/alena/dlthub/temp/tutorial/temp.py", line 8, in <module>
    dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
  TypeError: 'ReadableDBAPIRelation' object is not callable
  ```

  I found the `save` method (`dlt_plus.destinations.dataset.WritableDataset.save`) in `dataset`, and it seems to work.
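  For completeness, the workaround looks roughly like this. It is only a sketch of what I tried locally: I'm assuming `save` accepts the same arguments I passed to `write`, and that the dataset comes from the project catalog as in the tutorial.

  ```py
  import pandas as pd
  from dlt_plus import current

  # open the dataset from the project catalog (name as used in the tutorial)
  dataset = current.catalog().dataset("my_pipeline_dataset")

  # dataset.write(...) raised the TypeError above; the save method found on
  # dlt_plus.destinations.dataset.WritableDataset worked for me instead
  # (argument names mirror the failing write call and are an assumption)
  dataset.save(
      pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}),
      table_name="my_table",
  )
  ```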
- Running `dlt project init arrow duckdb --package my_dlt_project` gives me this error:

  ```
  Usage: dlt [-h] [--version] [--disable-telemetry] [--enable-telemetry] [--non-interactive] [--debug]
             {transformation,source,project,profile,pipeline,mcp,license,destination,dbt,dataset,cache,telemetry,schema,init,render-docs,deploy} ...
  dlt: error: unrecognized arguments: my_dlt_project
  ```

  I'm not sure how to make it work. I've tried different combinations, e.g.:

  ```
  (temp) temp ❯ dlt project init arrow duckdb --package
  ERROR: Package creation is not implemented yet, please omit the --package flag.
  NOTE: Please refer to our docs at 'https://dlthub.com/docs/intro' for further assistance.
  ```
In general, the tutorial is pretty straightforward and understandable. But I did hit a few snags:
- When I tried to run the following lines:

  ```py
  # get a dataset instance pointing to the default destination (first in dataset destinations list) and access data inside of it
  # for this to work this dataset must already exist physically
  dataset = current.catalog().dataset("my_pipeline_dataset")

  # get the row counts of all tables in the dataset as a dataframe
  print(dataset.row_counts().df())
  ```

  I got this error:

  ```
  zsh: segmentation fault  python
  ```

  I even got this error when running the CLI command:

  ```
  dlt dataset my_pipeline_dataset row-counts
  ```

  What finally fixed it was installing pandas into my environment. I only tried this because of one of Anton's comments, and the error by itself was not helpful.
- I also didn't fully understand the section "Accessing entities". It reads like you can create new pipelines using the project context in code but it did not work for me. I tried to create a new pipeline `my_pipeline_2`:

  ```py
  # get a pipeline instance
  pipeline = current.entities().create_pipeline("my_pipeline_2")

  # get a destination instance
  destination = current.entities().create_destination("duckdb")
  ```

  and I got the following error:

  ```
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Users/rahul/Desktop/dlt_project_minihackathon/env/lib/python3.11/site-packages/dlt_plus/project/entity_factory.py", line 178, in create_pipeline
      raise ProjectException(
  dlt_plus.project.exceptions.ProjectException: Destination is not defined for pipeline 'my_pipeline_2'
  ```
- I couldn't get the pip import part to work either. I followed the tutorial but I kept hitting the following error when doing `uv run python test_project.py`:

  ```
  Traceback (most recent call last):
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/test_project.py", line 2, in <module>
      from my_dlt_project import current
  ImportError: cannot import name 'current' from 'my_dlt_project' (/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/my_dlt_project/__init__.py)
  ```

  ```
  rahul@dlthubs-MBP dlt_project_minihackathon_import_directory % uv run python test_project.py
  Traceback (most recent call last):
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/test_project.py", line 7, in <module>
      print(my_dlt_project.config().current_profile)
            ^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/my_dlt_project/__init__.py", line 27, in config
      return context().project
             ^^^^^^^^^
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/my_dlt_project/__init__.py", line 21, in context
      return ensure_project(run_dir=os.path.dirname(__file__), profile=access_profile())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/dlt_plus/project/run_context.py", line 383, in ensure_project
      raise ProjectRunContextNotAvailable(run_dir)
  dlt_plus.project.exceptions.ProjectRunContextNotAvailable: Path /Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/my_dlt_project does not belong to dlt project.
  * it does not contain dlt.yml
  * none of parent folders contains dlt.yml
  * it does not contain pyproject.toml which defines a python module with dlt.yml in the root folder
  * it does not contain pyproject.toml which explicitly defines dlt project with `dlt_project` entry point
  Please refer to dlt+ documentation for details.
  ```