
docs: add advanced project tutorial

Open sh-rp opened this issue 10 months ago • 6 comments

Description

This PR adds an advanced documentation page on the dlt project and packaging features.

sh-rp avatar Feb 20 '25 12:02 sh-rp

Deploy Preview for dlt-hub-docs ready!

Latest commit: 2f92c0a0cee7a9f6e65062c2faf005395e11528a
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/682223a873140600088a50e0
Deploy Preview: https://deploy-preview-2338--dlt-hub-docs.netlify.app

netlify[bot] avatar Feb 20 '25 12:02 netlify[bot]

Overall I loved the experience.

What stood out for me:

  • Feeling of control: when I work in YAML, it feels like I have left the code world and handed control to a stone tablet, because the YAML is tightly coupled to the app underneath. But with Python access to the project contents, the YAML acts more like a manifest: some people on the team can work from it, while the rest of us pick up the orchestration from there.

Next steps in my head: figure out how I can get a rudimentary DAG from this so I can turn it into a deployment.
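For reference, a minimal sketch of that Python access, using only the current accessors that come up later in this thread (the from dlt_plus import current import path and the entity names are assumptions taken from the tutorial):

    # sketch: read project contents from Python instead of hand-parsing dlt.yml
    from dlt_plus import current  # import path assumed as in the tutorial

    # instantiate an entity declared in dlt.yml
    pipeline = current.entities().create_pipeline("my_pipeline")  # name from dlt.yml

    # open a dataset from the catalog (it must already exist physically)
    dataset = current.catalog().dataset("my_pipeline_dataset")
    print(dataset.row_counts().df())  # row counts per table as a dataframe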

Questions I have that were not answered during the experience

  • What is packaged with respect to data? Any data? Just code?
  • Can I get some kind of dependency tree from datasets/pipelines?

adrianbr avatar Feb 25 '25 16:02 adrianbr

Was super easy to use, and the experience of using the current object is the same as accessing different resources from within a dlt source and making changes to them, so it is quite self-explanatory if one already knows how dlt works. And since it is in Python, it is easier to use than YAML, for me at least; it might be easier for people who are used to editing and creating file structures in YAML. So adding a pipeline/destination was super nice and easy.

Unsure of what I'm doing wrong with the catalog commands: I can access the data objects (iterables) but can't quite load dataframes from them. I had to go check whether the runner worked, and it did: the data was loaded correctly, but I had trouble loading the dataframes themselves. I think the storage directory from last time was renamed _data? Thankfully I was able to navigate that bit too, with all the folders that contained info on the pipeline runs. I liked being able to view it right there, as opposed to either going to the root .dlt folder or simply running the CLI commands.

I was unsure how to run a particular source (in the case of multiple sources) with the runner, because I think it runs all sources mentioned in the YAML file. So I had to make those changes in the YAML file, declaring a particular source if I only want to run that one and not all of them, and then use the runner (a sketch of this workaround is below). It would be easier if the run_pipeline() function could also take a source name alongside a pipeline name.
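A sketch of that workaround, assuming the dlt.yml layout from the tutorial (the key names and the source name below are illustrative, not confirmed syntax):

    # dlt.yml (sketch): bind the pipeline to a single named source so the
    # runner only executes that source
    pipelines:
      my_pipeline:
        source: my_source                  # hypothetical source name
        destination: duckdb
        dataset_name: my_pipeline_dataset

With that in place, runner.run_pipeline("my_pipeline") runs just the one source.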

All in all, a super easy experience, especially if we're already used to dlt 🤩

hibajamal avatar Feb 26 '25 13:02 hibajamal

Good overall structure and tutorial. dataset.write for some reason didn't work for me. I would also extend the packaged project section with examples of accessing data; this could be useful in notebook contexts (see the sketch below).
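For example, a notebook-style snippet along these lines (a sketch assuming a packaged project named my_dlt_project, as in the tutorial; the table name is hypothetical):

    # in a notebook, after pip-installing the packaged project
    from my_dlt_project import current

    # open a dataset from the package's catalog and pull data as dataframes
    dataset = current.catalog().dataset("my_pipeline_dataset")
    print(dataset.row_counts().df())  # table sizes at a glance
    df = dataset.my_table.df()        # hypothetical table name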

burnash avatar Feb 28 '25 20:02 burnash

Great experience! Easy to use, everything went smoothly. I love that we can work with Python. At the same time, it feels like we initially tried to avoid Python with YAML, and now we're coming back to it. I personally don't mind, I just found it funny :)

Also, the tutorial is clear and very well written. Every time I had a question, I found the answer in the next section.

Some notes and problems described below:

  1. Add a CLI example here, because we didn't use datasets with a destination in the basic tutorial; this might be confusing (a possible example is sketched after this list).

    tip If you run dataset CLI commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in Python, more on that below.

  2. Mention here that the user should run this Python script from the same folder where dlt.yml is located.

    Run the above with python do_something.py and see the output.

  3. I misunderstood this line: pipeline = current.entities().create_pipeline("my_pipeline"). I expected the new pipeline to be created in dlt.yml. I’m probably not the only one who will think this way in the future, hehe.

  4. When I run the pipeline using runner.run_pipeline("my_pipeline"), it executes with all destinations in the list: first DuckDB, then BigQuery. Is this the expected behavior? Can we add this to the documentation? Because from the quote below, it feels like only the first destination should be used:

    If you run dataset CLI commands without providing a destination name, dlt+ will always select the first destination in the list by default.

  5. I'm getting an error when using dataset.write:

    File "/Users/alena/dlthub/temp/tutorial/temp.py", line 8, in <module>
      dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
    TypeError: 'ReadableDBAPIRelation' object is not callable
    

    I found the save method (dlt_plus.destinations.dataset.WritableDataset.save) on the dataset, and it seems to work (see the sketch after this list).

  6. Running dlt project init arrow duckdb --package my_dlt_project gives me this error:

    Usage: dlt [-h] [--version] [--disable-telemetry] [--enable-telemetry] [--non-interactive] [--debug]
               {transformation,source,project,profile,pipeline,mcp,license,destination,dbt,dataset,cache,telemetry,schema,init,render-docs,deploy} ...
    dlt: error: unrecognized arguments: my_dlt_project
    

    I’m not sure how to make it work. I’ve tried different combinations, e.g.:

    (temp) temp ❯ dlt project init arrow duckdb --package 
    ERROR: Package creation is not implemented yet, please omit the --package flag.
    NOTE: Please refer to our docs at 'https://dlthub.com/docs/intro' for further assistance.
    
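Re point 1, the CLI example could simply reuse the command that appears later in this thread; with no destination name given, dlt+ falls back to the first destination in the list, per the tip quoted above:

    dlt dataset my_pipeline_dataset row-counts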
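Re point 5, the workaround would look roughly like this (a sketch; the save signature is assumed to mirror the write call from the tutorial):

    import pandas as pd
    from dlt_plus import current  # import path assumed as in the tutorial

    dataset = current.catalog().dataset("my_pipeline_dataset")
    # dataset.write(...) raised TypeError here; save appeared to work instead
    dataset.save(
        pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}),
        table_name="my_table",
    )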

AstrakhantsevaAA avatar Feb 28 '25 22:02 AstrakhantsevaAA

In general, the tutorial is pretty straightforward and understandable. But I did hit a few snags:

  1. When I tried to run the following lines:

    # get a dataset instance pointing to the default destination (first in dataset destinations list) and access data inside of it
    # for this to work this dataset must already exist physically
    dataset = current.catalog().dataset("my_pipeline_dataset")
    # get the row counts of all tables in the dataset as a dataframe
    print(dataset.row_counts().df())
    

    I got this error:

    zsh: segmentation fault  python
    

    I even got this error when running the CLI command:

    dlt dataset my_pipeline_dataset row-counts
    

    What finally fixed it was installing pandas into my environment. I only tried this because of one of Anton's comments, and the error by itself was not helpful.

  2. I also didn't fully understand the section "Accessing entities". It reads like you can create new pipelines using the project context in code, but it did not work for me (a possible fix is sketched after this list). I tried to create a new pipeline my_pipeline_2:

    # get a pipeline instance
    pipeline = current.entities().create_pipeline("my_pipeline_2")
    # get a destination instance
    destination = current.entities().create_destination("duckdb")
    

    and I got the following error:

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Users/rahul/Desktop/dlt_project_minihackathon/env/lib/python3.11/site-packages/dlt_plus/project/entity_factory.py", line 178, in create_pipeline
        raise ProjectException(
    dlt_plus.project.exceptions.ProjectException: Destination is not defined for pipeline 'my_pipeline_2'
    
  3. I couldn't get the pip import part to work either. I followed the tutorial, but I kept hitting the following error when running uv run python test_project.py (a possible fix is sketched after this list):

    Traceback (most recent call last):
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/test_project.py", line 2, in <module>
        from my_dlt_project import current
    ImportError: cannot import name 'current' from 'my_dlt_project' (/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/my_dlt_project/__init__.py)
    rahul@dlthubs-MBP dlt_project_minihackathon_import_directory % uv run python test_project.py
    Traceback (most recent call last):
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/test_project.py", line 7, in <module>
        print(my_dlt_project.config().current_profile)
            ^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/my_dlt_project/__init__.py", line 27, in config
        return context().project
            ^^^^^^^^^
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/my_dlt_project/__init__.py", line 21, in context
        return ensure_project(run_dir=os.path.dirname(__file__), profile=access_profile())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/dlt_plus/project/run_context.py", line 383, in ensure_project
        raise ProjectRunContextNotAvailable(run_dir)
    dlt_plus.project.exceptions.ProjectRunContextNotAvailable: Path /Users/rahul/Desktop/dlt_project_minihackathon_import_directory/.venv/lib/python3.11/site-packages/my_dlt_project does not belong to dlt project.
    * it does not contain dlt.yml
    * none of parent folders contains dlt.yml
    * it does not contain pyproject.toml which defines a python module with dlt.yml in the root folder
    * it does not contain pyproject.toml which explicitly defines dlt project with `dlt_project` entry point
    Please refer to dlt+ documentation for details.
    
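Re point 2, the error suggests create_pipeline only instantiates pipelines that are already declared with a destination in dlt.yml; a sketch of such a declaration (key names assumed from the tutorial's project layout):

    # dlt.yml (sketch): declare the pipeline with a destination so that
    # current.entities().create_pipeline("my_pipeline_2") can resolve it
    pipelines:
      my_pipeline_2:
        destination: duckdb
        dataset_name: my_pipeline_2_dataset  # dataset name is a placeholder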
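Re point 3, the traceback itself lists what the installed package must provide; per its last bullet, the packaged project's pyproject.toml could declare the project explicitly (the exact entry-point group and value below are assumptions based on that message, not confirmed syntax):

    # pyproject.toml of the packaged project (sketch)
    [project.entry-points.dlt_project]
    # mark this installed package as a dlt project, as the error message asks
    my_dlt_project = "my_dlt_project"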

rahuljo avatar Mar 03 '25 15:03 rahuljo