define Python transformations with Hamilton
Feature description
With "Extract, Transform, Load" (ETL) as a frame of reference, dlt does "EL" and Hamilton does "T".
What is Hamilton
In short, Hamilton is a library to define a DAG of data transformations in Python. It is similar in scope to dbt, but it supports all Python types, not just tables/dataframes/SQL constructs. Users can write transformations with Python primitives, pandas, polars, Spark, Ibis, etc. Many users adopt Hamilton for feature engineering (jaffle shop example). It also allows users to define machine learning and LLM dataflows.
It uses a declarative API, which essentially consists of three steps:
- define your DAG in a Python module
- pass the DAG to the Driver responsible for execution
- request nodes from the DAG to be executed (e.g., features, tables, models to train)
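To make the declarative idea concrete, here is a toy, standard-library-only sketch. It is not Hamilton's implementation or API: the `execute` helper and the three example functions are hypothetical. It only mimics the convention that a function's name is a node and its parameter names are that node's dependencies.

```python
import inspect


# Step 1: "define your DAG" as plain functions.
# Function name = node name; parameter names = upstream nodes.
def raw_prices() -> list[float]:
    return [10.0, 12.0, 8.0]


def mean_price(raw_prices: list[float]) -> float:
    return sum(raw_prices) / len(raw_prices)


def price_deviation(raw_prices: list[float], mean_price: float) -> list[float]:
    return [price - mean_price for price in raw_prices]


# Steps 2 and 3: a hypothetical stand-in for a driver. It resolves only
# the nodes needed for the requested outputs and computes each node once.
def execute(functions, requested, inputs=None):
    nodes = {fn.__name__: fn for fn in functions}
    cache = dict(inputs or {})  # externally supplied inputs take precedence

    def resolve(name):
        if name not in cache:
            fn = nodes[name]
            kwargs = {param: resolve(param) for param in inspect.signature(fn).parameters}
            cache[name] = fn(**kwargs)
        return cache[name]

    return {name: resolve(name) for name in requested}


results = execute([raw_prices, mean_price, price_deviation], ["price_deviation"])
print(results)
```

With Hamilton itself, the three functions would live in a module that is passed to the Driver, and `price_deviation` would be requested by name in the same way.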
Integration ideas
dlt plugin for Hamilton
We already added a dlt plugin in Hamilton, allowing users to load a dlt.Resource as input and save outputs to a dlt.Destination. This is useful for Hamilton users who want to start using dlt and run both as a unified pipeline. Also, some Hamilton DAG nodes might be "incompatible with dlt" (e.g., an XGBoost model).
Hamilton helper for dlt
It would make sense to have a "Hamilton helper" in dlt, similar to the dbt runner. It would help dlt users package their Hamilton code and bundle it with their dlt pipeline for execution. A typical pattern would look like this (full ref):
import dlt
from hamilton import driver

import slack  # NOTE this is dlt code, not an official Slack library
import transform  # module containing dataflow definition

# EXTRACT & LOAD
pipeline = dlt.pipeline(
    pipeline_name="slack",
    destination="duckdb",
    dataset_name="slack_community_backup",
)
source = slack.slack_source(
    selected_channels=["general"], replies=True
)
load_info = pipeline.run(source)

# TRANSFORM
dr = driver.Builder().with_modules(transform).build()
results = dr.execute(
    ["insert_threads"],  # request the `insert_threads` node
    inputs=dict(pipeline=pipeline),  # pass the dlt pipeline as an input
)
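
The `transform` module used in this pattern is not shown here. As a rough sketch of what such a module could contain, the hypothetical version below uses plain, type-annotated Python functions. It is not the module from the linked reference: the table name `general_message`, the columns `thread_ts` and `text`, and the intermediate nodes `messages` and `threads` are all assumptions made for illustration. The `pipeline` parameter is typed as `Any` to keep the sketch free of imports; with dlt installed, `dlt.Pipeline` would be the more precise hint.

```python
# transform.py: hypothetical sketch, not the module from the linked reference.
# Assumed for illustration: the table `general_message` and its columns
# `thread_ts` and `text`.
from typing import Any


def messages(pipeline: Any) -> list[dict]:
    """Read the loaded rows back through the dlt pipeline's SQL client."""
    with pipeline.sql_client() as client:
        rows = client.execute_sql("SELECT thread_ts, text FROM general_message")
    return [{"thread_ts": ts, "text": text} for ts, text in rows]


def threads(messages: list[dict]) -> list[dict]:
    """Count the messages that belong to each thread."""
    counts: dict = {}
    for message in messages:
        counts[message["thread_ts"]] = counts.get(message["thread_ts"], 0) + 1
    return [{"thread_ts": ts, "n_messages": n} for ts, n in counts.items()]


def insert_threads(threads: list[dict], pipeline: Any) -> Any:
    """Write the aggregated rows to a `threads` table and return the load info."""
    return pipeline.run(threads, table_name="threads")
```

Because `pipeline` is supplied through `inputs`, every function that declares a `pipeline` parameter receives the same dlt pipeline object, and requesting `insert_threads` pulls in `threads` and `messages` automatically.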
Action
- get a sense of what dlt users are looking for, their needs regarding Python transforms and usage patterns
- define an API and work towards a Hamilton helper in dlt
- maybe we only need to publish docs and guides