dbt-fal
[Design Doc] fal-dbt feature store
What are we building?
A feature store is a data system that facilitates managing data transformations centrally for predictive analysis and ML models in production.
fal-dbt feature store is a feature store implementation that consists of a dbt package and a python library.
Why are we doing this?
Empower the analytics engineer: ML models and analytics operate on the same data. Analytics engineers know this data inside out. They are the ones setting up metrics, ensuring data quality and freshness. Why shouldn’t they be the ones responsible for predictive analysis? With the rise of open source modelling libraries, most of the work that goes into an ML model happens on the data processing side.
Leverage the Warehouse: Warehouses are secure, scalable and relatively cheap environments for data transformation. Doing transformations in other environments is at least an order of magnitude more complicated. The warehouse should be part of the ML engineer’s toolkit, especially for batch predictions. dbt is the best tool out there for doing transformations with the warehouse. The fal-dbt feature store will let ML workflows leverage all the advantages of modern data warehouses.
Strategy
The first building block for the fal feature store is the fal-dbt cli tool. Using the fal-dbt cli, dbt users are able to perform various tasks via python scripts after their dbt workflows.
✅ Milestone 1: Add ability to read feature store config from dbt ymls
✅ Milestone 2: Run create_dataset from the fal-dbt python client
✅ Milestone 3: Move features to the online store and provide an online store client (already possible with the fal-dbt cli)
✅ Milestone 4: Add ability to ETL data from a fal script
✅ Milestone 5: Model Monitoring
Stretch Goals
⭐️ Milestone: Logged real time models
Online/Offline Predictions vs Logged Features
There are roughly three types of ML systems in terms of complexity: offline predictions, online predictions with batch features, and online predictions with real-time features. Most of the use cases we have seen follow the same order, with "online predictions with real-time features" being the least common.
A warehouse can handle all the feature calculations for offline use cases; combined with the Firestore reverse ETL, we can also handle online predictions with batch features. This leaves out "online predictions with real-time features", which is out of scope for the initial implementation. We plan on tackling that with logged features as a stretch goal.
Implementation
Feature Definitions
Feature store configurations are added under model configurations as part of the fal meta tag. Each feature is required to have an entity_id and a timestamp field. The entity_id and timestamp fields are later used for the point-in-time join of a list of features and a label.
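The point-in-time join has the same semantics as pandas.merge_asof: for each label row, pick the most recent feature values observed at or before the label’s timestamp for the same entity. A minimal sketch, purely illustrative since the real join happens in the warehouse; the frames, columns and values below are made up:

import pandas as pd

# Hypothetical label rows: one row per (entity, timestamp, label value).
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "start_date": pd.to_datetime(["2021-01-10", "2021-02-10", "2021-01-15"]),
    "label": [0, 1, 1],
})

# Hypothetical feature rows keyed by the same entity_id / timestamp columns.
feature = pd.DataFrame({
    "user_id": [1, 1, 2],
    "start_date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-01-01"]),
    "trip_count_last_week": [3, 7, 2],
})

# Point-in-time join: for every label row, take the latest feature value
# observed at or before the label's timestamp, per user_id.
dataset = pd.merge_asof(
    labels.sort_values("start_date"),
    feature.sort_values("start_date"),
    on="start_date",
    by="user_id",
    direction="backward",
)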
Optionally, feature definitions can include fal scripts for downstream workflows. For example, the dbt model below includes a make_avaliable_online.py (link to example) script, a typical ETL step that moves the latest values of features from the data warehouse to an OLTP database; a sketch of what such a script might look like follows the example.
## schema.yml
models:
  - name: bike_duration
    columns:
      - name: trip_count_last_week
      - name: trip_duration_last_week
      - name: user_id
      - name: start_date
    meta:
      fal:
        feature_store:
          entity_id: user_id
          timestamp: start_date
        scripts:
          - make_avaliable_online.py
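A sketch of what make_avaliable_online.py could look like. It assumes fal’s script runtime, where ref and context are provided when the script runs after the dbt model; the online-store write at the end is a hypothetical placeholder, not an existing API:

# make_avaliable_online.py -- hedged sketch of a fal script run after the model.
# `ref` and `context` are assumed to be injected by the fal runtime.
df = ref(context.current_model.name)  # the dbt model as a pandas DataFrame

# Keep only the latest feature row per entity, using the configured
# entity_id / timestamp columns (user_id / start_date in the example above).
latest = df.sort_values("start_date").groupby("user_id").tail(1)

# Push the latest feature values to an OLTP / online store.
# `write_to_online_store` is a placeholder helper, not part of fal.
write_to_online_store(latest, table="bike_duration_features")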
A label is also defined as a feature, using the configuration above. The fal-dbt feature store doesn’t have any requirements or assumptions about what constitutes a label.
Create Dataset
A feature store configuration doesn’t have any effect on your infrastructure unless it is used in a dataset calculation. A dataset in fal-dbt feature store is a dataframe that includes all the features and the label for the machine learning model being built.
There are two ways to create a dataset.
Creating a dataset with dbt macro:
-- dataset_name.sql
SELECT
*
FROM
{{ feature_store.create_dataset(
features = ["total_transactions", "credit_score"],
label_name = "credit_decision"
) }}
This model can later be referenced in a fal script:
df = ref("dataset_name")
Creating a dataset with python:
from fal.ml.feature_store import FeatureStore
store = FeatureStore(creds="/../creds.json")  # path to service account
ds = store.create_dataset(
name="dataset_name",
features=["total_transactions", "credit_score"],
label="credit_decision"
)
df = ds.get_pandas_dataframe()
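Since the dataset is a plain pandas DataFrame, it can be handed straight to any modelling library. A hedged continuation of the example above, with scikit-learn as an arbitrary choice and the column names taken from the example:

# Hypothetical continuation: train a model on the dataset built above.
from sklearn.linear_model import LogisticRegression

X = df[["total_transactions", "credit_score"]]  # feature columns
y = df["credit_decision"]                        # label column

model = LogisticRegression().fit(X, y)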
Python Client
from dataclasses import dataclass
from typing import List, Tuple

class FeatureStore:
    def create_dataset(self, dataset_name: str, features: List[str], label: str): ...
    def get_dataset(self, dataset_name: str): ...

@dataclass
class OnlineClient:
    client_config: ClientConfig

    def get_feature_vector(self, dbt_model: str, feature_name: str): ...
    def get_feature_vectors(self, feature_list: List[Tuple[str, str]]): ...
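A hypothetical usage sketch of the two clients; the argument values are made up, and this is the proposed interface rather than a released API:

# Build an offline dataset through the FeatureStore client.
store = FeatureStore(creds="creds.json")
ds = store.get_dataset("dataset_name")

# Fetch the latest value of a feature for serving through the online client.
online = OnlineClient(client_config=config)  # `config` assumed to be created elsewhere
vector = online.get_feature_vector(dbt_model="bike_duration", feature_name="trip_count_last_week")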
Scheduling
Scheduling is usually an afterthought in existing feature store implementations. It is left to the users to handle using tools like Airflow. fal-dbt feature store’s close integration with dbt offloads scheduling responsibilities to the dbt scheduler.
Incremental Calculations
dbt incremental models make sure feature calculations are not wasteful: features can be calculated incrementally, and they stay fresh if scheduled properly with the dbt scheduler. In the fal-dbt feature store there are no lazy feature calculations; all features are assumed to be fresh.
Stretch Goals
Logged Features
We have talked about this before, but we never had a clear design on how to achieve it. It fits very well with the "do the simple thing first" tenet mentioned above. Logged features achieve real-time transformations by transforming the data with the application code and then storing the transformed version in the data warehouse for training. This allows transformation logic to live in just one place (the application code) instead of being duplicated in the warehouse and the application. Not only does it live in the application code, it is also written with the web stack, where applying business logic is easier with the help of an ORM or similar.
This is almost too good to be true, but problems start to emerge when the transformation code changes over time. Once a change is made in the application code, the training data still has the shape of the older data. The model has to be retrained, and the older data needs to be back-filled (just once) to apply the new transformation. This is not ideal, but it is better than maintaining two code bases.
How can we build tools to make this easier?
- Make back-filling easier
- Make writing application code with warehouse SQL easier
Love love love this. Let me know how I can help :)
👀 https://github.com/fal-ai/dbt_feature_store 👀