Using Daft for offline featurization
Hey folks!
Wondering if there's any interest in leveraging Daft (www.getdaft.io) for offline featurization?
We're built with native Ray integrations, so I thought there would be some natural synergies here. Daft provides (rough example below):
- Full support for relational operations (joins, sorts, groupby + aggregations, etc.)
- Python UDF support (if you need to run custom things like a Python model)
- Multimodal data support (e.g. URLs, images, etc.)
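For a concrete picture, here's a rough sketch of what that looks like in Daft (paths and column names are made up, and the exact API may differ across Daft versions):

```python
import daft

# Illustrative only: paths and column names are made up
events = daft.read_parquet("s3://bucket/events.parquet")
users = daft.read_parquet("s3://bucket/users.parquet")

# Relational ops: join, filter, groupby + aggregation
features = (
    events.join(users, on="user_id")
    .where(daft.col("event_type") == "purchase")
    .groupby("user_id")
    .agg(daft.col("amount").sum().alias("total_spend"))
)

# Python UDF: run arbitrary Python (e.g. a custom model) over a column
@daft.udf(return_dtype=daft.DataType.float64())
def score(amounts):
    return [a * 0.1 for a in amounts.to_pylist()]

features = features.with_column("score", score(daft.col("total_spend")))

# Multimodal: URL columns can be downloaded/decoded into images, e.g.
#   df.with_column("img", daft.col("img_url").url.download().image.decode())

features.collect()
```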
Happy to chat more about use-cases if this is of interest :)
Hey @jaychia, happy to discuss. From my understanding, here are possible intersections:
1. Volga has a Pandas-like API to define both online and offline data pipelines: transform, filter, join, groupby/aggregate, drop, etc.
2. Volga uses a Kappa architecture, meaning that for offline featurization it simply re-runs the online pipeline on offline data. This has the benefit of higher accuracy and guaranteed consistency between the online and offline pipelines, but it can come at a performance cost, especially for large datasets (see the sketch after this list).
3. Daft is focused on offline (vectorized?) computation, meaning it handles large datasets well and also has a Pandas-like API, but it has no streaming/real-time capabilities.
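To make the Kappa point concrete, here's a hypothetical sketch (this is NOT the real Volga API, and `stream_source`/`batch_source` are made-up names): one pipeline definition gets executed in both modes, so the transformation logic can't drift between online and offline.

```python
# Hypothetical pseudocode -- NOT the real Volga API, just to show the
# Kappa idea: one pipeline definition, two execution modes.

def feature_pipeline(source):
    """Single definition of the feature logic, shared by online and offline."""
    return (
        source
        .filter(lambda e: e["event_type"] == "purchase")
        .group_by("user_id")
        .aggregate(field="amount", op="sum", window="7d")
    )

# Online: runs continuously over the live event stream
#   online_features = feature_pipeline(stream_source("purchases"))
#
# Offline: the *same* pipeline is replayed over historical events, so
# online/offline consistency holds by construction (at a perf cost for
# very large datasets, since replay follows the streaming path).
#   offline_features = feature_pipeline(batch_source("s3://bucket/purchases/"))
```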
From the above, an opportunity I see to benefit both projects is a common pandas-like interface between Volga and Daft, built on top of both systems. This could be done by making a Daft-compatible API wrapper for the Volga API mentioned in (1), and vice versa (we'd need a proper design here to see which features are compatible across both systems w.r.t. online and offline data processing); a rough sketch of the shape follows the list below.
This will:
- Add an option for Volga to run offline featurization on Daft for cases where performance matters most (very large datasets). I'm curious how this would affect accuracy and consistency between the online and offline paths.
- Daft may also benefit from being able to run streaming workloads on top of Ray via Volga's engine, which would unlock a whole class of real-time use cases.
- Daft would also get an ML-first data manipulation interface aimed at real-life ML systems, which typically operate on data sources represented as timestamped event streams.
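To make the wrapper idea a bit more concrete, here's a hypothetical shape for the design (none of these classes exist in either project; names like `FeatureFrame` are made up for illustration):

```python
from abc import ABC, abstractmethod

# Hypothetical design sketch -- none of these classes exist in either
# project. The idea: one pandas-like frontend, two swappable backends.

class FeatureFrame(ABC):
    """Shared subset of verbs that both engines can support."""

    @abstractmethod
    def filter(self, predicate): ...

    @abstractmethod
    def join(self, other, on): ...

    @abstractmethod
    def groupby_agg(self, key, agg): ...


class DaftFeatureFrame(FeatureFrame):
    """Offline backend: delegates to a daft.DataFrame for large batch jobs."""

    def __init__(self, df):
        self._df = df

    def filter(self, predicate):
        return DaftFeatureFrame(self._df.where(predicate))

    def join(self, other, on):
        return DaftFeatureFrame(self._df.join(other._df, on=on))

    def groupby_agg(self, key, agg):
        return DaftFeatureFrame(self._df.groupby(key).agg(agg))


class VolgaFeatureFrame(FeatureFrame):
    """Online backend: would delegate to Volga's streaming pipeline API."""

    def filter(self, predicate):
        raise NotImplementedError("sketch only")

    def join(self, other, on):
        raise NotImplementedError("sketch only")

    def groupby_agg(self, key, agg):
        raise NotImplementedError("sketch only")
```

The interesting design question is which verbs actually have the same semantics in both modes (e.g. windowed aggregations over event time vs. global aggregations over a static table).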
Let me know what you think!
(1) sounds like the most compelling integration point! Happy to explore what might make sense there.