Hugging Face Datasets Plugin
TL;DR
Hugging Face provides great packages to make working with state-of-the-art language models easy. Integrating with Flyte would connect ETL to the training and inference of deep learning models seamlessly.
Type
- [ ] Bug Fix
- [ ] Feature
- [x] Plugin
Are all requirements met?
- [ ] Code completed
- [ ] Smoke tested
- [ ] Unit tests added
- [ ] Code documentation added
- [ ] Any pending items have an associated Issue
Complete description
You can use Hugging Face to create high-quality embeddings, which are becoming really valuable to a lot of companies. Flyte could elegantly handle the different infrastructure considerations. Notice that there is no model training, only inference with a pretrained model, which makes this workflow especially appealing.
The first integration is adding Hugging Face's datasets to Flyte's StructuredDataset. The datasets library is a very performant way to pass data into neural networks. It originated as a fork of TensorFlow Datasets, but uses Arrow instead of TFRecords. I am excited by the idea of having an ETL job output a pyspark.sql.DataFrame and then doing batch training and batch inference with a Hugging Face dataset seamlessly.
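One way to picture this integration: the data is stored once in a shared format, and each in-memory type registers a handler that encodes to or decodes from that format, so a producer and a consumer can use different types. The sketch below models only that pattern in plain Python. The names `HANDLERS`, `register`, `encode`, `decode`, `RowTable`, and `ColumnTable` are hypothetical and are not flytekit or Hugging Face APIs, and JSON stands in for a columnar format such as Parquet or Arrow.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RowTable:
    """Hypothetical stand-in for a row-oriented frame produced by an ETL job."""
    rows: List[dict]


@dataclass
class ColumnTable:
    """Hypothetical stand-in for a column-oriented dataset used for training."""
    columns: Dict[str, list]


# Registry: in-memory type -> (encoder to shared format, decoder from shared format).
HANDLERS: Dict[type, Tuple[Callable, Callable]] = {}


def register(py_type: type, encoder: Callable, decoder: Callable) -> None:
    HANDLERS[py_type] = (encoder, decoder)


def encode(value) -> str:
    """Serialize any registered type to the shared format (JSON here, not Parquet)."""
    return HANDLERS[type(value)][0](value)


def decode(payload: str, as_type: type):
    """Materialize the stored payload as whichever registered type the consumer wants."""
    return HANDLERS[as_type][1](payload)


def _rows_to_json(table: RowTable) -> str:
    return json.dumps(table.rows)


def _json_to_rows(payload: str) -> RowTable:
    return RowTable(json.loads(payload))


def _columns_to_json(table: ColumnTable) -> str:
    names = list(table.columns)
    length = len(table.columns[names[0]]) if names else 0
    return json.dumps([{k: table.columns[k][i] for k in names} for i in range(length)])


def _json_to_columns(payload: str) -> ColumnTable:
    columns: Dict[str, list] = {}
    for row in json.loads(payload):
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return ColumnTable(columns)


register(RowTable, _rows_to_json, _json_to_rows)
register(ColumnTable, _columns_to_json, _json_to_columns)

# The ETL step emits one type; the training step asks for another.
etl_output = RowTable([{"text": "hello", "label": 0}, {"text": "world", "label": 1}])
stored = encode(etl_output)
training_input = decode(stored, ColumnTable)
```

The consumer never needs to know which type produced the data, which is the property that would let a Spark job feed a Hugging Face dataset.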
The second integration would be coming up with a highly scalable task for step 2 (inference) in the following workflow:
- ETL: prepare dataset of text
- Inference: run data through a Hugging Face model pipeline
- Upload: Push results to a database that can handle vectors, like Pinecone
I have heard from @gdj0nes that this is a common workflow with infrastructure pain points.
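The three steps can be sketched as plain Python functions, as a rough illustration of where batching fits in. Everything here is a placeholder: `fake_embed` stands in for a Hugging Face model pipeline, `VectorStore` stands in for a vector database such as Pinecone, and in Flyte each function would presumably become its own task. None of these names come from this PR.

```python
import hashlib
from typing import Dict, Iterator, List, Tuple


def prepare_texts(raw: List[str]) -> List[str]:
    """Step 1 (ETL): normalize whitespace and drop empty records."""
    cleaned = (" ".join(record.split()) for record in raw)
    return [text for text in cleaned if text]


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_embed(batch: List[str], dim: int = 4) -> List[List[float]]:
    """Step 2 placeholder: a deterministic hash-derived vector per text.

    A real task would run the batch through a model here instead.
    """
    vectors = []
    for text in batch:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors


class VectorStore:
    """Step 3 placeholder: an in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.records: Dict[str, List[float]] = {}

    def upsert(self, items: List[Tuple[str, List[float]]]) -> None:
        for key, vector in items:
            self.records[key] = vector


def run_workflow(raw: List[str], store: VectorStore, batch_size: int = 2) -> int:
    """Chain ETL, batched inference, and upload; return the number of stored records."""
    texts = prepare_texts(raw)
    for batch in batched(texts, batch_size):
        vectors = fake_embed(batch)
        store.upsert(list(zip(batch, vectors)))
    return len(store.records)


store = VectorStore()
count = run_workflow(
    ["  Flyte  runs tasks ", "", "Hugging Face models", "vector search"], store
)
```

Because each batch is embedded and uploaded independently, step 2 is the natural place to fan out across workers or GPUs.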
Finally, Hugging Face has a platform where you can save datasets, models, and deploy ML applications. There are opportunities to integrate with their platform that should be mentioned, but are lower priority.
Tracking Issue
https://github.com/flyteorg/flyte/issues/
Follow-up issue
NA
OR
https://github.com/flyteorg/flyte/issues/
Codecov Report
Merging #1116 (91e4d7d) into master (aff19cb) will increase coverage by 0.12%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #1116 +/- ##
==========================================
+ Coverage 68.38% 68.51% +0.12%
==========================================
Files 288 288
Lines 25963 26095 +132
Branches 2899 2920 +21
==========================================
+ Hits 17756 17880 +124
- Misses 7728 7736 +8
Partials 479 479
| Impacted Files | Coverage Δ | |
|---|---|---|
| flytekit/tools/repo.py | 73.68% <0.00%> (ø) | |
| flytekit/tools/fast_registration.py | 89.06% <0.00%> (ø) | |
| tests/flytekit/unit/configuration/test_internal.py | 100.00% <0.00%> (ø) | |
| ...ests/flytekit/unit/tools/test_fast_registration.py | 100.00% <0.00%> (ø) | |
| ...ctured_dataset/test_structured_dataset_workflow.py | 100.00% <0.00%> (ø) | |
| flytekit/core/interface.py | 61.80% <0.00%> (+0.02%) | :arrow_up: |
| tests/flytekit/unit/core/test_type_engine.py | 98.39% <0.00%> (+0.05%) | :arrow_up: |
| flytekit/clis/sdk_in_container/package.py | 96.29% <0.00%> (+0.14%) | :arrow_up: |
| tests/flytekit/unit/core/test_flyte_pickle.py | 91.37% <0.00%> (+0.26%) | :arrow_up: |
| tests/flytekit/unit/cli/pyflyte/test_run.py | 99.20% <0.00%> (+0.32%) | :arrow_up: |
| ... and 5 more | | |
@esadler-hbo you rock! Love all 3 integration goals.
@esadler-hbo & @samhita-alla would you folks be open to writing a blog?
@esadler-hbo let me take a look at this and try to fix some of the handling around protocol
@wild-endeavor amazing! I’ll get a chance to work on this more this weekend.
Thanks! Yeah I really need to get to those changes I was talking about today.
tagged you also on the other PR @easadler-hbo