Hugging Face Datasets Plugin

Open esadler-hbo opened this issue 3 years ago • 6 comments

TL;DR

Hugging Face provides great packages to make working with state-of-the-art language models easy. Integrating with Flyte would connect ETL to the training and inference of deep learning models seamlessly.

Type

  • [ ] Bug Fix
  • [ ] Feature
  • [x] Plugin

Are all requirements met?

  • [ ] Code completed
  • [ ] Smoke tested
  • [ ] Unit tests added
  • [ ] Code documentation added
  • [ ] Any pending items have an associated Issue

Complete description

You can use Hugging Face to create high-quality embeddings, which are becoming really valuable to a lot of companies. Flyte could elegantly handle the different infrastructure considerations. Notice there is no model training, which makes this workflow especially great.

The first integration is adding Hugging Face's datasets to Flyte's StructuredDataset type. The datasets library is a very performant way to pass data into neural networks. It is based on tf.data.Dataset, but uses Arrow instead of TFRecords. I am excited by the idea of having an ETL job output a pyspark.sql.DataFrame and then doing batch training and batch inference with a Hugging Face dataset seamlessly.

The second integration would be coming up with a highly scalable task for step 2 in the following workflow:

  1. ETL: prepare dataset of text
  2. Inference: run data through a Hugging Face model pipeline
  3. Upload: Push results to a database that can handle vectors, like Pinecone

I have heard from @gdj0nes that this is a common workflow that has infra pain points.
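To make the three steps concrete, here is a minimal, dependency-free sketch of the shape such a workflow could take. Everything in it is hypothetical: `prepare_texts`, `run_inference`, `upload`, and `toy_embed` are illustrative names, not flytekit or Hugging Face APIs. In a real setup each function would presumably be a Flyte task, `toy_embed` would be replaced by a call into a Hugging Face model pipeline, and the `dict` would be a vector database client.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], List[Vector]]


def prepare_texts(raw_rows: Sequence[str]) -> List[str]:
    """Step 1 (ETL): normalise whitespace and drop empty rows."""
    cleaned = (" ".join(row.split()) for row in raw_rows)
    return [text for text in cleaned if text]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_embed(batch: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a model: [character count, word count]."""
    return [[float(len(text)), float(len(text.split()))] for text in batch]


def run_inference(
    texts: Sequence[str], embed: EmbedFn, batch_size: int = 2
) -> List[Tuple[str, Vector]]:
    """Step 2 (Inference): embed the texts batch by batch."""
    results: List[Tuple[str, Vector]] = []
    for batch in batched(texts, batch_size):
        vectors = embed(batch)
        results.extend(zip(batch, vectors))
    return results


def upload(records: Sequence[Tuple[str, Vector]], store: Dict[str, Vector]) -> int:
    """Step 3 (Upload): write vectors to a store (a dict stands in for a vector DB)."""
    for text, vector in records:
        store[text] = vector
    return len(records)


if __name__ == "__main__":
    vector_store: Dict[str, Vector] = {}
    texts = prepare_texts(["  hello   world ", "", "flyte"])
    written = upload(run_inference(texts, toy_embed), vector_store)
    print(written, vector_store)
```

Passing the embedding function in as a parameter keeps step 2 independent of any particular model, and the batch size is the natural knob for scaling that step.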

Finally, Hugging Face has a platform where you can save datasets, models, and deploy ML applications. There are opportunities to integrate with their platform that should be mentioned, but are lower priority.

Tracking Issue

https://github.com/flyteorg/flyte/issues/

Follow-up issue

NA OR https://github.com/flyteorg/flyte/issues/

esadler-hbo avatar Jul 30 '22 15:07 esadler-hbo

Codecov Report

Merging #1116 (91e4d7d) into master (aff19cb) will increase coverage by 0.12%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1116      +/-   ##
==========================================
+ Coverage   68.38%   68.51%   +0.12%     
==========================================
  Files         288      288              
  Lines       25963    26095     +132     
  Branches     2899     2920      +21     
==========================================
+ Hits        17756    17880     +124     
- Misses       7728     7736       +8     
  Partials      479      479              
Impacted Files Coverage Δ
flytekit/tools/repo.py 73.68% <0.00%> (ø)
flytekit/tools/fast_registration.py 89.06% <0.00%> (ø)
tests/flytekit/unit/configuration/test_internal.py 100.00% <0.00%> (ø)
...ests/flytekit/unit/tools/test_fast_registration.py 100.00% <0.00%> (ø)
...ctured_dataset/test_structured_dataset_workflow.py 100.00% <0.00%> (ø)
flytekit/core/interface.py 61.80% <0.00%> (+0.02%) :arrow_up:
tests/flytekit/unit/core/test_type_engine.py 98.39% <0.00%> (+0.05%) :arrow_up:
flytekit/clis/sdk_in_container/package.py 96.29% <0.00%> (+0.14%) :arrow_up:
tests/flytekit/unit/core/test_flyte_pickle.py 91.37% <0.00%> (+0.26%) :arrow_up:
tests/flytekit/unit/cli/pyflyte/test_run.py 99.20% <0.00%> (+0.32%) :arrow_up:
... and 5 more

codecov[bot] avatar Jul 30 '22 15:07 codecov[bot]

@esadler-hbo you rock! Love all 3 integration goals.

@esadler-hbo & @samhita-alla would you folks be open to writing a blog?

kumare3 avatar Jul 31 '22 14:07 kumare3

@esadler-hbo let me take a look at this and try to fix some of the handling around protocol

wild-endeavor avatar Aug 02 '22 21:08 wild-endeavor

@wild-endeavor amazing! I’ll get a chance to work on this more this weekend.

esadler-hbo avatar Aug 02 '22 21:08 esadler-hbo

Thanks! Yeah I really need to get to those changes I was talking about today.

wild-endeavor avatar Aug 04 '22 15:08 wild-endeavor

tagged you also on the other PR @easadler-hbo

wild-endeavor avatar Aug 19 '22 17:08 wild-endeavor