[Feature] A scalable DPK transform for generating embeddings.
Search before asking
- [x] I searched the issues and found no similar issues.
Component
transforms/Other
Feature
A scalable DPK transform for generating embeddings.
cc: @klwuibm
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
A simple transform is available in transforms/language/text_encoder.
I have created a fork and branch that adapts the text_encoder to run on GPUs at scale: https://github.com/klwuibm/data-prep-kit/blob/text_encoder/transforms/language/text_encoder/dpk_text_encoder/transform.py. @ian-cho
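For reference, a minimal sketch of what GPU-aware model loading could look like; the model name is a placeholder, not taken from the fork:

```python
# Hedged sketch: use a GPU when one is visible, otherwise fall back to CPU.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)
```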
Additional work:
- requirements.txt needs `pytorch` (CUDA build) and `sentence_transformers`; see the pinned versions below.
- If the GPU clusters can use a container image, then we need to ensure the Dockerfile has the proper packages installed.
- The Makefile needs to be updated, too.
```
torch==2.1.0+cu118
torchvision==0.16.0+cu118
torchaudio==2.1.0+cu118
numpy>=1.20.0
sentence_transformers>=3.4
```
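Note that the `+cu118` wheels are normally installed from PyTorch's extra index (https://download.pytorch.org/whl/cu118). A minimal sanity check to run inside the container or on a cluster node:

```python
# Verify that the CUDA build of torch is actually installed and usable.
import torch

print(torch.__version__)          # expect something like 2.1.0+cu118
print(torch.cuda.is_available())  # expect True on a GPU node
```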
We plan to write the output table with embeddings to storage in Lance format rather than Parquet, for storage efficiency and, later, retrieval efficiency. We therefore want to represent the embeddings in `transform()` as NumPy arrays for better efficiency; `convert_to_numpy=True` is good to use.
```python
import numpy as np
import pyarrow as pa

# Encode each row of the content column into an embedding vector.
embeddings = list(
    map(
        lambda x: self.model.encode(x, convert_to_numpy=True, show_progress_bar=False),
        table[self.content_column_name].to_pylist(),
    )
)
# Downcast to float16 to halve storage, then build a pyarrow array.
embeddings_float16 = [emb.astype(np.float16) for emb in embeddings]
pyarrow_embeddings = pa.array(embeddings_float16)
```
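A side note on GPU efficiency: `SentenceTransformer.encode()` also accepts a list of texts and batches them internally, so a single call over the whole column may keep the GPU busier than the per-row `map` above. A hedged sketch, assuming the same `self.model` and table fields (the `batch_size` value is illustrative):

```python
# Alternative sketch: one batched encode call instead of per-row encoding.
texts = table[self.content_column_name].to_pylist()
embeddings = self.model.encode(
    texts, batch_size=64, convert_to_numpy=True, show_progress_bar=False
)
# encode() returns a 2-D ndarray here; split it into rows for pyarrow.
pyarrow_embeddings = pa.array(list(embeddings.astype(np.float16)))
```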
On further study of the Lance format and LanceDB, new challenges have come up.
- The current `DataAccess` cannot support `Transform.transform()` returning a pyarrow table and having that table written to the output folder as a `.lance` file.
- Instead, the pyarrow table can be written into LanceDB in COS. In that case, either a new `DataAccessLance` could be created to handle the pyarrow-table-to-Lance conversion, or the write of the pyarrow table to LanceDB could be done by the `transform()` method itself (see the sketch after this list).
- After we store the embeddings in LanceDB, the subsequent transform needs to read the data with embeddings back for further processing, such as clustering, or retrieving documents whose embeddings are similar to a given embedding by cosine similarity. The vector index in LanceDB could hopefully speed up these retrieval operations.
- At the end of the data prep pipeline, the output might still have to be in Parquet format.
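To make the LanceDB option above concrete, here is a minimal sketch using the `lancedb` Python client. The bucket URI, table name, and `query_embedding` are placeholders; COS credentials (via its S3-compatible endpoint) and a fixed-size vector column for the embeddings are assumed:

```python
# Hedged sketch of the LanceDB path discussed above; all names are placeholders.
import lancedb
import pyarrow.parquet as pq

# transform() side: write the pyarrow table (with its embeddings column) to LanceDB.
db = lancedb.connect("s3://my-cos-bucket/lance")  # COS S3-compatible endpoint
tbl = db.create_table("embeddings", data=table, mode="overwrite")

# Downstream transform: nearest-neighbour retrieval by cosine similarity.
results = tbl.search(query_embedding).metric("cosine").limit(10).to_arrow()

# End of the pipeline: materialise the table back to Parquet if required.
pq.write_table(tbl.to_arrow(), "output/embeddings.parquet")
```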