
[Feature] A scalable DPK transform for generating embeddings.

Open Hajar-Emami opened this issue 9 months ago • 6 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/Other

Feature

A scalable DPK transform for generating embeddings.

cc: @klwuibm

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Hajar-Emami avatar Mar 14 '25 13:03 Hajar-Emami

A simple transform is available in transforms/language/text_encoder.

klwuibm avatar Mar 14 '25 17:03 klwuibm

I have created a fork and branch to adapt the text_encoder to GPU for scale operations. https://github.com/klwuibm/data-prep-kit/blob/text_encoder/transforms/language/text_encoder/dpk_text_encoder/transform.py. @ian-cho

klwuibm avatar Mar 17 '25 15:03 klwuibm

Additional work:

  1. requirements.txt needs PyTorch (with CUDA support) and sentence_transformers.
  2. If the GPU clusters can use a container image, we need to ensure the Dockerfile has the proper packages installed.
  3. The Makefile needs to be updated, too.

klwuibm avatar Mar 17 '25 15:03 klwuibm

torch==2.1.0+cu118
torchvision==0.16.0+cu118
torchaudio==2.1.0+cu118
numpy>=1.20.0
sentence_transformers>=3.4

klwuibm avatar Mar 17 '25 15:03 klwuibm

We are planning to write the output table with embeddings to storage in Lance format rather than Parquet, for storage efficiency and, later, retrieval efficiency. We therefore want to represent the embeddings in transform() as NumPy arrays for better efficiency; convert_to_numpy=True is the right option to use.

import numpy as np
import pyarrow as pa

embeddings = list(
    map(
        lambda x: self.model.encode(x, convert_to_numpy=True, show_progress_bar=False),
        table[self.content_column_name].to_pylist(),
    )
)

# Downcast to float16 to halve the storage footprint of the embeddings.
embeddings_float16 = [emb.astype(np.float16) for emb in embeddings]
pyarrow_embeddings = pa.array(embeddings_float16)

klwuibm avatar Mar 17 '25 20:03 klwuibm

On further study of the Lance format and LanceDB, new challenges have popped up.

  1. The current DataAccess cannot support Transform.transform() returning a pyarrow table that is then written to the output folder as a .lance file.
  2. Instead, the pyarrow table can be written into LanceDB in COS. In that case, either a new DataAccessLance is created to handle the pyarrow table -> Lance conversion, or the writing of the pyarrow table to LanceDB can be done by the transform() method itself.
  3. After we store the embeddings in LanceDB, the subsequent transform needs to read the data with embeddings to do further processing, such as clustering or retrieving documents whose embeddings are similar to a given embedding based on cosine similarity. The vector index in LanceDB could hopefully speed up the retrieval operations.
  4. At the end of the data prep pipeline, the output might have to be in Parquet format.
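Until a LanceDB vector index is wired in, the cosine-similarity retrieval mentioned in item 3 can be sketched with plain NumPy (the function name, embedding shapes, and toy corpus below are illustrative):

```python
import numpy as np

def top_k_cosine(query, embeddings, k=3):
    """Indices of the k rows of `embeddings` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Negate so argsort yields descending similarity order.
    return np.argsort(-(e @ q))[:k]

# Toy float16 embeddings standing in for the stored embeddings column.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float16)
hits = top_k_cosine(np.array([1.0, 0.1], dtype=np.float16), corpus, k=2)
```

A real pipeline would push this search down to LanceDB's index rather than scanning in memory; the sketch just pins down the similarity semantics.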

klwuibm avatar Mar 20 '25 20:03 klwuibm