[Feature] Capability to chunk text for RAG systems

Open • Bytes-Explorer opened this issue 1 year ago • 1 comment

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The goal is to add a new transform that can take in the extracted text and chunk it. The input will be parquet files where every document is stored in one row. The output will be chunks, such that every chunk is stored in one row. Chunk size should be a parameter exposed to the user.

This new transform should be added along with other language modules here https://github.com/IBM/data-prep-kit/tree/dev/transforms/language
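
To make the request concrete, here is a minimal sketch of the row-per-document to row-per-chunk expansion described above. It is not the implementation from the project: plain Python dictionaries stand in for parquet rows, fixed-size character splitting is just one possible chunking strategy, and the names `chunk_text`, `chunk_rows`, `document_id`, `chunk_id`, and `contents` are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterator, List


def chunk_text(text: str, chunk_size: int) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def chunk_rows(
    rows: List[Dict[str, str]],
    chunk_size: int,
    text_column: str = "contents",  # hypothetical column name
) -> List[Dict[str, object]]:
    """Expand one-row-per-document input into one-row-per-chunk output."""
    chunked = []
    for row in rows:
        for index, piece in enumerate(chunk_text(row[text_column], chunk_size)):
            chunked.append(
                {
                    # Keep a reference to the source document for each chunk.
                    "document_id": row["document_id"],
                    "chunk_id": index,
                    text_column: piece,
                }
            )
    return chunked


# Two documents, one per row, as in the proposed input layout.
documents = [
    {"document_id": "a", "contents": "abcdefghij"},
    {"document_id": "b", "contents": "xyz"},
]
chunks = chunk_rows(documents, chunk_size=4)
for chunk in chunks:
    print(chunk)
```

With `chunk_size` exposed as a parameter, the same input yields more or fewer output rows depending on the value chosen. A real transform would likely split on token, sentence, or document-structure boundaries instead of raw character offsets.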

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

Bytes-Explorer, Jul 25 '24 06:07

Done in https://github.com/IBM/data-prep-kit/pull/461

dolfim-ibm, Jul 31 '24 20:07