data-prep-kit
[Feature] Capability to chunk text for RAG systems
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
The goal is to add a new transform that takes extracted text and splits it into chunks. The input will be Parquet files in which each document is stored in one row. The output will store each chunk in its own row. Chunk size should be a parameter exposed to the user.
This new transform should be added alongside the other language modules at https://github.com/IBM/data-prep-kit/tree/dev/transforms/language
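To make the requested row-per-document to row-per-chunk behavior concrete, here is a minimal sketch in plain Python. It is only an illustration, not the implementation from the PR: it assumes fixed-size, character-based chunks (the issue does not say what unit chunk size is measured in), it works on in-memory rows rather than Parquet files, and the column names `document_id`, `chunk_id`, and `contents` as well as the function names are hypothetical.

```python
from typing import Dict, Iterator, List


def chunk_text(text: str, chunk_size: int) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def chunk_rows(rows: List[Dict[str, str]], chunk_size: int) -> List[Dict[str, object]]:
    """Turn one-row-per-document input into one-row-per-chunk output.

    Each input row is assumed (hypothetically) to have a `document_id`
    and a `contents` column. Empty documents produce no output rows.
    """
    chunks: List[Dict[str, object]] = []
    for row in rows:
        for index, piece in enumerate(chunk_text(row["contents"], chunk_size)):
            chunks.append(
                {
                    "document_id": row["document_id"],  # link back to the source row
                    "chunk_id": index,                  # position of the chunk in the document
                    "contents": piece,
                }
            )
    return chunks


if __name__ == "__main__":
    documents = [
        {"document_id": "doc-a", "contents": "abcdefghij"},
        {"document_id": "doc-b", "contents": "xyz"},
    ]
    for chunk in chunk_rows(documents, chunk_size=4):
        print(chunk)
```

In a real transform the rows would be read from and written back to Parquet tables, and a token- or sentence-aware splitter would likely suit RAG better than raw character offsets. Keeping `document_id` and `chunk_id` on every output row is one way to let downstream steps trace a chunk back to its source document.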
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
Done in https://github.com/IBM/data-prep-kit/pull/461