data-prep-kit
Tokenizer Transform should support multiple values for input parameter doc_content_column
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Transforms/universal/tokenization
Feature
The input parameter doc_content_column is used to identify the column in a document that contains the text to be tokenized.
In one of our use cases, we have a mix of input parquet tables with different columns containing the document content (for example, some parquet tables have the document content in a column named contents and others have it in a column named text).
The current implementation supports only a single value for this parameter.
Proposed fix: allow the input to be a comma-separated list of values. The second value is used as the column name if the column named by the first value does not exist, and so on.
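A minimal sketch of the fallback logic described in the proposed fix, to make the intended behavior concrete. The helper name resolve_content_column and its signature are hypothetical and not part of the existing transform; it assumes the list of column names of the input table is available (for example from a pyarrow table's column_names).

```python
from typing import Sequence


def resolve_content_column(doc_content_column: str, available_columns: Sequence[str]) -> str:
    """Return the first candidate column name that exists in the table.

    Hypothetical helper: `doc_content_column` is the raw parameter value,
    which may be a single name or a comma-separated list of names.
    """
    # Split on commas, trim whitespace, and ignore empty entries.
    candidates = [name.strip() for name in doc_content_column.split(",") if name.strip()]

    # Candidates are tried in the order given, so earlier names take priority.
    for name in candidates:
        if name in available_columns:
            return name

    raise ValueError(
        f"None of the candidate columns {candidates} found in {list(available_columns)}"
    )
```

With this sketch, resolve_content_column("contents,text", ["id", "text"]) returns "text", and a single value such as "contents" behaves as it does today, so existing configurations would not need to change.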
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!