Tokenizer Transform should support multiple values for input parameter doc_content_column

Open santoshborse opened this issue 6 months ago • 0 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Transforms/universal/tokenization

Feature

The input parameter doc_content_column identifies the column in a document that contains the text to be tokenized.

In one of our use cases, the input parquet tables store document content in different columns (for example, some tables have it in a column named contents and others have it in a column named text).

The current implementation supports only a single value for this parameter.

Proposed fix: allow the input to be a comma-separated list of column names. The first name is used if that column exists in the table; if it does not, the second name is tried, and so on.
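
A minimal sketch of the fallback lookup described in the proposed fix, assuming the transform can see the list of column names of the table it is processing (with pyarrow this would be table.column_names). The helper name resolve_content_column is hypothetical and not part of the existing tokenization transform; raising an error when no candidate matches is also an assumption, not something the proposal specifies.

```python
from typing import Sequence


def resolve_content_column(available_columns: Sequence[str], doc_content_column: str) -> str:
    """Return the first candidate column name that exists in the table.

    `doc_content_column` is a comma-separated list such as "contents,text".
    Hypothetical helper for illustration; not part of data-prep-kit.
    """
    # Split the parameter value and drop surrounding whitespace and empty entries.
    candidates = [name.strip() for name in doc_content_column.split(",") if name.strip()]

    # Try candidates in the order given; the first one present in the table wins.
    for name in candidates:
        if name in available_columns:
            return name

    # Assumption: fail loudly when none of the candidates is present.
    raise ValueError(
        f"None of the columns {candidates} were found; "
        f"available columns: {list(available_columns)}"
    )
```

With this sketch, a table that has only a text column and a parameter value of "contents,text" resolves to text, while a single value such as "contents" behaves as it does today.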

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

santoshborse avatar Aug 07 '24 21:08 santoshborse