data-prep-kit: Tokenizer Transform should support multiple values for input parameter doc_content_column

Open santoshborse opened this issue 6 months ago • 0 comments

Search before asking

[X] I searched the issues and found no similar issues.

Component

Transforms/universal/tokenization

Feature

The input parameter doc_content_column identifies the column in a document that contains the text to be tokenized.

In one of our use cases, we have a mix of input parquet tables that store document content in different columns (for example, some parquet tables have the document content in a column named contents and others have it in a column named text).

The current implementation supports only a single value for doc_content_column.

Proposed fix: allow the input to be a comma-separated list of column names. If the first column name does not exist in the table, the second is used, and so on.
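A minimal sketch of the fallback logic described above, to make the proposal concrete. The function name resolve_content_column and its signature are hypothetical and not part of the existing transform. It assumes the column names of the input table are available as a list of strings.

```python
from typing import Sequence


def resolve_content_column(doc_content_column: str, available_columns: Sequence[str]) -> str:
    """Return the first candidate column name that exists in the table.

    Hypothetical helper for illustration only; not the transform's current API.
    `doc_content_column` is a comma-separated list such as "contents,text".
    """
    # Split on commas, trim surrounding whitespace, and drop empty entries.
    candidates = [name.strip() for name in doc_content_column.split(",") if name.strip()]
    # Candidates are tried in the order given, so earlier names take priority.
    for name in candidates:
        if name in available_columns:
            return name
    raise KeyError(
        f"None of the columns {candidates} were found in {list(available_columns)}"
    )


# A table that stores its text in "text" falls back to the second candidate.
column = resolve_content_column("contents,text", ["document_id", "text"])
print(column)
```

With a pyarrow table, the list of available columns could be taken from table.column_names. A single value such as "contents" behaves as it does today, so the change would stay backward compatible under this sketch.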

Are you willing to submit a PR?

[X] Yes I am willing to submit a PR!

Aug 07 '24 21:08 santoshborse
