data-integration-library

Adding text extractor for extracting unstructured output

Open mkumar1984 opened this issue 3 years ago • 0 comments

Currently, DIL supports many structured formats such as CSV, JSON, and Avro, as well as many compression formats. Unstructured text is supported only through FileDumpExtractor, which dumps its output directly to HDFS, so that output cannot be passed to any converter. DIL should support a text extractor that can extract output in any format and pass it to a converter for further ETL, rather than pushing it straight to HDFS. This is useful when we want to fetch the output of a URL and then apply custom parsing to get the required result.
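To illustrate the requested flow, here is a minimal, language-neutral sketch in Python. It is not DIL's API: `extract_text`, `key_value_converter`, and `run_pipeline` are hypothetical names, and the `key=value` input format is an assumption chosen only to show raw text being handed to a custom converter instead of being written straight to storage.

```python
import io
from typing import Callable, Dict, Iterable, Iterator, List, TextIO


def extract_text(source: TextIO) -> Iterator[str]:
    """Hypothetical text extractor: emit each raw line as one record."""
    for line in source:
        yield line.rstrip("\n")


def key_value_converter(records: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Hypothetical custom converter: parse 'key=value' lines into dicts."""
    for record in records:
        if "=" not in record:
            continue  # skip lines this parser does not understand
        key, value = record.split("=", 1)
        yield {"key": key.strip(), "value": value.strip()}


def run_pipeline(
    source: TextIO,
    converter: Callable[[Iterable[str]], Iterator[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Feed extracted text records into the converter and collect the rows."""
    return list(converter(extract_text(source)))


# Stand-in for the body of a URL response (assumed format, no network call).
response_body = io.StringIO("status = ok\nnot a pair\ncount=3\n")
rows = run_pipeline(response_body, key_value_converter)
```

The point of the sketch is the separation of concerns: the extractor only produces text records, and any parsing logic lives in a converter that can be swapped out per job.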

mkumar1984, Oct 05 '21 05:10