seatunnel
[Feature][File] Add word parser for RAG support
Purpose of this pull request
- Add and refine Word (.docx) reading via WordReadStrategy.
- Output schema (10 fields): element_id, element_type, text_content, font_style, underline_style, font_size, font_family, text_color, alignment, hyperlink_url.
- Process document elements in natural order (paragraphs and tables). Footnote text is included within the referencing paragraph’s text_content.
- Due to Apache POI limitations, the minimal extractable unit is a paragraph. Run-level styles are aggregated at the paragraph level:
  - font_style: NORMAL/BOLD/ITALIC/BOLD_ITALIC
  - underline_style: null or concrete style (e.g., SINGLE)
  - font_size, font_family: first encountered values or null
  - text_color: defaults to "000000" when absent
  - hyperlink_url: all links in a paragraph concatenated with commas
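
For reviewers, the paragraph-level aggregation rules above can be sketched roughly as follows. This is an illustrative sketch only, not the code in WordReadStrategy: the `Run` and `ParagraphStyle` records are hypothetical stand-ins rather than Apache POI types, and several choices are assumptions the description does not spell out (bold or italic is set if any run has it, the first non-null underline wins, and hyperlink_url is null when the paragraph has no links).

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphStyleSketch {

    /** Hypothetical stand-in for one styled run of text; NOT the POI XWPFRun API. */
    record Run(boolean bold, boolean italic, String underline,
               Integer fontSize, String fontFamily, String color, String hyperlink) {}

    /** Hypothetical holder for the paragraph-level style fields of the output schema. */
    record ParagraphStyle(String fontStyle, String underlineStyle, Integer fontSize,
                          String fontFamily, String textColor, String hyperlinkUrl) {}

    static ParagraphStyle aggregate(List<Run> runs) {
        boolean bold = false;
        boolean italic = false;
        String underline = null;
        Integer size = null;
        String family = null;
        String color = null;
        List<String> links = new ArrayList<>();

        for (Run r : runs) {
            // Assumption: a paragraph counts as bold/italic if any run is.
            bold |= r.bold();
            italic |= r.italic();
            // First encountered non-null value wins for these fields.
            if (underline == null) underline = r.underline();
            if (size == null) size = r.fontSize();
            if (family == null) family = r.fontFamily();
            if (color == null) color = r.color();
            // Collect every hyperlink in the paragraph, in order.
            if (r.hyperlink() != null) links.add(r.hyperlink());
        }

        String fontStyle = bold && italic ? "BOLD_ITALIC"
                : bold ? "BOLD"
                : italic ? "ITALIC"
                : "NORMAL";

        return new ParagraphStyle(
                fontStyle,
                underline,
                size,
                family,
                color == null ? "000000" : color,          // default when absent
                links.isEmpty() ? null : String.join(",", links));
    }
}
```

With two runs, one bold with a link and one italic with a SINGLE underline, size 12, family Arial and a second link, this sketch yields font_style BOLD_ITALIC, text_color "000000", and both URLs joined by a comma.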
Does this PR introduce any user-facing change?
Yes. The Word reader’s output schema is simplified to 10 fields above. Some formatting attributes now return null when not explicitly present; text_color defaults to "000000". Elements are emitted in document order, and hyperlinks are aggregated per paragraph.
How was this patch tested?
- Added WordReadStrategyTest to validate all 10 fields against a sample .docx.
- Verified:
  - Paragraph rows contain text and aggregated formatting/links.
  - Table rows produce a single text blob per table; formatting-related fields are null.
Check list
- [x] If your PR adds any new Jar binary package, please add a License Notice according to the New License Guide
- [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
- [ ] If you are contributing the connector code, please check that the following files are updated:
  - Update plugin-mapping.properties and add new connector information in it
  - Update the pom file of seatunnel-dist
  - Add ci label in label-scope-conf
  - Add e2e testcase in seatunnel-e2e
  - Update connector plugin_config
Related Issue
#9715
cc @liugddx could you help to review it?
Please fix the test case issue.
@joonseolee hi, CI failed, please help solve it