[Feature][File] Add word parser for RAG support

Open joonseolee opened this issue 3 months ago • 3 comments

Purpose of this pull request

  • Add and refine Word (.docx) reading via WordReadStrategy.
  • Output schema (10 fields): element_id, element_type, text_content, font_style, underline_style, font_size, font_family, text_color, alignment, hyperlink_url.
  • Process document elements in natural order (paragraphs and tables). Footnote text is included within the referencing paragraph’s text_content.
  • Due to Apache POI limitations, the minimal extractable unit is a paragraph. Run-level styles are aggregated at the paragraph level:
    • font_style: NORMAL/BOLD/ITALIC/BOLD_ITALIC
    • underline_style: null or concrete style (e.g., SINGLE)
    • font_size, font_family: first encountered values or null
    • text_color: defaults to "000000" when absent
    • hyperlink_url: all links in a paragraph concatenated with commas
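
The aggregation rules above could be sketched roughly as follows. This is a minimal illustration in plain Java, not the PR's actual WordReadStrategy code: the Run and Style classes are hypothetical stand-ins for Apache POI objects, and several details are assumptions (any bold or italic run marks the whole paragraph, underline and color take the first non-null value, and hyperlink_url is null when a paragraph has no links).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphStyleSketch {

    // Hypothetical stand-in for one text run; the real reader works with Apache POI objects.
    static final class Run {
        final boolean bold;
        final boolean italic;
        final String underline;  // e.g. "SINGLE", or null when not underlined
        final Integer fontSize;
        final String fontFamily;
        final String color;      // hex RGB such as "FF0000", or null
        final String hyperlink;  // null when the run is not a link

        Run(boolean bold, boolean italic, String underline, Integer fontSize,
            String fontFamily, String color, String hyperlink) {
            this.bold = bold;
            this.italic = italic;
            this.underline = underline;
            this.fontSize = fontSize;
            this.fontFamily = fontFamily;
            this.color = color;
            this.hyperlink = hyperlink;
        }
    }

    // Hypothetical holder for the paragraph-level formatting fields.
    static final class Style {
        final String fontStyle;
        final String underlineStyle;
        final Integer fontSize;
        final String fontFamily;
        final String textColor;
        final String hyperlinkUrl;

        Style(String fontStyle, String underlineStyle, Integer fontSize,
              String fontFamily, String textColor, String hyperlinkUrl) {
            this.fontStyle = fontStyle;
            this.underlineStyle = underlineStyle;
            this.fontSize = fontSize;
            this.fontFamily = fontFamily;
            this.textColor = textColor;
            this.hyperlinkUrl = hyperlinkUrl;
        }
    }

    // Keeps the first value seen: once current is non-null it is never replaced.
    static <T> T firstNonNull(T current, T candidate) {
        return current != null ? current : candidate;
    }

    static Style aggregate(List<Run> runs) {
        boolean bold = false;
        boolean italic = false;
        String underline = null;
        Integer fontSize = null;
        String fontFamily = null;
        String color = null;
        List<String> links = new ArrayList<>();

        for (Run run : runs) {
            bold |= run.bold;      // assumption: one bold run is enough
            italic |= run.italic;  // assumption: one italic run is enough
            underline = firstNonNull(underline, run.underline);
            fontSize = firstNonNull(fontSize, run.fontSize);
            fontFamily = firstNonNull(fontFamily, run.fontFamily);
            color = firstNonNull(color, run.color);
            if (run.hyperlink != null) {
                links.add(run.hyperlink);
            }
        }

        String fontStyle = bold && italic ? "BOLD_ITALIC"
                : bold ? "BOLD"
                : italic ? "ITALIC"
                : "NORMAL";
        String textColor = color != null ? color : "000000";  // default from the PR description
        String hyperlinkUrl = links.isEmpty() ? null : String.join(",", links);
        return new Style(fontStyle, underline, fontSize, fontFamily, textColor, hyperlinkUrl);
    }

    public static void main(String[] args) {
        Style style = aggregate(Arrays.asList(
                new Run(true, false, null, null, null, null, "https://a.example"),
                new Run(false, true, "SINGLE", 12, "Arial", null, "https://b.example")));
        System.out.println(style.fontStyle + " " + style.underlineStyle + " " + style.fontSize
                + " " + style.fontFamily + " " + style.textColor + " " + style.hyperlinkUrl);
    }
}
```

How mixed formatting should collapse (for example, a single bold word in a long paragraph) is a design choice; the sketch picks the simplest rule and may differ from what the PR implements.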

Does this PR introduce any user-facing change?

Yes. The Word reader’s output schema is simplified to the 10 fields listed above. Formatting attributes that are not explicitly set now return null, except text_color, which defaults to "000000". Elements are emitted in document order, and hyperlinks are aggregated per paragraph.

How was this patch tested?

  • Added WordReadStrategyTest to validate all 10 fields against a sample .docx.
  • Verified:
    • Paragraph rows contain text and aggregated formatting/links.
    • Table rows produce a single text blob per table; formatting-related fields are null.
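
For illustration, the table-row shape verified above might look like the sketch below. It is not taken from WordReadStrategyTest. The field order follows the schema in this description, while the "table" type label, the tab and newline separators, and treating text_color as one of the null formatting fields are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordRowSketch {

    // Field order taken from the output schema in this PR description.
    static final String[] FIELDS = {
        "element_id", "element_type", "text_content", "font_style", "underline_style",
        "font_size", "font_family", "text_color", "alignment", "hyperlink_url"
    };

    // Flattens a table (rows of cell texts) into one output row with a single text blob.
    // Assumptions: cells are joined with tabs, table rows with newlines, type label is "table".
    static Object[] tableRow(String elementId, List<List<String>> table) {
        List<String> lines = new ArrayList<>();
        for (List<String> cells : table) {
            lines.add(String.join("\t", cells));
        }
        Object[] row = new Object[FIELDS.length];  // every entry starts as null
        row[0] = elementId;
        row[1] = "table";
        row[2] = String.join("\n", lines);
        // Indexes 3..9 (the formatting-related fields) are intentionally left null.
        return row;
    }

    public static void main(String[] args) {
        Object[] row = tableRow("t-1", Arrays.asList(
                Arrays.asList("name", "age"),
                Arrays.asList("Ann", "30")));
        for (int i = 0; i < FIELDS.length; i++) {
            System.out.println(FIELDS[i] + " = " + row[i]);
        }
    }
}
```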

Check list

  • [x] If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
  • [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
  • [ ] If you are contributing the connector code, please check that the following files are updated:
    1. Update plugin-mapping.properties and add new connector information in it
    2. Update the pom file of seatunnel-dist
    3. Add ci label in label-scope-conf
    4. Add e2e testcase in seatunnel-e2e
    5. Update connector plugin_config

Related Issue

#9715

joonseolee • Sep 26 '25 07:09

cc @liugddx could you help to review it?

Hisoka-X • Sep 26 '25 14:09

Please fix the test case issue.

liugddx • Sep 29 '25 05:09

@joonseolee Hi, CI failed. Could you please help fix it?

davidzollo • Dec 04 '25 03:12