Feature/raptor auto disable structured data

Open hsparks-codes opened this issue 2 weeks ago • 3 comments

What problem does this PR solve?

Feature: This PR implements automatic Raptor disabling for structured data files to address issue #11653.

Issue: PDFs with HTML tables were still being processed by Raptor even after the auto-disable feature was merged.

Background: @ahmadshakil reported that when uploading Fbr_IncomeTaxOrdinance_2001-amended-upto30.06.2024.pdf, the HTML-based tables were still being sent to Raptor. Running docker logs ... | grep -c 'Skipping Raptor for document' returned 0, meaning the skip message was never logged.

Root Cause: The original implementation only checked:

  • File extension (.xlsx, .csv, etc.)
  • Parser ID (table)
  • Parser config (html4excel: true)

However, when PDFs are parsed with the naive parser, tables are extracted and converted to <table> HTML tags in the chunk content. This content-level table extraction was not being detected.
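To illustrate the gap, here is a minimal sketch of a metadata-only check built from the three signals listed above. The function name `is_structured_by_metadata`, the extension set, and the sample file names are assumptions made for illustration, not the actual RAGFlow code.

```python
from pathlib import Path

# Assumed for illustration: the PR text lists ".xlsx, .csv, etc." without the full set.
STRUCTURED_EXTENSIONS = {".xlsx", ".csv"}


def is_structured_by_metadata(filename: str, parser_id: str, parser_config: dict) -> bool:
    """Hypothetical metadata-only check: file extension, parser ID, parser config."""
    if Path(filename).suffix.lower() in STRUCTURED_EXTENSIONS:
        return True
    if parser_id == "table":
        return True
    return bool(parser_config.get("html4excel", False))


# A PDF handled by the naive parser matches none of the three signals,
# even if its chunks later turn out to contain <table> markup.
print(is_structured_by_metadata("report.pdf", "naive", {}))   # False
print(is_structured_by_metadata("data.csv", "naive", {}))     # True
```

Because none of these signals look at chunk content, a table-heavy PDF passes straight through to Raptor.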

Solution: Added content-based detection that analyzes actual chunk content for HTML table tags:

  • New function contains_html_table() detects <table> tags in content
  • New function analyze_chunks_for_tables() calculates table percentage across chunks
  • New function should_skip_raptor_for_chunks() skips Raptor if ≥30% of chunks contain HTML tables
  • Detection happens after chunks are loaded, before Raptor processing begins
  • Threshold is configurable via TABLE_CONTENT_THRESHOLD
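
The three functions named above could look roughly like the sketch below. The function names and the 30% threshold come from this description; the signatures, the regular expression, and the choice to treat chunks as plain strings are assumptions, so the code in rag/utils/raptor_utils.py may differ.

```python
import re

TABLE_CONTENT_THRESHOLD = 0.3  # 30%, as stated in the PR description

# Matches an opening <table> tag, with or without attributes, case-insensitively.
_TABLE_TAG = re.compile(r"<table\b", re.IGNORECASE)


def contains_html_table(content: str) -> bool:
    """Return True if the chunk text contains an HTML <table> tag."""
    return bool(content) and _TABLE_TAG.search(content) is not None


def analyze_chunks_for_tables(chunks: list[str]) -> float:
    """Return the fraction (0.0 to 1.0) of chunks that contain an HTML table."""
    if not chunks:
        return 0.0
    with_tables = sum(1 for chunk in chunks if contains_html_table(chunk))
    return with_tables / len(chunks)


def should_skip_raptor_for_chunks(
    chunks: list[str], threshold: float = TABLE_CONTENT_THRESHOLD
) -> bool:
    """Skip Raptor when the share of table chunks reaches the threshold."""
    return analyze_chunks_for_tables(chunks) >= threshold


table_chunk = "<table><tr><td>Rate</td><td>5%</td></tr></table>"
print(should_skip_raptor_for_chunks([table_chunk, "text", "text"]))          # True (1 of 3)
print(should_skip_raptor_for_chunks([table_chunk, "text", "text", "text"]))  # False (1 of 4)
```

In this sketch an empty chunk list yields a table share of 0.0, so Raptor is not skipped when there is nothing to analyze.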

New log output:

Skipping Raptor for document {doc_id}: Content contains X% HTML tables (threshold: 30%) - Raptor auto-disabled
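
To confirm the message appears, the `grep -c` check from the background section can be mirrored in Python. This is a small sketch: the helper name `count_skip_lines` is hypothetical, and the sample log text, including the document ID and percentage, is made up to match the format above.

```python
MARKER = "Skipping Raptor for document"


def count_skip_lines(log_text: str, marker: str = MARKER) -> int:
    """Count log lines containing the marker, similar to `grep -c`."""
    return sum(1 for line in log_text.splitlines() if marker in line)


# Made-up sample log; the doc ID and percentage are placeholders.
sample_log = "\n".join([
    "INFO task started",
    "INFO Skipping Raptor for document doc-1: Content contains 45% HTML tables "
    "(threshold: 30%) - Raptor auto-disabled",
    "INFO task finished",
])
print(count_skip_lines(sample_log))  # 1
```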

Tests: Added 21 new tests for content-based detection (65 total, all passing), including a test case simulating the reported PDF scenario.

Type of change

  • [x] Bug Fix (non-breaking change which fixes an issue)
  • [ ] New Feature (non-breaking change which adds functionality)
  • [ ] Documentation Update
  • [ ] Refactoring
  • [ ] Performance Improvement
  • [ ] Other (please describe):

hsparks-codes, Dec 05 '25 08:12

@ahmadshakil Can you check the PR?

hsparks-codes, Dec 05 '25 08:12

@Magicbook1108 @KevinHuSh can you please?

ahmadshakil, Dec 05 '25 09:12

@KevinHuSh @TeslaZY @cike8899 Would you please check the PR and give me your feedback?

hsparks-codes, Dec 11 '25 11:12

clustor_content.txt

@hsparks-codes Is it OK that the attached cluster was passed to Raptor? It still contains a table, but was it perhaps ignored because the table is comparatively small?

ahmadshakil, Dec 16 '25 06:12

Yes, that's the intended behavior! The current implementation uses a 30% threshold: if fewer than 30% of chunks contain HTML tables, Raptor will still process the content.

The rationale is that documents with occasional tables (like reports with a few data tables mixed with narrative text) can still benefit from Raptor's hierarchical clustering. We only skip Raptor when tables dominate the content (30% or more of chunks), where summarization would produce poor results.

If you'd like to adjust this threshold for your use case, it's configurable via TABLE_CONTENT_THRESHOLD in rag/utils/raptor_utils.py. Would you prefer a different threshold value?
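
As a rough illustration of what changing the threshold does, the sketch below applies the same "share of table chunks at or above the threshold" rule to a made-up document with 8 table chunks out of 40 (20%). The helper names and the chunk counts are hypothetical.

```python
def table_share(chunks_with_tables: int, total_chunks: int) -> float:
    """Fraction of chunks that contain an HTML table."""
    return chunks_with_tables / total_chunks if total_chunks else 0.0


def skips_raptor(chunks_with_tables: int, total_chunks: int, threshold: float = 0.3) -> bool:
    """Raptor is skipped when the table share reaches the threshold."""
    return table_share(chunks_with_tables, total_chunks) >= threshold


# Hypothetical document: 8 of 40 chunks contain tables, a 20% share.
for threshold in (0.15, 0.3, 0.5):
    print(threshold, skips_raptor(8, 40, threshold))
# 0.15 True
# 0.3 False
# 0.5 False
```

With the default of 30% this document still goes through Raptor; lowering the threshold to 15% would skip it.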

hsparks-codes, Dec 16 '25 08:12