ragflow Feature/raptor auto disable structured data

What problem does this PR solve?

Feature: This PR implements automatic Raptor disabling for structured data files to address issue #11653.

Issue: PDFs with HTML tables were still being processed by Raptor even after the auto-disable feature was merged.

Background: @ahmadshakil reported that when uploading Fbr_IncomeTaxOrdinance_2001-amended-upto30.06.2024.pdf, the HTML-based tables were still being sent to Raptor. Running docker logs ... | grep -c 'Skipping Raptor for document' returned 0 results.

Root Cause: The original implementation only checked:

File extension (.xlsx, .csv, etc.)
Parser ID (table)
Parser config (html4excel: true)

However, when PDFs are parsed with the naive parser, tables are extracted and converted to <table> HTML tags in the chunk content. This content-level table extraction was not being detected.
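To make the gap concrete, here is a minimal sketch of a metadata-only check built from the three criteria listed above. The function name `is_structured_by_metadata`, the contents of `STRUCTURED_EXTENSIONS`, and the argument shapes are hypothetical and not taken from the PR; the real code may be organized differently.

```python
import os

# Hypothetical set: the description lists ".xlsx, .csv, etc." without
# enumerating the remaining extensions, so only those two appear here.
STRUCTURED_EXTENSIONS = {".xlsx", ".csv"}


def is_structured_by_metadata(filename, parser_id, parser_config):
    """Return True if file metadata alone marks the document as structured."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in STRUCTURED_EXTENSIONS:
        return True
    if parser_id == "table":
        return True
    return bool(parser_config.get("html4excel", False))


# A PDF parsed with the naive parser matches none of the three criteria,
# even if its extracted chunks are full of <table> markup.
pdf_is_structured = is_structured_by_metadata("report.pdf", "naive", {})
```

Because none of these inputs look at chunk content, a table-heavy PDF is indistinguishable from a purely narrative one at this stage.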

Solution: Added content-based detection that analyzes actual chunk content for HTML table tags:

New function contains_html_table() detects <table> tags in content
New function analyze_chunks_for_tables() calculates table percentage across chunks
New function should_skip_raptor_for_chunks() skips Raptor if ≥30% of chunks contain HTML tables
Detection happens after chunks are loaded, before Raptor processing begins
Threshold is configurable via TABLE_CONTENT_THRESHOLD
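The content-based check described above can be sketched roughly as follows. The three function names and `TABLE_CONTENT_THRESHOLD` come from this description, but the signatures, return shapes, and the regular expression are assumptions for illustration and may not match the code in rag/utils/raptor_utils.py.

```python
import re

# Assumed representation: a fraction of chunks (0.30 == 30%).
TABLE_CONTENT_THRESHOLD = 0.30

# Matches an opening <table> tag, with or without attributes, in any case.
_TABLE_TAG = re.compile(r"<table\b[^>]*>", re.IGNORECASE)


def contains_html_table(content):
    """Return True if the chunk text contains an opening <table> tag."""
    return bool(content) and _TABLE_TAG.search(content) is not None


def analyze_chunks_for_tables(chunks):
    """Return (chunks_with_tables, total_chunks, fraction_with_tables)."""
    total = len(chunks)
    if total == 0:
        return 0, 0, 0.0
    with_tables = sum(1 for chunk in chunks if contains_html_table(chunk))
    return with_tables, total, with_tables / total


def should_skip_raptor_for_chunks(chunks, threshold=TABLE_CONTENT_THRESHOLD):
    """Return True when the share of table chunks meets the threshold."""
    _, _, fraction = analyze_chunks_for_tables(chunks)
    return fraction >= threshold


table_chunk = "<table><tr><td>Rate</td><td>15%</td></tr></table>"
mostly_text = ["intro", "body", "more body", table_chunk]    # 1 of 4 chunks
mostly_tables = ["intro", table_chunk, table_chunk, "note"]  # 2 of 4 chunks
```

In this sketch `should_skip_raptor_for_chunks(mostly_text)` is False (25% is below the threshold) and `should_skip_raptor_for_chunks(mostly_tables)` is True (50%). An empty chunk list is treated as 0% so that Raptor is not skipped by accident.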

New log output:

Skipping Raptor for document {doc_id}: Content contains X% HTML tables (threshold: 30%) - Raptor auto-disabled

Tests: Added 21 new tests for content-based detection (65 total, all passing), including a test case simulating the reported PDF scenario.

Type of change

[x] Bug Fix (non-breaking change which fixes an issue)
[ ] New Feature (non-breaking change which adds functionality)
[ ] Documentation Update
[ ] Refactoring
[ ] Performance Improvement
[ ] Other (please describe):

Dec 05 '25 08:12 hsparks-codes

@ahmadshakil Can you check the PR?

Dec 05 '25 08:12 hsparks-codes

@Magicbook1108 @KevinHuSh can you please?

Dec 05 '25 09:12 ahmadshakil

@KevinHuSh @TeslaZY @cike8899 Would you please check the PR and give me your feedback?

Dec 11 '25 11:12 hsparks-codes

clustor_content.txt

@hsparks-codes Is it OK that the attached cluster was passed to Raptor? It still has a table, but maybe it was ignored because it's comparatively small?

Dec 16 '25 06:12 ahmadshakil

Yes, that's the intended behavior! The current implementation uses a 30% threshold: if fewer than 30% of chunks contain HTML tables, Raptor will still process the content.

The rationale is that documents with occasional tables (like reports with a few data tables mixed with narrative text) can still benefit from Raptor's hierarchical clustering. We only skip Raptor when tables dominate the content (30%+), where summarization would produce poor results.

If you'd like to adjust this threshold for your use case, it's configurable via TABLE_CONTENT_THRESHOLD in rag/utils/raptor_utils.py. Would you prefer a different threshold value?

Dec 16 '25 08:12 hsparks-codes