Feature/raptor auto disable structured data
What problem does this PR solve?
Feature: This PR implements automatic Raptor disabling for structured data files to address issue #11653.
Issue: PDFs with HTML tables were still being processed by Raptor even after the auto-disable feature was merged.
Background: @ahmadshakil reported that when uploading Fbr_IncomeTaxOrdinance_2001-amended-upto30.06.2024.pdf, the HTML-based tables were still being sent to Raptor. Running docker logs ... | grep -c 'Skipping Raptor for document' returned 0, so the skip message was never logged.
Root Cause: The original implementation only checked:
- File extension (`.xlsx`, `.csv`, etc.)
- Parser ID (`table`)
- Parser config (`html4excel: true`)
However, when PDFs are parsed with the naive parser, tables are extracted and converted to <table> HTML tags in the chunk content. This content-level table extraction was not being detected.
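To illustrate the gap, here is a minimal sketch of a metadata-only check covering the three conditions listed above. The function name `should_skip_by_metadata`, its signature, and the `STRUCTURED_EXTENSIONS` set are hypothetical and are not taken from the PR. The extension set contains only the two extensions named above, so it is incomplete.

```python
from pathlib import Path

# Hypothetical set: the description names only ".xlsx, .csv, etc.",
# so the full list of structured extensions is not known here.
STRUCTURED_EXTENSIONS = {".xlsx", ".csv"}


def should_skip_by_metadata(filename: str, parser_id: str, parser_config: dict) -> bool:
    """Return True if file metadata alone marks the document as structured data."""
    # 1. File extension check (case-insensitive).
    if Path(filename).suffix.lower() in STRUCTURED_EXTENSIONS:
        return True
    # 2. Parser ID check.
    if parser_id == "table":
        return True
    # 3. Parser config check.
    return bool(parser_config.get("html4excel", False))
```

A PDF parsed with the naive parser passes none of these checks, so a check of this shape never sees the tables inside its chunks, whatever they contain.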
Solution: Added content-based detection that analyzes actual chunk content for HTML table tags:
- New function `contains_html_table()` detects `<table>` tags in content
- New function `analyze_chunks_for_tables()` calculates table percentage across chunks
- New function `should_skip_raptor_for_chunks()` skips Raptor if ≥30% of chunks contain HTML tables
- Detection happens after chunks are loaded, before Raptor processing begins
- Threshold is configurable via `TABLE_CONTENT_THRESHOLD`
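The content-based detection described above could look roughly like the sketch below. The function names and `TABLE_CONTENT_THRESHOLD` come from this PR description. The signatures, the return types, the regex, and the assumption that chunks are plain strings are illustrative guesses and may differ from the merged code.

```python
import re
from typing import Sequence

# Assumed representation: 0.3 corresponds to the 30% threshold described above.
TABLE_CONTENT_THRESHOLD = 0.3

# Matches an opening <table> tag, with or without attributes, in any letter case.
_TABLE_TAG = re.compile(r"<table\b", re.IGNORECASE)


def contains_html_table(content: str) -> bool:
    """Return True if the chunk text contains an opening <table> tag."""
    return bool(content) and _TABLE_TAG.search(content) is not None


def analyze_chunks_for_tables(chunks: Sequence[str]) -> float:
    """Return the fraction (0.0 to 1.0) of chunks that contain an HTML table."""
    if not chunks:
        return 0.0
    with_tables = sum(1 for chunk in chunks if contains_html_table(chunk))
    return with_tables / len(chunks)


def should_skip_raptor_for_chunks(
    chunks: Sequence[str], threshold: float = TABLE_CONTENT_THRESHOLD
) -> bool:
    """Return True if the share of table chunks meets or exceeds the threshold."""
    return analyze_chunks_for_tables(chunks) >= threshold
```

With this sketch, a document where 3 of 10 chunks contain a `<table>` tag is skipped, while one where only 1 of 4 does is still passed to Raptor.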
New log output:
Skipping Raptor for document {doc_id}: Content contains X% HTML tables (threshold: 30%) - Raptor auto-disabled
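As an illustration of how this message could be assembled, the sketch below formats it from a table ratio and a threshold. The helper `format_skip_message` is hypothetical, and rounding the percentage to a whole number is an assumption. The actual code may format X differently.

```python
def format_skip_message(doc_id: str, table_ratio: float, threshold: float = 0.3) -> str:
    """Build the skip log line; ratios are fractions in the range 0.0 to 1.0."""
    # The ".0%" format multiplies by 100 and rounds to a whole percentage.
    return (
        f"Skipping Raptor for document {doc_id}: Content contains "
        f"{table_ratio:.0%} HTML tables (threshold: {threshold:.0%}) - Raptor auto-disabled"
    )
```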
Tests: Added 21 new tests for content-based detection (65 total, all passing), including a test case simulating the reported PDF scenario.
Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
@ahmadshakil Can you check the PR?
@Magicbook1108 @KevinHuSh Could you please take a look?
@KevinHuSh @TeslaZY @cike8899 Would you please check the PR and share your feedback?
@hsparks-codes Is it OK that the attached cluster was passed to Raptor? It still has a table, but maybe it was ignored because it is comparatively small?
Yes, that's the intended behavior! The current implementation uses a 30% threshold - if less than 30% of chunks contain HTML tables, Raptor will still process the content.
The rationale is that documents with occasional tables (like reports with a few data tables mixed with narrative text) can still benefit from Raptor's hierarchical clustering. We only skip Raptor when tables dominate the content (30% or more of chunks), where summarization would produce poor results.
If you'd like to adjust this threshold for your use case, it's configurable via TABLE_CONTENT_THRESHOLD in rag/utils/raptor_utils.py. Would you prefer a different threshold value?