Irrelevant Chunks
Even requesting a knowledge crawl at level one on a URL results in many useless chunks, including things like headers, footers, side content, forum/discussion links, etc.
The crawl isn't smart and just grabs everything, resulting in sometimes hundreds of useless chunks.
My current workaround has been to "print" relevant pages to PDFs, combine all the PDFs into one, and convert the combined PDF with Docling into a .md to use with Archon.
Below is a chunk (with a broken link at the end, due to chunking) that just shows links to Tauri releases. It provides no information relevant to a coding project.
0-beta.2 ](https://tauri.app/release/tauri/v2.0.0-beta.2/) * [ 2.0.0-beta.1 ](https://tauri.app/release/tauri/v2.0.0-beta.1/) * [ 2.0.0-beta.0 ](https://tauri.app/release/tauri/v2.0.0-beta.0/) * [ 2.0.0-alpha.21 ](https://tauri.app/release/tauri/v2.0.0-alpha.21/) * [ 2.0.0-alpha.20 ](https://tauri.app/release/tauri/v2.0.0-alpha.20/) * [ 2.0.0-alpha.19 ](https://tauri.app/release/tauri/v2.0.0-alpha.19/) * [ 2.0.0-alpha.18 ](https://tauri.app/release/tauri/v2.0.0-alpha.18/) * [ 2.0.0-alpha.17 ](https://tauri.app/release/tauri/v2.0.0-alpha.17/) * [ 2.0.0-alpha.16 ](https://tauri.app/release/tauri/v2.0.0-alpha.16/) * [ 2.0.0-alpha.15 ](https://tauri.app/release/tauri/v2.0.0-alpha.15/) * [ 2.0.0-alpha.14 ](https://tauri.app/release/tauri/v2.0.0-alpha.14/) * [ 2.0.0-alpha.13 ](https://tauri.app/release/tauri/v2.0.0-alpha.13/) * [ 2.0.0-alpha.12 ](https://tauri.app/release/tauri/v2.0.0-alpha.12/) * [ 2.0.0-alpha.11 ](https://tauri.app/release/tauri/v2.0.0-alpha.11/) * [ 2.0.0-alpha.10 ](https://tauri.app/release/tauri/v2.0.0-alpha.10/) * [ 2.0.0-alpha.9 ](https://tauri.app/release/tauri/v2.0.0-alpha.9/) * [ 2.0.0-alpha.8 ](https://tauri.app/release/tauri/v2.0.0-alpha.8/) * [ 2.0.0-alpha.7 ](https://tauri.app/release/tauri/v2.0.0-alpha.7/) * [ 2.0.0-alpha.6 ](https://tauri.app/release/tauri/v2.0.0-alpha.6/) * [ 2.0.0-alpha.5 ](https://tauri.app/release/tauri/v2.0.0-alpha.5/) * [ 2.0.0-alpha.4 ](https://tauri.app/release/tauri/v2.0.0-alpha.4/) * [ 2.0.0-alpha.3 ](https://tauri.app/release/tauri/v2.0.0-alpha.3/) * [ 2.0.0-alpha.2 ](https://tauri.app/release/tauri/v2.0.0-alpha.2/) * [ 2.0.0-alpha.1 ](https://tauri.app/release/tauri/v2.0.0-alpha.1/) * [ 2.0.0-alpha.0 ](https://tauri.app/release/tauri/v2.0.0-alpha.0/) * [ 1.6.0 ](https://tauri.app/release/tauri/v1.6.0/) * [ 1.5.4 ](https://tauri.app/release/tauri/v1.5.4/) * [ 1.5.3 ](https://tauri.app/release/tauri/v1.5.3/) * [ 1.5.2 ](https://tauri.app/release/tauri/v1.5.2/) * [ 1.5.1 ](https://tauri.app/release/tauri/v1.5.1/) * [ 1.5.0 ](https://tauri.app/release/tauri/v1.5.0/) * [ 1.4.1 ](https://tauri.app/release/tauri/v1.4.1/) * [ 1.4.0 ](https://tauri.app/release/tauri/v1.4.0/) * [ 1.3.0 ](https://tauri.app/release/tauri/v1.3.0/) * [ 1.2.5 ](https://tauri.app/release/tauri/v1.2.5/) * [ 1.2.4 ](https://tauri.app/release/tauri/v1.2.4/) * [ 1.2.3 ](https://tauri.app/release/tauri/v1.2.3/) * [ 1.2.2 ](https://tauri.app/release/tauri/v1.2.2/) * [ 1.2.1 ](https://tauri.app/release/tauri/v1.2.1/) * [ 1.2.0 ](https://tauri.app/release/tauri/v1.2.0/) * [ 1.1.4 ](https://tauri.app/release/tauri/v1.1.4/) * [ 1.1.3 ](https://tauri.app/release/tauri/v1.1.3/) * [ 1.1.2 ](https://tauri.app/release/tauri/v1.1.2/) * [ 1.1.1 ](https://tauri.app/release/tauri/v1.1.1/) * [ 1.1.0 ](https://tauri.app/release/tauri/v1.1.0/) * [ 1.0.9 ](https://tauri.app/release/tauri/v1.0.9/) * [ 1.0.8 ](https://tauri.app/release/tauri/v1.0.8/) * [ 1.0.7 ](https://tauri.app/release/tauri/v1.0.7/) * [ 1.0.6 ](https://tauri.app/release/tauri/v1.0.6/) * [ 1.0.5 ](https://tauri.app/release/tauri/v1.0.5/) * [ 1.0.4 ](https://tauri.app/release/tauri/v1.0.4/) * [ 1.0.3 ](https://tauri.app/release/tauri/v1.0.3/) * [ 1.0.2 ](https://tauri.app/release/tauri/v1.0.2/) * [ 1.0.1 ](https://tauri.app/release/tauri/v1.0.1/) * [ 1.0.0 ](https://tauri.app/release/tauri/v1.0.0/) * [ 1.0.0-rc.17 ](https://tauri.app/release/tauri/v1.0.0-rc.17/) * [ 1.0.0-rc.16 ](https://tauri.app/release/tauri/v1.0.0-rc.16/) * [ 1.0.0-rc.15 ](https://tauri.app/release/tauri/v1.0.0-rc.15/) * [ 1.0.0-rc.14 
](https://tauri.app/release/tauri/v1.0.0-rc.14/) * [ 1.0.0-rc.13 ](https://tauri.app/release/tauri/v1.0.0-rc.13/) * [ 1.0.0-rc.12 ](https://tauri.app/release/tauri/v1.0.0-rc.12/) * [ 1.0.0-rc.11 ](https://tauri.app/release/tauri/v1.0.0-rc.11/) * [ 1.0.0-rc.10 ](https://tauri.app/release/tauri/v1.0.0-rc.10/) * [ 1.0.0-rc.9 ](https://tauri.app/release/tauri/v1.0.0-rc.9/) * [ 1.0.0-rc.8 ](https://tauri.app/release/tauri/v1.0.0-rc.8/) * [ 1.0.0-rc.7 ](https://tauri.app/release/tauri/v1.0.0-rc.7/) * [ 1.0.0-rc.6 ](https://tauri.app/release/tauri/v1.0.0-rc.6/) * [ 1.0.0-rc.5 ](https://tauri.app/release/tauri/v1.0.0-rc.5/) * [ 1.0.0-rc.4 ](https://tauri.app/release/tauri/v1.0.0-rc.4/) * [ 1.0.0-rc.3 ](https://tauri.app/release/tauri/v1.0.0-rc.3/) * [ 1.0.0-rc.2 ](https://tauri.app/release/tauri/v1.0.0-rc.2/) * [ 1.0.0-rc.1 ](https://tauri.app/release/tauri/v1.0.0-rc.1/) * [ 1.0.0-rc.0 ](https://tauri.app/release/tauri/v1.0.0-rc.0/) * [ 1.0.0-beta-rc.4 ](https://tauri.app/release/tauri/v1.0.0-beta-rc.4/) * [ 1.0.
Printing pages to PDFs, combining them, and converting to Markdown via Docling for use in Archon is a solid manual way to curate content, but it's time-consuming and doesn't scale well for broader crawls. The core issues you're facing (noisy chunks from headers, footers, sidebars, and irrelevant link lists like the Tauri releases example) stem from two main areas: inadequate preprocessing of raw web data and suboptimal chunking strategies. Below, I'll outline targeted improvements drawn from established best practices. They focus on cleaning data upfront, smarter chunking, and post-chunk filtering to reduce useless chunks while preserving context for coding projects and similar use cases.
1. Enhance Preprocessing: Extract Main Content Before Crawling or Chunking
The default "knowledge crawl" (assuming this is a basic web scraper or tool like in many RAG frameworks) grabs everything indiscriminately, leading to hundreds of noisy chunks. Shift to intelligent extraction that strips away boilerplate (e.g., navigation, ads, footers) and focuses on the core article or documentation.
Recommended Tools for Main Content Extraction:
- Firecrawl: An AI-powered web crawler specifically designed for RAG pipelines. It scrapes websites, extracts clean Markdown (handling structure like headings and code blocks), and skips irrelevant sections. For a site like tauri.app, it would ignore release lists unless they're embedded in main content. Start by integrating it into your pipeline: provide a URL and it outputs structured MD ready for chunking. It's open source on GitHub and optimized for AI apps.
- Readability.js (via Node.js or Python wrappers): A lightweight library from Mozilla that parses HTML and pulls out just the readable article body, discarding headers, footers, and sidebars. Use it in a script to process crawled pages before PDF conversion; for example, in Python with `readability-lxml` you can fetch a page, clean it, and save it as text/Markdown (see the sketch below). This would eliminate chunks like your Tauri link list entirely if it lives in a sidebar.
- Unstructured.io: A library for ingesting and cleaning various data sources, including web pages. It has built-in chunking but excels at removing noise during extraction. Feed it URLs or HTML and it outputs cleaned text chunks with options for smart splitting.
- Other options: For custom needs, use BeautifulSoup (Python) with heuristics that target `<main>` or `<article>` tags, or Crawl4AI for AI-assisted crawling that focuses on relevant sections.
Implementation Tip: Automate this in your workflow. Instead of manual PDF printing, script a crawler: fetch URLs → extract main content with one of the tools above → combine into a single MD file (a minimal sketch follows). This replaces your PDF step and reduces noise by 70-90% upfront. For multi-page sites, use sitemaps or recursive crawling with depth limits to target only relevant sections (e.g., docs, not releases). When wiring this into Archon, plug the extractor into `SiteConfig.get_link_pruning_markdown_generator()` so the recursive crawler uses the cleaned HTML/Markdown before chunking.
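For illustration, here's a minimal sketch of that script using `requests` plus `readability-lxml`, with `markdownify` for the HTML-to-Markdown step (the URL list and output filename are placeholders; swap in whichever converter you prefer):

```python
# pip install requests readability-lxml markdownify
import requests
from readability import Document            # readability-lxml's main-content extractor
from markdownify import markdownify as md   # HTML -> Markdown

URLS = [
    "https://example.com/docs/page-1/",      # placeholder URLs for the pages you care about
    "https://example.com/docs/page-2/",
]

def extract_main_content(url: str) -> str:
    """Fetch a page and return only the readable article body as Markdown."""
    html = requests.get(url, timeout=30).text
    doc = Document(html)                     # drops nav, footer, and sidebar boilerplate
    body_md = md(doc.summary(), heading_style="ATX")
    return f"# {doc.title()}\n\n{body_md}"

# Combine everything into a single .md file (replaces the manual PDF step).
with open("combined_docs.md", "w", encoding="utf-8") as fh:
    fh.write("\n\n---\n\n".join(extract_main_content(u) for u in URLS))
```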
2. Adopt Advanced Chunking Strategies
Once you have cleaner input (e.g., MD from Docling or Firecrawl), move beyond naive splitting, which often creates broken or useless chunks like your example (a truncated link list with no semantic value). Aim for chunks that maintain context while being retrievable.
Key Strategies:
| Strategy | Description | Pros | Cons | When to Use |
| --- | --- | --- | --- | --- |
| Fixed-Size Chunking | Split text into uniform sizes (e.g., 300-500 tokens) with overlap (20-30%). | Simple, fast. | Breaks mid-sentence; includes noise if not pre-cleaned. | Quick prototypes; pair with filtering. |
| Recursive/Structure-Based Chunking | Split hierarchically: first by sections/headings (e.g., Markdown H1/H2), then paragraphs, then sentences. Libraries like LangChain's RecursiveCharacterTextSplitter handle this. | Preserves document structure; avoids breaking code blocks or lists. | Requires structured input like MD. | Documentation sites like Tauri, where content has clear sections. |
| Semantic Chunking | Embed sentences (using models like Sentence Transformers or OpenAI embeddings), then group by similarity thresholds (e.g., cosine >0.7); break when similarity drops. | Creates meaningful, context-aware chunks; reduces irrelevant ones like pure link lists. | Computationally heavier (needs embeddings). | Noisy or unstructured text; improves retrieval accuracy by 20-30% in tests. |
| Content-Aware/Hybrid Chunking | Combine structure with semantics, e.g., chunk by pages/sections but refine with embeddings. Add metadata (e.g., source URL, heading) to each chunk for better retrieval. | Balances speed and quality; handles varied content types. | Needs tuning for thresholds. | Mixed web content; use for coding projects to keep API/code examples intact. |
| Page-Level or Metadata Chunking | For PDFs/docs, chunk by pages but attach summaries or keywords. Useful if sticking with your PDF workflow. | Simple for scanned/printed pages. | Loses cross-page context without overlap. | Medical/legal docs, but adapt for web. |

Best Starting Point: Switch to semantic chunking for your use case. It would detect your Tauri link chunk as low-similarity (mostly repetitive URLs) and either merge it with context or discard it. Implement in Python with LangChain or LlamaIndex:
```python
# pip install langchain-experimental langchain-openai
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # or a HuggingFace embedding model

embedder = OpenAIEmbeddings()
chunker = SemanticChunker(embedder, breakpoint_threshold_type="percentile")
chunks = chunker.split_text(your_md_text)
```

Tune the breakpoint threshold (e.g., `breakpoint_threshold_amount=95` with the percentile type) to avoid tiny chunks.
Overlap and Size Guidelines: Always add 10-20% overlap between chunks to fix broken links/context. Target 200-1024 tokens per chunk—smaller for precise retrieval, larger for context-heavy tasks like code generation.
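If you start with the structure-based option instead, the same overlap guideline applies; here is a small sketch with LangChain's `RecursiveCharacterTextSplitter` (sizes are in characters, roughly four characters per token, and are illustrative only):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# ~300-500 tokens per chunk with ~20% overlap, expressed in characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1600,
    chunk_overlap=320,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],  # prefer Markdown section breaks
)
chunks = splitter.split_text(your_md_text)
```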
3. Post-Chunk Filtering and Quality Control
Even with better extraction and chunking, some noise slips through. Add a filtering layer:
Criteria to Discard Chunks:
Too short (<50 tokens) or too long (>2000 tokens).
High link density: If >40% of text is URLs (detect via regex like
http[s]?://count), skip—as in your Tauri example.Low relevance: Embed the chunk and compare to a query embedding (e.g., "Tauri coding guide"); discard if similarity <0.5.
Duplicates: Use hashing or similarity checks to remove near-identical chunks (e.g., repeated headers).
Implementation Hook: Introduce these checks inside `DocumentStorageOperations.process_and_store_documents`, right before appending to `all_contents`. This keeps the logic close to the current chunk accumulator and prevents link-only chunks from ever reaching Supabase.
Tools: In LangChain or custom scripts, apply filters post-chunking. Note that while some research shows "noise" (irrelevant docs) can oddly boost RAG in specific setups, focus on removal in your scenario to avoid dilution.
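Here is a self-contained sketch of those discard rules (thresholds match the criteria above; the embedding-based relevance check is left out to keep it dependency-free):

```python
import hashlib
import re

URL_RE = re.compile(r"https?://\S+")
_seen_hashes: set[str] = set()

def keep_chunk(text: str, min_tokens: int = 50, max_tokens: int = 2000,
               max_link_ratio: float = 0.4) -> bool:
    """Return False for chunks that are too short/long, link-dominated, or duplicates."""
    tokens = text.split()                        # crude whitespace token count
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False

    link_chars = sum(len(m) for m in URL_RE.findall(text))
    if link_chars / max(len(text), 1) > max_link_ratio:
        return False                             # e.g. the Tauri release-link chunk

    digest = hashlib.sha1(text.strip().lower().encode()).hexdigest()
    if digest in _seen_hashes:                   # drop near-identical repeated boilerplate
        return False
    _seen_hashes.add(digest)
    return True

chunks = [c for c in chunks if keep_chunk(c)]
```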
Evaluation: Test retrieval quality with metrics like recall@K or semantic similarity. Tools like RAGAS can automate this.
4. Overall Pipeline Optimization
Updated Workflow: URL(s) → Crawl/Extract main content (Firecrawl/Readability) → Clean MD → Semantic chunking (LangChain) → Filter chunks → Embed and store in vector DB (e.g., for Archon).
Scalability Tips: For large sites, limit crawl depth to 1-2 and use sitemaps. If dealing with dynamic content, add periodic re-crawls but only update changed chunks via hashing.
Potential Pitfalls: Over-cleaning can lose useful side info (e.g., related links in docs), so test on a sample like tauri.app. If your crawl tool supports it, add custom instructions like "extract only /docs/ paths."
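If the crawler exposes discovered links before fetching, the "only /docs/ paths" rule can be a plain URL filter; a minimal sketch (the prefixes and example URLs are placeholders to adapt per site):

```python
from urllib.parse import urlparse

# Placeholder: in practice this list comes from the crawler's link-discovery step.
discovered_urls = [
    "https://example.com/docs/getting-started/",
    "https://example.com/release/v2.0.0-beta.2/",
]

ALLOWED_PREFIXES = ("/docs/",)      # keep documentation pages
BLOCKED_PREFIXES = ("/release/",)   # skip release-note listings

def should_crawl(url: str) -> bool:
    """Allow only documentation paths and explicitly block release pages."""
    path = urlparse(url).path
    if any(path.startswith(p) for p in BLOCKED_PREFIXES):
        return False
    return any(path.startswith(p) for p in ALLOWED_PREFIXES)

urls_to_fetch = [u for u in discovered_urls if should_crawl(u)]
```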
This should cut useless chunks dramatically, making retrieval more efficient for coding projects. If you share more details on your tools (e.g., what "knowledge crawl" or Archon entails), I can refine further.
5. Targeted Code Changes in Archon
To make the guidance above actionable, plan for the following commits:
Tighten DOM pruning during crawl

- File: `python/src/server/services/crawling/helpers/site_config.py`
- Update `SiteConfig.get_link_pruning_markdown_generator()` to either lower the `PruningContentFilter` threshold or replace the current generator with a readability-based extractor (Firecrawl, Readability.js wrapper); see the sketch after this item.
- Add site-specific exclusion lists (e.g., ignore `nav`, `aside`, `.release-list`) so link-only sidebars never enter the Markdown stream.
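A hedged sketch of what that generator/config change could look like, assuming the Crawl4AI classes (`DefaultMarkdownGenerator`, `PruningContentFilter`, `CrawlerRunConfig`) the crawler wraps; verify the parameter names against the installed crawl4ai version:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# A lower threshold prunes link-heavy, low-text-density blocks more aggressively.
markdown_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.45, threshold_type="dynamic"),
)

run_config = CrawlerRunConfig(
    markdown_generator=markdown_generator,
    excluded_tags=["nav", "aside", "footer", "header"],  # drop boilerplate containers
    excluded_selector=".release-list",                   # site-specific deny rule
)
```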
Add chunk-level quality gate

- File: `python/src/server/services/crawling/document_storage_operations.py`
- Inside `process_and_store_documents`, before appending to `all_contents`, run the heuristics described in section 3 (length, link ratio, boilerplate phrases).
- Skip any chunk that fails the checks and log how many were removed for observability.
Swap to semantic chunking

- File: `python/src/server/services/storage/base_storage_service.py` (and `DocumentStorageService`)
- Replace the current `smart_chunk_text_async` call with a semantic splitter. For example, integrate LangChain's `SemanticChunker` (defaulting to OpenAI embeddings) and fall back to the existing logic when embeddings are unavailable; see the sketch after this item.
- Store the chunking mode in metadata (`metadata["chunking_strategy"] = "semantic"`) so we can A/B test retrieval impact.
Expose per-domain overrides

- File: `python/src/server/services/crawling/helpers/site_config.py`
- Add a registry mapping hostnames to CSS selector allow/deny lists. When `SiteConfig.is_documentation_site` is true, merge the global defaults with the domain-specific rules (see the sketch after this item). This lets us suppress the Tauri release sidebar without harming other doc sites.
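A minimal sketch of such a registry (the dictionary name and the non-Tauri selectors are assumptions):

```python
# Global defaults applied to every documentation site.
DEFAULT_DENY_SELECTORS = ["nav", "aside", "footer", ".sidebar"]

# Hostname-specific additions; the Tauri entry targets its release sidebar, and the
# Read the Docs entry assumes the standard sphinx_rtd_theme sidebar class.
DOMAIN_DENY_SELECTORS = {
    "tauri.app": [".release-list"],
    "python-binance.readthedocs.io": [".wy-nav-side"],
}

def deny_selectors_for(hostname: str) -> list[str]:
    """Merge the global defaults with any domain-specific deny rules."""
    return DEFAULT_DENY_SELECTORS + DOMAIN_DENY_SELECTORS.get(hostname, [])
```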
Regression tests & telemetry
Create a focused crawler test in
python/tests/test_crawl_orchestration_isolated.pythat feeds a canned HTML page containing a sidebar of releases plus a main article. Assert that only the article chunks reach storage.Emit counters (
chunks_filtered_total,chunk_filter_reason) via Logfire so we can monitor the effectiveness of the heuristics in production.
Oh yeah, also hoping for this!
Use case: go here and check the hundreds of pages of release notes: https://python-binance.readthedocs.io/en/latest/
When I search the documentation from Archon, the results are overloaded with useless release notes, forcing me to click "Load More documents" again and again to skip them.