
feat: Advanced web crawling with domain filtering

Open leex279 opened this issue 3 months ago • 9 comments

Summary

This PR implements Advanced Web Crawling with Domain Filtering capabilities for Archon's knowledge management system, along with comprehensive Edit Crawler Configuration and Enhanced Metadata Viewing features.

https://github.com/user-attachments/assets/6c058100-37d4-40a9-8121-9a33a314a8f6

https://github.com/user-attachments/assets/300e46b6-224d-4a08-8819-2349012557ca

https://github.com/user-attachments/assets/bd759b58-26df-4347-8e2c-49f72d0dd7e8

New Features Added

🔧 Edit Crawler Configuration

  • Edit existing crawler settings: Users can now modify any existing crawler configuration including URL, knowledge type, max depth, tags, and advanced domain filtering settings
  • Edit Configuration menu item: Added to knowledge card actions dropdown, positioned between View options and Recrawl
  • Recrawl warning: Clear warning message that saving changes will trigger a recrawl and replace existing documents
  • Complete configuration editing: Reuses the same AdvancedCrawlConfig component from the add dialog for consistency
  • Progress tracking: Full integration with existing progress tracking system for recrawl operations

📊 Enhanced Metadata Viewing

  • Complete metadata display: Now shows all metadata properties from the backend (previously only 3-5 fields were displayed)
  • Improved viewing area: Increased metadata panel height from 256px to 500px with proper scrolling
  • Full JSON display: Now shows complete metadata including:
    • url, source, headers, filename
    • has_code, has_links, source_id
    • char_count, word_count, line_count
    • chunk_size, chunk_index
    • source_type, knowledge_type
    • All other backend metadata fields
  • Better UX: Visible scrollbar and proper text wrapping for large metadata objects

🌐 Advanced Domain Filtering

  • Whitelist domains: Only crawl pages from specified domains
  • Blacklist domains: Exclude specific domains from crawling
  • URL patterns: Include/exclude based on URL pattern matching
  • Priority system: Blacklist > Whitelist > Patterns for conflict resolution
  • Domain statistics: Shows document count per domain in filter dropdown
  • "All domains" option: Easy way to view all documents across domains

🎯 Improved Document Browser

  • Domain filter dropdown: Replaced pills with cleaner dropdown interface showing domain statistics
  • Enhanced sidebar: Better search and filtering experience in Knowledge Inspector
  • Metadata panel: Collapsible metadata viewer in content area

Technical Implementation

Frontend Architecture

  • Vertical slice pattern: All edit configuration features organized in knowledge feature folder
  • React Query integration: useUpdateCrawlConfig hook with optimistic updates and error handling
  • Component reuse: EditCrawlConfigDialog reuses existing AdvancedCrawlConfig, KnowledgeTypeSelector, etc.
  • TypeScript: Full type safety with proper CrawlConfig interfaces

Backend Integration

  • New API endpoint: POST /api/knowledge-items/{source_id}/update-config
  • Configuration validation: Uses existing CrawlRequestV2 and CrawlConfig Pydantic models
  • Safe recrawling: Deletes old data before starting new crawl with updated configuration
  • Progress tracking: Returns progress ID for monitoring recrawl status
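The PR validates requests with Pydantic (`CrawlRequestV2` and `CrawlConfig` in `crawl_models.py`). As a dependency-free sketch of the shape being validated, the field names and ranges below are assumptions, not the PR's actual model definitions:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CrawlConfig:
    # Illustrative field names; the real Pydantic models live in crawl_models.py.
    whitelist_domains: list[str] = field(default_factory=list)
    blacklist_domains: list[str] = field(default_factory=list)
    include_patterns: list[str] = field(default_factory=list)
    exclude_patterns: list[str] = field(default_factory=list)


@dataclass
class CrawlRequestV2:
    url: str
    knowledge_type: str = "technical"
    max_depth: int = 2
    tags: list[str] = field(default_factory=list)
    crawl_config: Optional[CrawlConfig] = None

    def __post_init__(self):
        # Mirrors the kind of validation Pydantic performs server-side.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError("url must be an absolute http(s) URL")
        if not 1 <= self.max_depth <= 5:
            raise ValueError("max_depth out of range")
```

The update-config endpoint would accept the same payload plus the `source_id` path parameter, reusing this validation before deleting old data and starting the recrawl.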

Key Files Modified/Added

Frontend Components

  • src/features/knowledge/components/EditCrawlConfigDialog.tsx ⭐ NEW
  • src/features/knowledge/components/KnowledgeCardActions.tsx
  • src/features/knowledge/components/KnowledgeCard.tsx
  • src/features/knowledge/inspector/components/ContentViewer.tsx
  • src/features/knowledge/inspector/components/KnowledgeInspector.tsx
  • src/features/knowledge/inspector/components/InspectorSidebar.tsx

Services & Hooks

  • src/features/knowledge/hooks/useKnowledgeQueries.ts - Added useUpdateCrawlConfig
  • src/features/knowledge/services/knowledgeService.ts - Added updateCrawlConfig

Backend API

  • python/src/server/api_routes/knowledge_api.py - Added update-config endpoint

User Experience Flow

  1. Editing Configuration:

    • User clicks three-dot menu on knowledge card → "Edit Configuration"
    • Dialog opens with current settings pre-loaded
    • User can modify any crawler settings including advanced domain filtering
    • Warning shows that recrawl will be triggered
    • Save triggers new crawl with updated configuration
  2. Enhanced Metadata Viewing:

    • User opens Knowledge Inspector and selects any document
    • Metadata panel at bottom shows complete JSON with all properties
    • 500px height with scrolling for large metadata objects
    • All backend fields visible for debugging and analysis
  3. Domain Filtering:

    • User adds advanced crawl configuration with domain filters
    • During crawl, domains are filtered according to whitelist/blacklist/patterns
    • In document browser, user can filter by domain to validate results
    • Domain dropdown shows statistics (e.g., "example.com (15 documents)")
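The per-domain counts shown in the dropdown amount to grouping documents by hostname. A minimal sketch (the real computation lives in the TypeScript `domainStats` logic; this helper name is illustrative):

```python
from collections import Counter
from urllib.parse import urlparse


def domain_stats(doc_urls: list[str]) -> list[str]:
    """Build dropdown labels like 'example.com (15 documents)', most frequent first."""
    counts = Counter(urlparse(u).netloc.lower() for u in doc_urls)
    return [f"{domain} ({n} document{'s' if n != 1 else ''})"
            for domain, n in counts.most_common()]
```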

Testing & Validation

  • ✅ TypeScript compilation passes
  • ✅ Frontend build completes successfully
  • ✅ All new components follow Archon's design patterns
  • ✅ Backend API endpoint follows existing patterns
  • ✅ Optimistic updates work correctly
  • ✅ Progress tracking integrates with existing system
  • ✅ Complete metadata display verified with real data

Benefits

  • Better content control: Users can precisely control what gets crawled using domain filtering
  • Configuration flexibility: Easy editing of existing crawler settings without re-creating from scratch
  • Enhanced debugging: Complete metadata visibility helps users understand and debug crawl results
  • Improved UX: Cleaner domain filtering interface and better metadata viewing experience
  • Validation capabilities: Users can easily verify that domain filtering worked as expected

This implementation provides a comprehensive solution for advanced web crawling with full configurability and excellent user experience for managing and analyzing crawled content.

Summary by CodeRabbit

  • New Features

    • Crawl V2: start crawls with domain & URL-pattern filtering and update/save per-item crawl configurations; new "Edit Configuration" action to open the editor.
    • Domain-aware browsing: filter and view documents by domain; metadata/footer shows source domain and quick links.
  • UI/UX Improvements

    • Advanced crawl options panel, clearer dialog/tab layouts, loading/toast messages reflect domain-filtered crawls, and improved scroll/height behavior.
  • Bug Fixes

    • More accurate combined text + domain filtering and reset behavior.
  • Tests

    • Added comprehensive unit tests for domain filtering logic.

leex279 avatar Sep 22 '25 07:09 leex279

[!IMPORTANT]

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds Crawl V2 with domain/pattern filtering across UI, hooks, types, services, and server; introduces AdvancedCrawlConfig UI, domain-aware browsing/inspector filters, update-crawl-config flow, DomainFilter logic in crawler, and tests.

Changes

  • Add Knowledge dialog & Advanced config UI (archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx, archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx): Add AdvancedCrawlConfig component and crawlConfig state; choose between v1 and v2 crawl requests; adjust layout, toasts, reset, and processing state to account for v2.
  • Knowledge card actions & edit dialog (archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx, archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx, archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx, archon-ui-main/src/features/knowledge/components/index.ts): Add EditCrawlConfigDialog, wire into KnowledgeCard/KnowledgeCardActions via new onEditConfig prop, expose AdvancedCrawlConfig via index export; enable editing and saving crawl_config (triggers recrawl).
  • Document browser & inspector UI (archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx, archon-ui-main/src/features/knowledge/inspector/components/ContentViewer.tsx, archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx, archon-ui-main/src/features/knowledge/inspector/components/KnowledgeInspector.tsx): Compute domainStats, add domain filter UI and domain links, toggle metadata panel, restructure layouts, wire selectedDomain state and filtering through sidebar/inspector/content viewer.
  • Hooks, services & types (client) (archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts, archon-ui-main/src/features/knowledge/services/knowledgeService.ts, archon-ui-main/src/features/knowledge/types/knowledge.ts): Add types CrawlConfig/CrawlRequestV2; add useCrawlUrlV2 and useUpdateCrawlConfig with optimistic updates; add service methods crawlUrlV2 and updateCrawlConfig.
  • Server API & Pydantic models (python/src/server/api_routes/knowledge_api.py, python/src/server/models/crawl_models.py): Add endpoints for /crawl-v2 and update-crawl-config; introduce CrawlConfig and CrawlRequestV2 models with validation; spawn background v2 crawl runner and progress tracking.
  • Crawler services & domain filtering (python/src/server/services/crawling/crawling_service.py, python/src/server/services/crawling/domain_filter.py, python/src/server/services/crawling/strategies/recursive.py, python/src/server/services/crawling/strategies/batch.py): Introduce DomainFilter, thread crawl_config through CrawlingService and strategies, apply domain/pattern checks when expanding/processing URLs; update constructors/signatures and logging/progress behavior.
  • Tests (python/src/server/services/tests/test_domain_filter.py): Add comprehensive unit tests covering domain normalization, whitelist/blacklist precedence, wildcard/subdomain matching, include/exclude patterns, relative URLs, and edge cases.
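The test cases summarized above hinge on consistent domain normalization and on resolving relative links before filtering. A minimal sketch of those two concerns (helper names are illustrative, not the test file's actual API):

```python
from urllib.parse import urljoin, urlparse


def normalize_domain(url_or_domain: str) -> str:
    """Strip scheme, port, and 'www.' and lowercase, so filter entries compare consistently."""
    raw = url_or_domain if "//" in url_or_domain else "//" + url_or_domain
    host = urlparse(raw).netloc.split(":")[0].lower()
    return host[4:] if host.startswith("www.") else host


def resolve_relative(base_url: str, href: str) -> str:
    """Relative links inherit the base page's domain before the filter sees them."""
    return urljoin(base_url, href)
```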

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant UI as AddKnowledgeDialog
  participant H as useCrawlUrl / useCrawlUrlV2
  participant S as knowledgeService
  participant API as /api/knowledge-items/crawl[-v2]
  participant BG as Background Crawl Runner
  participant P as ProgressTracker

  U->>UI: Click "Start Crawling"
  UI->>H: submit(url, crawl_config?)
  alt crawl_config present (v2)
    H->>S: crawlUrlV2({url,...,crawl_config})
    S->>API: POST /crawl-v2
  else no crawl_config (v1)
    H->>S: crawlUrl({url,...})
    S->>API: POST /crawl
  end
  API->>P: create progressId
  API-->>S: {progressId, metadata}
  S-->>H: response
  H->>H: optimistic updates (cache)
  API->>BG: start async crawl task
  BG->>P: update progress
  P-->>UI: polled progress updates

sequenceDiagram
  autonumber
  actor U as User
  participant KC as KnowledgeCard
  participant D as EditCrawlConfigDialog
  participant H as useUpdateCrawlConfig
  participant S as knowledgeService
  participant API as POST /knowledge-items/{id}/update-config
  participant BG as Recrawl Runner
  participant P as ProgressTracker

  U->>KC: Open menu → Edit Configuration
  KC->>D: open(sourceId)
  U->>D: Save
  D->>H: mutate({sourceId, url, max_depth, tags, crawl_config})
  H->>S: updateCrawlConfig(request)
  S->>API: POST update-config
  API->>P: create progressId
  API-->>S: {progressId}
  S-->>H: ok
  H->>H: optimistic processing state
  API->>BG: start recrawl
  BG->>P: progress updates

sequenceDiagram
  autonumber
  participant R as RecursiveCrawlStrategy
  participant DF as DomainFilter
  participant CFG as CrawlConfig
  participant Q as URL Queue

  loop discovered next_url
    R->>DF: is_url_allowed(next_url, base_url, CFG)
    alt allowed
      DF-->>R: true
      R->>Q: enqueue next_url
    else blocked
      DF-->>R: false
      R->>R: skip & log
    end
  end
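The loop in the last diagram (check each discovered URL against the filter before enqueueing) can be sketched as a breadth-first traversal. This is a simplified synchronous sketch under assumed names, not the PR's async RecursiveCrawlStrategy:

```python
from collections import deque


def crawl_allowed_urls(start_url, fetch_links, is_allowed, max_depth=2):
    """BFS over discovered links; a URL is enqueued only if the filter allows it."""
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the configured depth
        for link in fetch_links(url):
            if link not in seen and is_allowed(link):
                seen.add(link)
                queue.append((link, depth + 1))
            # blocked or duplicate links are skipped (the real strategy logs these)
    return visited
```

Plugging in a predicate built from the crawl's whitelist/blacklist/patterns reproduces the allowed/blocked branch shown in the diagram.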

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • coleam00/Archon#395 – Modifies BatchCrawlStrategy/RecursiveCrawlStrategy; intersects with domain-filter plumbing and strategy changes.
  • coleam00/Archon#661 – Changes AddKnowledgeDialog UI; overlaps with dialog/tab/layout adjustments and crawl-v2 integration.
  • coleam00/Archon#707 – Alters useKnowledgeQueries optimistic/cache logic; closely related to new useCrawlUrlV2/useUpdateCrawlConfig behavior.

Suggested reviewers

  • coleam00
  • Wirasm
  • tazmon95

Poem

A nibble of links, a hop through domains,
I sift with whiskers, follow careful lanes.
Wildcards twinkle, patterns neatly spun,
V2 drums rolling – recrawls have begun.
Filters in paw – clean knowledge, crystal clear! 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title "feat: Advanced web crawling with domain filtering" is concise, a single sentence, and accurately captures the primary change in the changeset (adding advanced crawling with domain/domain-pattern filtering across frontend and backend). It is specific enough for a teammate scanning history and avoids noisy file lists or vague terms. The "feat:" prefix follows common conventional commit style used in many repos.
Description Check ✅ Passed The PR description is thorough and maps well to the provided diffs, containing a clear summary, detailed feature descriptions, technical implementation notes, key files changed, and a testing/validation section, so reviewers can understand scope and intent. It does not strictly follow the repository's required template structure: the explicit "Changes Made" bullet list, the "Type of Change" checkboxes, the "Affected Services" checkbox matrix, the formal Testing checklist with a Test Evidence code block, and the Checklist/Breaking Changes sections from the template are missing or not presented in the exact template format. Because substantive content is present and useful, this is rated as a pass while recommending alignment with the template for consistency and automation.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Sep 22 '25 07:09 coderabbitai[bot]

@coderabbit review

leex279 avatar Sep 22 '25 11:09 leex279

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai[bot] avatar Sep 22 '25 11:09 coderabbitai[bot]

@coderabbit review

leex279 avatar Sep 22 '25 13:09 leex279

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai[bot] avatar Sep 22 '25 13:09 coderabbitai[bot]

@coderabbit review

leex279 avatar Sep 22 '25 14:09 leex279

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai[bot] avatar Sep 22 '25 14:09 coderabbitai[bot]

@leex279 is this something we are still considering?

Wirasm avatar Nov 24 '25 09:11 Wirasm

@Wirasm yes, need to finish that.

leex279 avatar Nov 25 '25 08:11 leex279