feat: Advanced web crawling with domain filtering
Summary
This PR implements Advanced Web Crawling with Domain Filtering capabilities for Archon's knowledge management system, along with comprehensive Edit Crawler Configuration and Enhanced Metadata Viewing features.
https://github.com/user-attachments/assets/6c058100-37d4-40a9-8121-9a33a314a8f6
https://github.com/user-attachments/assets/300e46b6-224d-4a08-8819-2349012557ca
https://github.com/user-attachments/assets/bd759b58-26df-4347-8e2c-49f72d0dd7e8
New Features Added
Edit Crawler Configuration
- Edit existing crawler settings: Users can now modify any existing crawler configuration including URL, knowledge type, max depth, tags, and advanced domain filtering settings
- Edit Configuration menu item: Added to knowledge card actions dropdown, positioned between View options and Recrawl
- Recrawl warning: Clear warning message that saving changes will trigger a recrawl and replace existing documents
- Complete configuration editing: Reuses the same AdvancedCrawlConfig component from the add dialog for consistency
- Progress tracking: Full integration with existing progress tracking system for recrawl operations
Enhanced Metadata Viewing
- Complete metadata display: Fixed to show ALL metadata properties from backend (previously only showed 3-5 fields)
- Improved viewing area: Increased metadata panel height from 256px to 500px with proper scrolling
- Full JSON display: Now shows complete metadata, including:
  - url, source, headers, filename
  - has_code, has_links, source_id
  - char_count, word_count, line_count
  - chunk_size, chunk_index
  - source_type, knowledge_type
  - All other backend metadata fields
- Better UX: Visible scrollbar and proper text wrapping for large metadata objects
Advanced Domain Filtering
- Whitelist domains: Only crawl pages from specified domains
- Blacklist domains: Exclude specific domains from crawling
- URL patterns: Include/exclude based on URL pattern matching
- Priority system: Blacklist > Whitelist > Patterns for conflict resolution
- Domain statistics: Shows document count per domain in filter dropdown
- "All domains" option: Easy way to view all documents across domains
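The priority rule above (blacklist beats whitelist, URL patterns apply last) can be sketched roughly as follows. This is an illustrative helper, not the actual DomainFilter API: the function name, signature, and the use of shell-style wildcards via `fnmatch` are assumptions.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def is_url_allowed(url, whitelist=None, blacklist=None,
                   include_patterns=None, exclude_patterns=None):
    """Apply filters in priority order: blacklist > whitelist > patterns."""
    domain = urlparse(url).netloc.lower()

    def matches(d, rule):
        # "*.example.com" matches subdomains; treat it as covering the apex too
        return fnmatch(d, rule.lower()) or d == rule.lower().lstrip("*.")

    # 1. Blacklist wins over everything else
    if blacklist and any(matches(domain, d) for d in blacklist):
        return False
    # 2. If a whitelist exists, the domain must be on it
    if whitelist and not any(matches(domain, d) for d in whitelist):
        return False
    # 3. URL patterns are applied last
    if exclude_patterns and any(fnmatch(url, p) for p in exclude_patterns):
        return False
    if include_patterns and not any(fnmatch(url, p) for p in include_patterns):
        return False
    return True
```

For example, a URL on a whitelisted domain is still rejected if that domain also appears on the blacklist, which is the conflict-resolution behavior described above.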
Improved Document Browser
- Domain filter dropdown: Replaced pills with cleaner dropdown interface showing domain statistics
- Enhanced sidebar: Better search and filtering experience in Knowledge Inspector
- Metadata panel: Collapsible metadata viewer in content area
Technical Implementation
Frontend Architecture
- Vertical slice pattern: All edit configuration features organized in knowledge feature folder
- React Query integration: useUpdateCrawlConfig hook with optimistic updates and error handling
- Component reuse: EditCrawlConfigDialog reuses existing AdvancedCrawlConfig, KnowledgeTypeSelector, etc.
- TypeScript: Full type safety with proper CrawlConfig interfaces
Backend Integration
- New API endpoint:
POST /api/knowledge-items/{source_id}/update-config - Configuration validation: Uses existing CrawlRequestV2 and CrawlConfig Pydantic models
- Safe recrawling: Deletes old data before starting new crawl with updated configuration
- Progress tracking: Returns progress ID for monitoring recrawl status
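A rough sketch of that endpoint's flow, not the actual implementation: `store` and `start_crawl` below are hypothetical stand-ins for Archon's document store and background crawl runner, and the dict-based config is a placeholder for the real Pydantic models.

```python
import uuid


def update_crawl_config(source_id, new_config, store, start_crawl):
    """Sketch of the update-config flow: validate, delete old
    documents, start a recrawl, and return a progress ID."""
    if not new_config.get("url"):
        raise ValueError("url is required")

    # Safe recrawl: old data is deleted before the new crawl begins,
    # so saved changes replace existing documents rather than merging.
    store.delete_documents(source_id)
    store.save_config(source_id, new_config)

    # The caller polls this ID against the progress tracking system.
    progress_id = str(uuid.uuid4())
    start_crawl(source_id, new_config, progress_id)
    return {"progressId": progress_id}
```

The delete-before-recrawl ordering is what makes the "saving changes will replace existing documents" warning in the edit dialog accurate.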
Key Files Modified/Added
Frontend Components
- src/features/knowledge/components/EditCrawlConfigDialog.tsx (NEW)
- src/features/knowledge/components/KnowledgeCardActions.tsx
- src/features/knowledge/components/KnowledgeCard.tsx
- src/features/knowledge/inspector/components/ContentViewer.tsx
- src/features/knowledge/inspector/components/KnowledgeInspector.tsx
- src/features/knowledge/inspector/components/InspectorSidebar.tsx
Services & Hooks
- src/features/knowledge/hooks/useKnowledgeQueries.ts - Added useUpdateCrawlConfig
- src/features/knowledge/services/knowledgeService.ts - Added updateCrawlConfig
Backend API
- python/src/server/api_routes/knowledge_api.py - Added update-config endpoint
User Experience Flow
1. Editing Configuration:
- User clicks three-dot menu on knowledge card → "Edit Configuration"
- Dialog opens with current settings pre-loaded
- User can modify any crawler settings including advanced domain filtering
- Warning shows that recrawl will be triggered
- Save triggers new crawl with updated configuration
2. Enhanced Metadata Viewing:
- User opens Knowledge Inspector and selects any document
- Metadata panel at bottom shows complete JSON with all properties
- 500px height with scrolling for large metadata objects
- All backend fields visible for debugging and analysis
3. Domain Filtering:
- User adds advanced crawl configuration with domain filters
- During crawl, domains are filtered according to whitelist/blacklist/patterns
- In document browser, user can filter by domain to validate results
- Domain dropdown shows statistics (e.g., "example.com (15 documents)")
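The per-domain statistics in the dropdown can be computed by grouping documents by their URL's host. This is a minimal sketch, assuming documents expose their source URL; the helper name is illustrative:

```python
from collections import Counter
from urllib.parse import urlparse


def domain_stats(document_urls):
    """Build dropdown labels like 'example.com (15 documents)'
    by counting documents per domain, most frequent first."""
    counts = Counter(urlparse(u).netloc.lower() for u in document_urls)
    return [
        f"{domain} ({n} document{'s' if n != 1 else ''})"
        for domain, n in counts.most_common()
    ]
```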
Testing & Validation
- ✅ TypeScript compilation passes
- ✅ Frontend build completes successfully
- ✅ All new components follow Archon's design patterns
- ✅ Backend API endpoint follows existing patterns
- ✅ Optimistic updates work correctly
- ✅ Progress tracking integrates with existing system
- ✅ Complete metadata display verified with real data
Benefits
- Better content control: Users can precisely control what gets crawled using domain filtering
- Configuration flexibility: Easy editing of existing crawler settings without re-creating from scratch
- Enhanced debugging: Complete metadata visibility helps users understand and debug crawl results
- Improved UX: Cleaner domain filtering interface and better metadata viewing experience
- Validation capabilities: Users can easily verify that domain filtering worked as expected
This implementation provides a comprehensive solution for advanced web crawling with full configurability and excellent user experience for managing and analyzing crawled content.
Summary by CodeRabbit
New Features
- Crawl V2: start crawls with domain & URL-pattern filtering and update/save per-item crawl configurations; new "Edit Configuration" action to open the editor.
- Domain-aware browsing: filter and view documents by domain; metadata/footer shows source domain and quick links.
UI/UX Improvements
- Advanced crawl options panel, clearer dialog/tab layouts, loading/toast messages reflect domain-filtered crawls, and improved scroll/height behavior.
Bug Fixes
- More accurate combined text + domain filtering and reset behavior.
Tests
- Added comprehensive unit tests for domain filtering logic.
Walkthrough
Adds Crawl V2 with domain/pattern filtering across UI, hooks, types, services, and server; introduces AdvancedCrawlConfig UI, domain-aware browsing/inspector filters, update-crawl-config flow, DomainFilter logic in crawler, and tests.
Changes
| Cohort / File(s) | Summary |
|---|---|
| Add Knowledge dialog & Advanced config UI<br>archon-ui-main/src/features/knowledge/components/AddKnowledgeDialog.tsx, archon-ui-main/src/features/knowledge/components/AdvancedCrawlConfig.tsx | Add AdvancedCrawlConfig component and crawlConfig state; choose between v1 and v2 crawl requests; adjust layout, toasts, reset, and processing state to account for v2. |
| Knowledge card actions & edit dialog<br>archon-ui-main/src/features/knowledge/components/KnowledgeCard.tsx, archon-ui-main/src/features/knowledge/components/KnowledgeCardActions.tsx, archon-ui-main/src/features/knowledge/components/EditCrawlConfigDialog.tsx, archon-ui-main/src/features/knowledge/components/index.ts | Add EditCrawlConfigDialog, wire into KnowledgeCard/KnowledgeCardActions via new onEditConfig prop, expose AdvancedCrawlConfig via index export; enable editing and saving crawl_config (triggers recrawl). |
| Document browser & inspector UI<br>archon-ui-main/src/features/knowledge/components/DocumentBrowser.tsx, archon-ui-main/src/features/knowledge/inspector/components/ContentViewer.tsx, archon-ui-main/src/features/knowledge/inspector/components/InspectorSidebar.tsx, archon-ui-main/src/features/knowledge/inspector/components/KnowledgeInspector.tsx | Compute domainStats, add domain filter UI and domain links, toggle metadata panel, restructure layouts, wire selectedDomain state and filtering through sidebar/inspector/content viewer. |
| Hooks, services & types (client)<br>archon-ui-main/src/features/knowledge/hooks/useKnowledgeQueries.ts, archon-ui-main/src/features/knowledge/services/knowledgeService.ts, archon-ui-main/src/features/knowledge/types/knowledge.ts | Add types CrawlConfig/CrawlRequestV2; add useCrawlUrlV2 and useUpdateCrawlConfig with optimistic updates; add service methods crawlUrlV2 and updateCrawlConfig. |
| Server API & Pydantic models<br>python/src/server/api_routes/knowledge_api.py, python/src/server/models/crawl_models.py | Add endpoints for /crawl-v2 and update-crawl-config; introduce CrawlConfig and CrawlRequestV2 models with validation; spawn background v2 crawl runner and progress tracking. |
| Crawler services & domain filtering<br>python/src/server/services/crawling/crawling_service.py, python/src/server/services/crawling/domain_filter.py, python/src/server/services/crawling/strategies/recursive.py, python/src/server/services/crawling/strategies/batch.py | Introduce DomainFilter, thread crawl_config through CrawlingService and strategies, apply domain/pattern checks when expanding/processing URLs; update constructors/signatures and logging/progress behavior. |
| Tests<br>python/src/server/services/tests/test_domain_filter.py | Add comprehensive unit tests covering domain normalization, whitelist/blacklist precedence, wildcard/subdomain matching, include/exclude patterns, relative URLs, and edge cases. |
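The wildcard/subdomain behavior those tests describe can be sketched as pytest-style cases. `subdomain_matches` is an illustrative stand-in, not the actual DomainFilter API, and the convention that `*.example.com` also covers the apex domain is an assumption:

```python
from fnmatch import fnmatch


def subdomain_matches(domain: str, rule: str) -> bool:
    """Hypothetical matcher: '*.example.com' covers subdomains
    and, by convention here, the apex domain itself."""
    domain, rule = domain.lower(), rule.lower()
    return fnmatch(domain, rule) or (rule.startswith("*.") and domain == rule[2:])


def test_wildcard_covers_subdomains():
    assert subdomain_matches("docs.example.com", "*.example.com")


def test_wildcard_covers_apex():
    assert subdomain_matches("example.com", "*.example.com")


def test_unrelated_domain_rejected():
    assert not subdomain_matches("notexample.com", "*.example.com")
```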
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    actor U as User
    participant UI as AddKnowledgeDialog
    participant H as useCrawlUrl / useCrawlUrlV2
    participant S as knowledgeService
    participant API as /api/knowledge-items/crawl[-v2]
    participant BG as Background Crawl Runner
    participant P as ProgressTracker
    U->>UI: Click "Start Crawling"
    UI->>H: submit(url, crawl_config?)
    alt crawl_config present (v2)
        H->>S: crawlUrlV2({url,...,crawl_config})
        S->>API: POST /crawl-v2
    else no crawl_config (v1)
        H->>S: crawlUrl({url,...})
        S->>API: POST /crawl
    end
    API->>P: create progressId
    API-->>S: {progressId, metadata}
    S-->>H: response
    H->>H: optimistic updates (cache)
    API->>BG: start async crawl task
    BG->>P: update progress
    P-->>UI: polled progress updates
```
```mermaid
sequenceDiagram
    autonumber
    actor U as User
    participant KC as KnowledgeCard
    participant D as EditCrawlConfigDialog
    participant H as useUpdateCrawlConfig
    participant S as knowledgeService
    participant API as POST /knowledge-items/{id}/update-config
    participant BG as Recrawl Runner
    participant P as ProgressTracker
    U->>KC: Open menu → Edit Configuration
    KC->>D: open(sourceId)
    U->>D: Save
    D->>H: mutate({sourceId, url, max_depth, tags, crawl_config})
    H->>S: updateCrawlConfig(request)
    S->>API: POST update-config
    API->>P: create progressId
    API-->>S: {progressId}
    S-->>H: ok
    H->>H: optimistic processing state
    API->>BG: start recrawl
    BG->>P: progress updates
```
```mermaid
sequenceDiagram
    autonumber
    participant R as RecursiveCrawlStrategy
    participant DF as DomainFilter
    participant CFG as CrawlConfig
    participant Q as URL Queue
    loop discovered next_url
        R->>DF: is_url_allowed(next_url, base_url, CFG)
        alt allowed
            DF-->>R: true
            R->>Q: enqueue next_url
        else blocked
            DF-->>R: false
            R->>R: skip & log
        end
    end
```
Estimated code review effort
4 (Complex) | ~60 minutes
Possibly related PRs
- coleam00/Archon#395 - Modifies BatchCrawlStrategy/RecursiveCrawlStrategy; intersects with domain-filter plumbing and strategy changes.
- coleam00/Archon#661 - Changes AddKnowledgeDialog UI; overlaps with dialog/tab/layout adjustments and crawl-v2 integration.
- coleam00/Archon#707 - Alters useKnowledgeQueries optimistic/cache logic; closely related to new useCrawlUrlV2/useUpdateCrawlConfig behavior.
Suggested reviewers
- coleam00
- Wirasm
- tazmon95
Poem
A nibble of links, a hop through domains,
I sift with whiskers, follow careful lanes.
Wildcards twinkle, patterns neatly spun,
V2 drums rollingβrecrawls have begun.
Filters in paw, clean knowledge, crystal clear! ✨
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "feat: Advanced web crawling with domain filtering" is concise, a single sentence, and accurately captures the primary change in the changeset (adding advanced crawling with domain/domain-pattern filtering across frontend and backend). It is specific enough for a teammate scanning history and avoids noisy file lists or vague terms. The "feat:" prefix follows common conventional commit style used in many repos. |
| Description Check | ✅ Passed | The PR description is thorough and maps well to the provided diffs, containing a clear summary, detailed feature descriptions, technical implementation notes, key files changed, and a testing/validation section, so reviewers can understand scope and intent. It does not strictly follow the repository's required template structure: the explicit "Changes Made" bullet list, the "Type of Change" checkboxes, the "Affected Services" checkbox matrix, the formal Testing checklist with a Test Evidence code block, and the Checklist/Breaking Changes sections from the template are missing or not presented in the exact template format. Because substantive content is present and useful, this is rated as a pass while recommending alignment with the template for consistency and automation. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.
@coderabbit review
✅ Actions performed
Review triggered.
Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
@leex279 is this something we are still considering?
@Wirasm yes, need to finish that.