
Fix #825: Handle relative URLs in sitemap parsing

Open thiagomaf opened this issue 2 months ago • 1 comments

🐛 Problem

Sitemap ingestion fails when XML contains relative URLs instead of absolute URLs.

Original Error

[ERROR]... × /docs/apps/ | Error:
Unexpected error in _crawl_web at line 500:
Error: URL must start with 'http://', 'https://', 'file://', or 'raw:'

Root Cause: Many sitemap generators output relative URLs (e.g., /docs/apps/) to keep the XML portable across domains. The sitemap parser was extracting these relative URLs as-is without composing them into absolute URLs, causing the crawler to reject them.

Example Problematic Sitemap

<url>
  <loc>/docs/apps/</loc>
  <lastmod>2020-10-06T08:48:45+00:00</lastmod>
</url>
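
To illustrate why an entry like this is rejected, here is a minimal sketch of a prefix check that is consistent with the error message above. The names `ALLOWED_PREFIXES` and `is_crawlable` are hypothetical; the crawler's actual validation code is not part of this PR.

```python
# Hypothetical reproduction of the prefix rule named in the error message.
ALLOWED_PREFIXES = ("http://", "https://", "file://", "raw:")


def is_crawlable(url: str) -> bool:
    """Return True if the URL starts with one of the accepted prefixes."""
    return url.startswith(ALLOWED_PREFIXES)


print(is_crawlable("/docs/apps/"))                          # False
print(is_crawlable("https://waha.devlike.pro/docs/apps/"))  # True
```

A relative `<loc>` value has no scheme at all, so it can never pass a check of this kind without first being resolved against a base URL.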

✅ Solution

Modified sitemap.py to automatically resolve relative URLs into absolute URLs with urllib.parse.urljoin, using the sitemap's own URL as the base.

How it Works

# Input:  sitemap_url = "https://waha.devlike.pro/docs/sitemap.xml"
#         relative_url = "/docs/apps/"
# Output: "https://waha.devlike.pro/docs/apps/"
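
The composition above corresponds to a single `urljoin` call. A minimal sketch using the values from the example (the variable names are illustrative, not taken from sitemap.py):

```python
from urllib.parse import urljoin

sitemap_url = "https://waha.devlike.pro/docs/sitemap.xml"
relative_url = "/docs/apps/"

# A path starting with "/" replaces the whole path of the base URL.
absolute_url = urljoin(sitemap_url, relative_url)
print(absolute_url)  # https://waha.devlike.pro/docs/apps/

# An already-absolute URL is returned unchanged.
unchanged = urljoin(sitemap_url, "https://waha.devlike.pro/docs/overview/")
print(unchanged)  # https://waha.devlike.pro/docs/overview/
```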

Changes Made

  1. Added imports: urljoin and urlparse from urllib.parse
  2. Enhanced parsing logic:
    • Detects if extracted URL is absolute (has scheme + netloc)
    • If relative, composes absolute URL using urljoin(sitemap_url, relative_url)
    • Validates composed URLs have valid HTTP(S) scheme and netloc
    • Skips invalid URLs with warnings
  3. Improved logging: Debug logs for composed URLs
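
The steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the code from sitemap.py: the function name `to_absolute_urls` and the log messages are hypothetical.

```python
import logging
from urllib.parse import urljoin, urlparse

logger = logging.getLogger(__name__)


def to_absolute_urls(sitemap_url: str, raw_urls: list[str]) -> list[str]:
    """Hypothetical helper: compose relative URLs and drop invalid ones."""
    absolute: list[str] = []
    for raw in raw_urls:
        candidate = raw.strip()
        if not candidate:
            continue
        parsed = urlparse(candidate)
        # "Absolute" here means the URL carries both a scheme and a netloc.
        if not (parsed.scheme and parsed.netloc):
            candidate = urljoin(sitemap_url, candidate)
            parsed = urlparse(candidate)
            logger.debug("Composed %r into %r", raw, candidate)
        # Only keep HTTP(S) URLs that have a host after composition.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            absolute.append(candidate)
        else:
            logger.warning("Skipping invalid URL %r", raw)
    return absolute


urls = to_absolute_urls(
    "https://waha.devlike.pro/docs/sitemap.xml",
    ["/docs/apps/", "https://example.com/a", "mailto:someone@example.com", "  ../blog/  "],
)
print(urls)
```

In this sketch the `mailto:` entry is skipped with a warning because it has no netloc and is not HTTP(S), while `../blog/` resolves against the sitemap's directory.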

🧪 Testing

Comprehensive Test Suite

Added test_sitemap_relative_urls.py with 10 test cases:

Test | Coverage
✅ Absolute URLs | Preserved unchanged
✅ Relative URLs | Composed to absolute
✅ Mixed sitemaps | Both types handled
✅ Subdirectory paths | Parent-relative (../) works
✅ Invalid URLs | Gracefully skipped
✅ HTTP errors | Handled gracefully
✅ Network errors | Handled gracefully
✅ Malformed XML | Handled gracefully
✅ Whitespace trimming | URLs cleaned properly
✅ Real-world example | waha.devlike.pro scenario

Test Results

$ pytest tests/test_sitemap_relative_urls.py -v
======================== 10 passed in 1.19s ========================

$ pytest tests/test_url_handler.py -v
======================== 15 passed in 1.05s ========================

Manual Testing

✅ Successfully crawled https://waha.devlike.pro/docs/sitemap.xml
✅ All relative URLs composed correctly
✅ No more "URL must start with 'http://'" errors
✅ Task completes without disappearing from UI


📊 Impact

Before Fix ❌

  • Sitemaps with relative URLs fail to crawl
  • Users get cryptic error messages in logs
  • Tasks disappear from UI without explanation
  • Common sitemap format not supported

After Fix ✅

  • Relative URLs automatically composed to absolute
  • Portable sitemaps now work
  • Better error handling with warnings
  • Backward compatible (absolute URLs unchanged)
  • Clear debug logging

🔍 Files Changed

python/
├── src/server/services/crawling/strategies/
│   └── sitemap.py                           # Modified: +28 lines
└── tests/
    └── test_sitemap_relative_urls.py        # New: 252 lines

✨ Benefits

  1. Broader Compatibility: Supports sitemaps from more generators
  2. Better UX: No cryptic errors for common sitemap formats
  3. Portable Sitemaps: Sites can use relative URLs for flexibility
  4. Robust Validation: Invalid URLs skipped gracefully
  5. Backward Compatible: Existing absolute URLs work unchanged
  6. Better Logging: Debug info for troubleshooting

🔗 Related Issues

Resolves #825


✅ Checklist

  • [x] Code follows project style guidelines (Ruff, MyPy compliant)
  • [x] All tests pass (10/10 new tests + 15/15 existing tests)
  • [x] No regressions in existing functionality
  • [x] Backward compatible with existing sitemaps
  • [x] Comprehensive test coverage added
  • [x] Manual testing completed successfully
  • [x] Documentation updated (via code comments and docstrings)

📸 Test Evidence

Successfully crawled sitemap that was previously failing:

INFO | Parsing sitemap: https://waha.devlike.pro/docs/sitemap.xml
INFO | Successfully extracted 76 URLs from sitemap (composed relative URLs)
INFO | Inserted batch 4 of 4 code examples

🙏 Additional Notes

This fix addresses the sitemap parsing issue. The UI error handling (tasks disappearing without notification) mentioned in #825 could be addressed in a follow-up PR if desired.


Tested on:

  • Docker: archon-server container
  • Python: 3.12.12
  • All services healthy and functional

Summary by CodeRabbit

Bug Fixes

  • Improved sitemap parsing to correctly handle both absolute and relative URLs
  • Relative URLs in sitemaps are now properly converted to absolute URLs based on sitemap context
  • Enhanced error handling and validation during URL extraction with detailed logging
  • Better handling of malformed XML, network errors, and HTTP errors in sitemap processing

thiagomaf avatar Oct 27 '25 09:10 thiagomaf

Walkthrough

This change addresses bug #825 by implementing relative URL conversion in sitemap parsing. The parser now uses urljoin to compose absolute URLs from the sitemap base URL and relative paths, while preserving already-absolute URLs. Enhanced error handling includes cancellation checks and per-URL exception handling, with detailed logging of successes and composition attempts.

Changes

Cohort / File(s) | Summary
Sitemap Parser Enhancement (python/src/server/services/crawling/strategies/sitemap.py) | Added URL utilities (urlparse, urljoin) to normalize relative URLs to absolute URLs. Replaces direct text extraction with two-step process: collect raw URLs, then compose and validate relative ones. Introduces optional cancellation check, nested try/except blocks for per-URL processing, and expanded logging for successes, composition attempts, and composition failures. Return type documents and logs absolute URLs.
Relative URL Test Suite (python/tests/test_sitemap_relative_urls.py) | New comprehensive test module verifying sitemap parsing with mixed absolute/relative URLs. Tests URL preservation, composition, subdirectory/parent-relative resolution, invalid URL filtering, HTTP/XML error handling, whitespace trimming, and real-world scenarios.
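
The two-step process described in the summary (collect raw `<loc>` values, then compose them) might look like the sketch below. It assumes the standard-library `xml.etree.ElementTree` parser, which this page does not confirm sitemap.py uses; the helper name `extract_raw_locs` and the second sample entry are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>/docs/apps/</loc>
    <lastmod>2020-10-06T08:48:45+00:00</lastmod>
  </url>
  <url>
    <loc> https://waha.devlike.pro/docs/overview/ </loc>
  </url>
</urlset>"""


def extract_raw_locs(xml_text: str) -> list[str]:
    """Step 1: collect the trimmed text of every <loc> element."""
    root = ET.fromstring(xml_text)
    locs = []
    for element in root.iter():
        # Strip the "{namespace}" prefix ElementTree adds to tag names.
        if element.tag.rsplit("}", 1)[-1] == "loc":
            text = (element.text or "").strip()
            if text:
                locs.append(text)
    return locs


sitemap_url = "https://waha.devlike.pro/docs/sitemap.xml"
raw_locs = extract_raw_locs(SITEMAP_XML)

# Step 2: compose each raw value against the sitemap URL.
composed = [urljoin(sitemap_url, loc) for loc in raw_locs]
print(composed)
```

Separating collection from composition keeps XML handling and URL handling independently testable, which matches the split between malformed-XML tests and invalid-URL tests in the new test module.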

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant parse_sitemap
    participant HTTP
    participant XMLParser
    participant URLNormalizer

    Caller->>parse_sitemap: parse_sitemap(sitemap_url)
    
    note over parse_sitemap: Check for cancellation
    
    parse_sitemap->>HTTP: Fetch sitemap XML (30s timeout)
    HTTP-->>parse_sitemap: XML content or error
    
    alt HTTP Error
        parse_sitemap-->>Caller: Empty/partial list
    else Success
        parse_sitemap->>XMLParser: Parse XML
        XMLParser-->>parse_sitemap: Raw loc elements
        
        loop For each loc
            parse_sitemap->>URLNormalizer: Normalize URL
            
            alt Already absolute (http/https)
                URLNormalizer-->>parse_sitemap: Keep URL as-is (log success)
            else Relative URL
                URLNormalizer->>URLNormalizer: Compose with sitemap base
                
                alt Valid absolute result
                    URLNormalizer-->>parse_sitemap: Absolute URL (log composition)
                else Invalid composition
                    URLNormalizer-->>parse_sitemap: Skip + warn
                end
            else Invalid format
                URLNormalizer-->>parse_sitemap: Skip + debug
            end
        end
        
        parse_sitemap-->>Caller: list[str] (absolute URLs)
    end
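
The control flow in the diagram, with its up-front cancellation check and per-URL error isolation, can be sketched as below. This is an assumption-laden illustration: `compose_all`, its `cancellation_check` parameter, and the log messages are hypothetical, and the HTTP fetch and XML parsing steps are omitted.

```python
import asyncio
import logging
from typing import Callable, Optional
from urllib.parse import urljoin, urlparse

logger = logging.getLogger(__name__)


def compose_all(
    sitemap_url: str,
    raw_locs: list[str],
    cancellation_check: Optional[Callable[[], None]] = None,
) -> list[str]:
    """Hypothetical sketch of the loop shown in the sequence diagram."""
    urls: list[str] = []
    try:
        if cancellation_check is not None:
            cancellation_check()  # expected to raise CancelledError if cancelled
        for raw in raw_locs:
            try:
                composed = urljoin(sitemap_url, raw.strip())
                parsed = urlparse(composed)
            except ValueError as exc:
                # One bad entry must not abort the rest of the sitemap.
                logger.warning("Skipping unparseable URL %r: %s", raw, exc)
                continue
            if parsed.scheme in ("http", "https") and parsed.netloc:
                urls.append(composed)
    except asyncio.CancelledError:
        raise  # never swallow cancellation
    except Exception:
        logger.exception("Sitemap processing failed for %s", sitemap_url)
    return urls  # possibly partial
```

Listing `asyncio.CancelledError` before the broad handler makes the intent explicit; on current Python versions it derives from `BaseException`, so `except Exception` would not catch it in any case.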

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas requiring extra attention:

  • URL composition logic: validate that urljoin correctly handles edge cases (parent-relative paths, subdirectories, trailing slashes)
  • Error handling flow: verify nested try/except blocks properly isolate per-URL errors without breaking overall parsing
  • Cancellation handling: ensure asyncio.CancelledError check is correctly positioned and doesn't mask other errors
  • Logging consistency: confirm log messages accurately reflect URL transformation steps and edge case handling
  • Return contract: validate that all returned URLs are guaranteed to be absolute and valid (http/https with netloc)
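
For the first review point, the `urljoin` edge cases can be checked directly. A small sketch, assuming `example.com` as a placeholder host:

```python
from urllib.parse import urljoin

base = "https://example.com/docs/sitemap.xml"

# Relative reference -> expected absolute URL.
cases = {
    "apps/": "https://example.com/docs/apps/",           # sibling of the sitemap
    "../blog/": "https://example.com/blog/",             # parent-relative
    "/docs/apps": "https://example.com/docs/apps",       # root-relative
    "//cdn.example.com/x": "https://cdn.example.com/x",  # scheme-relative
}
for relative, expected in cases.items():
    assert urljoin(base, relative) == expected

# A trailing slash on the base changes what "relative" means.
with_slash = urljoin("https://example.com/docs/", "apps/")
without_slash = urljoin("https://example.com/docs", "apps/")
print(with_slash)     # https://example.com/docs/apps/
print(without_slash)  # https://example.com/apps/
```

The last two lines show why the base should be the full sitemap URL rather than a hand-trimmed directory: without a trailing slash, the final path segment is treated as a file name and dropped.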

Poem

🐰 A sitemap's paths, once lost and small,
Now compass-bound, they're found at all!
From /docs/ whispers to full URLs bright,
Relative becomes absolute, what a sight! ✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name | Status | Explanation
Title Check | ✅ Passed | The PR title "Fix #825: Handle relative URLs in sitemap parsing" clearly and directly describes the main change in the pull request. It accurately reflects that this is a bug fix addressing issue #825 and specifically targets the handling of relative URLs in sitemap parsing. The title is concise, free of noise (no emojis or vague terms), and provides sufficient context for a developer scanning the repository history to understand the primary change.
Linked Issues Check | ✅ Passed | The pull request successfully addresses the primary coding-related objectives from issue #825. It implements automatic URL composition using urllib.parse.urljoin() to convert relative URLs to absolute URLs [#825], validates composed URLs to ensure valid HTTP(S) schemes and netloc [#825], and updates the sitemap parsing logic in python/src/server/services/crawling/strategies/sitemap.py [#825]. The PR provides graceful error handling by skipping invalid URLs with warnings and includes comprehensive test coverage (10 new test cases) that validate absolute URLs, relative URLs, mixed sitemaps, parent-relative paths, invalid URLs, and error scenarios. The preferred solution approach (automatic composition) was selected from the issue's options rather than the fallback option. The PR notes that UI-level error visibility improvements are deferred to a potential follow-up PR.
Out of Scope Changes Check | ✅ Passed | All changes in the pull request are directly related to the objective of handling relative URLs in sitemap parsing. The modifications to sitemap.py include adding URL utility imports, implementing relative-to-absolute URL composition logic, enhancing error handling with try/except blocks and cancellation checks, and improving logging, all of which are necessary to fulfill the PR objectives. The new test file test_sitemap_relative_urls.py with 10 test cases comprehensively validates the implemented functionality for absolute URLs, relative URLs, mixed formats, subdirectory paths, invalid URLs, HTTP/network errors, malformed XML, and real-world examples. No extraneous changes outside the scope of relative URL handling in sitemap parsing are evident in the provided summaries.
Description Check | ✅ Passed | The PR description is comprehensive and covers all essential information needed to understand the changes, though it deviates from the template format by using custom emoji-based sections instead of the template's checkbox-based structure. The description includes a clear problem statement, solution explanation with examples, detailed testing evidence with 10 test cases, impact analysis, and file changes. While the template's "Type of Change", "Affected Services", and structured "Testing" sections are not filled out exactly as specified, the description compensates by providing thorough contextual information. The custom checklist and additional notes provide equivalent or more detailed content than the template would require.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


coderabbitai[bot] avatar Oct 27 '25 09:10 coderabbitai[bot]