Fix #825: Handle relative URLs in sitemap parsing

Open thiagomaf opened this issue 2 months ago • 1 comments

🐛 Problem

Sitemap ingestion fails when XML contains relative URLs instead of absolute URLs.

Original Error

[ERROR]... × /docs/apps/ | Error: 
Unexpected error in _crawl_web at line 500:
Error: URL must start with 'http://', 'https://', 'file://', or 'raw:'

Root Cause: Many sitemap generators output relative URLs (e.g., /docs/apps/) to keep the XML portable across domains. The sitemap parser was extracting these relative URLs as-is without composing them into absolute URLs, causing the crawler to reject them.

Example Problematic Sitemap

<url>
  <loc>/docs/apps/</loc>
  <lastmod>2020-10-06T08:48:45+00:00</lastmod>
</url>

✅ Solution

Modified sitemap.py to resolve relative URLs against the sitemap's own URL with urllib.parse.urljoin, so every extracted URL is absolute before it reaches the crawler.

How it Works

# Input:  sitemap_url = "https://waha.devlike.pro/docs/sitemap.xml"
#         relative_url = "/docs/apps/"
# Output: "https://waha.devlike.pro/docs/apps/"
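The composition step above can be reproduced with the standard library alone. This is a minimal sketch of the `urljoin` call, not the code in sitemap.py; the variable names mirror the comment block and the second URL is an illustrative assumption.

```python
from urllib.parse import urljoin

sitemap_url = "https://waha.devlike.pro/docs/sitemap.xml"
relative_url = "/docs/apps/"

# A root-relative path replaces the whole path of the base URL.
absolute_url = urljoin(sitemap_url, relative_url)

# An already-absolute URL is returned unchanged (illustrative value).
untouched_url = urljoin(sitemap_url, "https://example.com/page")

print(absolute_url)
print(untouched_url)
```

Because `urljoin` leaves absolute inputs alone, the same call is safe for sitemaps that mix both forms.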

Changes Made

Added imports: urljoin and urlparse from urllib.parse
Enhanced parsing logic:
- Detects if extracted URL is absolute (has scheme + netloc)
- If relative, composes absolute URL using urljoin(sitemap_url, relative_url)
- Validates composed URLs have valid HTTP(S) scheme and netloc
- Skips invalid URLs with warnings
Improved logging: Debug logs for composed URLs
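The parsing steps listed above could look roughly like the following. This is a sketch under assumptions: `compose_absolute_url` is a hypothetical helper name, and the exact log messages and control flow in sitemap.py may differ.

```python
import logging
from typing import Optional
from urllib.parse import urljoin, urlparse

logger = logging.getLogger(__name__)


def compose_absolute_url(sitemap_url: str, raw_url: str) -> Optional[str]:
    """Hypothetical helper: return an absolute URL, or None if it must be skipped."""
    candidate = raw_url.strip()
    if not candidate:
        return None

    parsed = urlparse(candidate)
    if parsed.scheme and parsed.netloc:
        # Already absolute (scheme + netloc): keep it unchanged.
        return candidate

    # Relative: resolve against the URL the sitemap was fetched from.
    composed = urljoin(sitemap_url, candidate)
    check = urlparse(composed)
    if check.scheme in ("http", "https") and check.netloc:
        logger.debug("Composed %s -> %s", candidate, composed)
        return composed

    logger.warning("Skipping invalid sitemap URL: %s", candidate)
    return None
```

Returning `None` for rejected entries lets the caller filter them out without aborting the rest of the sitemap.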

🧪 Testing

Comprehensive Test Suite

Added test_sitemap_relative_urls.py with 10 test cases:

Test	Coverage
✅ Absolute URLs	Preserved unchanged
✅ Relative URLs	Composed to absolute
✅ Mixed sitemaps	Both types handled
✅ Subdirectory paths	Parent-relative (`../`) works
✅ Invalid URLs	Gracefully skipped
✅ HTTP errors	Handled gracefully
✅ Network errors	Handled gracefully
✅ Malformed XML	Handled gracefully
✅ Whitespace trimming	URLs cleaned properly
✅ Real-world example	waha.devlike.pro scenario
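Several rows of the table reduce to plain `urljoin` behaviour, which can be checked without the test module. The base URL and the expected values below are illustrative assumptions, not cases copied from test_sitemap_relative_urls.py.

```python
from urllib.parse import urljoin

# Assumed sitemap location inside a subdirectory, for illustration only.
BASE = "https://example.com/docs/guides/sitemap.xml"

# raw <loc> text -> expected absolute URL
CASES = {
    "https://other.example.org/page": "https://other.example.org/page",  # absolute
    "/docs/apps/": "https://example.com/docs/apps/",                      # root-relative
    "../apps/": "https://example.com/docs/apps/",                         # parent-relative
    "intro/": "https://example.com/docs/guides/intro/",                   # sibling path
    "  /docs/apps/\n": "https://example.com/docs/apps/",                  # whitespace
}


def resolve(raw: str) -> str:
    # Trim whitespace first, then resolve against the sitemap URL.
    return urljoin(BASE, raw.strip())


mismatches = {raw: resolve(raw) for raw, want in CASES.items() if resolve(raw) != want}
print(mismatches or "all cases resolve as expected")
```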

Test Results

$ pytest tests/test_sitemap_relative_urls.py -v
======================== 10 passed in 1.19s ========================

$ pytest tests/test_url_handler.py -v
======================== 15 passed in 1.05s ========================

Manual Testing

✅ Successfully crawled https://waha.devlike.pro/docs/sitemap.xml
✅ All relative URLs composed correctly
✅ No more "URL must start with 'http://'" errors
✅ Task completes without disappearing from UI

📊 Impact

Before Fix ❌

Sitemaps with relative URLs fail to crawl
Users get cryptic error messages in logs
Tasks disappear from UI without explanation
Common sitemap format not supported

After Fix ✅

Relative URLs automatically composed to absolute
Portable sitemaps now work
Better error handling with warnings
Backward compatible (absolute URLs unchanged)
Clear debug logging

🔍 Files Changed

python/
├── src/server/services/crawling/strategies/
│   └── sitemap.py                           # Modified: +28 lines
└── tests/
    └── test_sitemap_relative_urls.py        # New: 252 lines

✨ Benefits

Broader Compatibility: Supports sitemaps from more generators
Better UX: No cryptic errors for common sitemap formats
Portable Sitemaps: Sites can use relative URLs for flexibility
Robust Validation: Invalid URLs skipped gracefully
Backward Compatible: Existing absolute URLs work unchanged
Better Logging: Debug info for troubleshooting

🔗 Related Issues

Resolves #825

✅ Checklist

[x] Code follows project style guidelines (Ruff, MyPy compliant)
[x] All tests pass (10/10 new tests + 15/15 existing tests)
[x] No regressions in existing functionality
[x] Backward compatible with existing sitemaps
[x] Comprehensive test coverage added
[x] Manual testing completed successfully
[x] Documentation updated (via code comments and docstrings)

📸 Test Evidence

Successfully crawled sitemap that was previously failing:

INFO | Parsing sitemap: https://waha.devlike.pro/docs/sitemap.xml
INFO | Successfully extracted 76 URLs from sitemap (composed relative URLs)
INFO | Inserted batch 4 of 4 code examples

🙏 Additional Notes

This fix addresses the sitemap parsing issue. The UI error handling (tasks disappearing without notification) mentioned in #825 could be addressed in a follow-up PR if desired.

Tested on:

Docker: archon-server container
Python: 3.12.12
All services healthy and functional

Summary by CodeRabbit

Bug Fixes

Improved sitemap parsing to correctly handle both absolute and relative URLs
Relative URLs in sitemaps are now properly converted to absolute URLs based on sitemap context
Enhanced error handling and validation during URL extraction with detailed logging
Better handling of malformed XML, network errors, and HTTP errors in sitemap processing

Oct 27 '25 09:10 thiagomaf

Walkthrough

This change addresses bug #825 by implementing relative URL conversion in sitemap parsing. The parser now uses urljoin to compose absolute URLs from the sitemap base URL and relative paths, while preserving already-absolute URLs. Enhanced error handling includes cancellation checks and per-URL exception handling, with detailed logging of successes and composition attempts.
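The two-step flow described in the walkthrough (collect raw `<loc>` values, then compose and validate each one) can be sketched as follows. The function name `extract_absolute_urls`, the `{*}loc` namespace wildcard, and the omission of fetching, cancellation, and logging are all assumptions made to keep the example self-contained; the real parser may be structured differently.

```python
from urllib.parse import urljoin, urlparse
from xml.etree import ElementTree


def extract_absolute_urls(xml_text: str, sitemap_url: str) -> list[str]:
    """Hypothetical sketch: parse sitemap XML and return absolute http(s) URLs."""
    try:
        root = ElementTree.fromstring(xml_text)
    except ElementTree.ParseError:
        return []  # malformed XML: nothing to extract

    # Step 1: collect raw <loc> text, ignoring the XML namespace (Python 3.8+).
    raw_urls = [
        loc.text.strip()
        for loc in root.iterfind(".//{*}loc")
        if loc.text and loc.text.strip()
    ]

    # Step 2: compose and validate each URL in isolation.
    urls: list[str] = []
    for raw in raw_urls:
        try:
            parsed = urlparse(raw)
            if parsed.scheme and parsed.netloc:
                urls.append(raw)
                continue
            composed = urljoin(sitemap_url, raw)
            check = urlparse(composed)
            if check.scheme in ("http", "https") and check.netloc:
                urls.append(composed)
        except ValueError:
            continue  # one bad entry must not abort the whole sitemap
    return urls
```

Keeping the `try`/`except` inside the loop is what isolates a single bad entry from the rest of the list.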

Changes

Cohort / File(s)	Summary
Sitemap Parser Enhancement `python/src/server/services/crawling/strategies/sitemap.py`	Added URL utilities (`urlparse`, `urljoin`) to normalize relative URLs to absolute URLs. Replaces direct text extraction with two-step process: collect raw URLs, then compose and validate relative ones. Introduces optional cancellation check, nested try/except blocks for per-URL processing, and expanded logging for successes, composition attempts, and composition failures. Return type documents and logs absolute URLs.
Relative URL Test Suite `python/tests/test_sitemap_relative_urls.py`	New comprehensive test module verifying sitemap parsing with mixed absolute/relative URLs. Tests URL preservation, composition, subdirectory/parent-relative resolution, invalid URL filtering, HTTP/XML error handling, whitespace trimming, and real-world scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant parse_sitemap
    participant HTTP
    participant XMLParser
    participant URLNormalizer

    Caller->>parse_sitemap: parse_sitemap(sitemap_url)
    
    note over parse_sitemap: Check for cancellation
    
    parse_sitemap->>HTTP: Fetch sitemap XML (30s timeout)
    HTTP-->>parse_sitemap: XML content or error
    
    alt HTTP Error
        parse_sitemap-->>Caller: Empty/partial list
    else Success
        parse_sitemap->>XMLParser: Parse XML
        XMLParser-->>parse_sitemap: Raw loc elements
        
        loop For each loc
            parse_sitemap->>URLNormalizer: Normalize URL
            
            alt Already absolute (http/https)
                URLNormalizer-->>parse_sitemap: Keep URL as-is (log success)
            else Relative URL
                URLNormalizer->>URLNormalizer: Compose with sitemap base
                
                alt Valid absolute result
                    URLNormalizer-->>parse_sitemap: Absolute URL (log composition)
                else Invalid composition
                    URLNormalizer-->>parse_sitemap: Skip + warn
                end
            else Invalid format
                URLNormalizer-->>parse_sitemap: Skip + debug
            end
        end
        
        parse_sitemap-->>Caller: list[str] (absolute URLs)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas requiring extra attention:

URL composition logic: validate that urljoin correctly handles edge cases (parent-relative paths, subdirectories, trailing slashes)
Error handling flow: verify nested try/except blocks properly isolate per-URL errors without breaking overall parsing
Cancellation handling: ensure asyncio.CancelledError check is correctly positioned and doesn't mask other errors
Logging consistency: confirm log messages accurately reflect URL transformation steps and edge case handling
Return contract: validate that all returned URLs are guaranteed to be absolute and valid (http/https with netloc)
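For the trailing-slash edge case in the first item, the standard `urljoin` behaviour is worth seeing directly. The URLs below are illustrative assumptions; the sketch shows only how the base URL's final segment affects resolution.

```python
from urllib.parse import urljoin

# Without a trailing slash, the last base segment ("docs") is treated like a
# file name and is replaced by the relative path.
no_slash = urljoin("https://example.com/docs", "apps/")

# With a trailing slash, the relative path is appended under /docs/.
with_slash = urljoin("https://example.com/docs/", "apps/")

# A root-relative path ignores the base path entirely.
root_relative = urljoin("https://example.com/docs/", "/apps/")

# A protocol-relative reference keeps only the base scheme.
protocol_relative = urljoin("https://example.com/docs/", "//cdn.example.com/x")

for url in (no_slash, with_slash, root_relative, protocol_relative):
    print(url)
```

A sitemap URL normally ends in a file name such as sitemap.xml, so document-relative entries resolve against the sitemap's directory.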

Poem

🐰 A sitemap's paths, once lost and small, Now compass-bound, they're found at all! From /docs/ whispers to full URLs bright, Relative becomes absolute—what a sight! ✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The PR title "Fix #825: Handle relative URLs in sitemap parsing" clearly and directly describes the main change in the pull request. It accurately reflects that this is a bug fix addressing issue #825 and specifically targets the handling of relative URLs in sitemap parsing. The title is concise, free of noise (no emojis or vague terms), and provides sufficient context for a developer scanning the repository history to understand the primary change.
Linked Issues Check	✅ Passed	The pull request successfully addresses the primary coding-related objectives from issue #825. It implements automatic URL composition using `urllib.parse.urljoin()` to convert relative URLs to absolute URLs [#825], validates composed URLs to ensure valid HTTP(S) schemes and netloc [#825], and updates the sitemap parsing logic in `python/src/server/services/crawling/strategies/sitemap.py` [#825]. The PR provides graceful error handling by skipping invalid URLs with warnings and includes comprehensive test coverage (10 new test cases) that validate absolute URLs, relative URLs, mixed sitemaps, parent-relative paths, invalid URLs, and error scenarios. The preferred solution approach (automatic composition) was selected from the issue's options rather than the fallback option. The PR notes that UI-level error visibility improvements are deferred to a potential follow-up PR.
Out of Scope Changes Check	✅ Passed	All changes in the pull request are directly related to the objective of handling relative URLs in sitemap parsing. The modifications to `sitemap.py` include adding URL utility imports, implementing relative-to-absolute URL composition logic, enhancing error handling with try/except blocks and cancellation checks, and improving logging—all of which are necessary to fulfill the PR objectives. The new test file `test_sitemap_relative_urls.py` with 10 test cases comprehensively validates the implemented functionality for absolute URLs, relative URLs, mixed formats, subdirectory paths, invalid URLs, HTTP/network errors, malformed XML, and real-world examples. No extraneous changes outside the scope of relative URL handling in sitemap parsing are evident in the provided summaries.
Description Check	✅ Passed	The PR description is comprehensive and covers all essential information needed to understand the changes, though it deviates from the template format by using custom emoji-based sections instead of the template's checkbox-based structure. The description includes a clear problem statement, solution explanation with examples, detailed testing evidence with 10 test cases, impact analysis, and file changes. While the template's "Type of Change", "Affected Services", and structured "Testing" sections are not filled out exactly as specified, the description compensates by providing thorough contextual information. The custom checklist and additional notes provide equivalent or more detailed content than the template would require.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


Oct 27 '25 09:10 coderabbitai[bot]