Fix #825: Handle relative URLs in sitemap parsing
Problem
Sitemap ingestion fails when XML contains relative URLs instead of absolute URLs.
Original Error
[ERROR]... × /docs/apps/ | Error:
Unexpected error in _crawl_web at line 500:
Error: URL must start with 'http://', 'https://', 'file://', or 'raw:'
Root Cause: Many sitemap generators output relative URLs (e.g., /docs/apps/) to keep the XML portable across domains. The sitemap parser was extracting these relative URLs as-is without composing them into absolute URLs, causing the crawler to reject them.
Example Problematic Sitemap
<url>
<loc>/docs/apps/</loc>
<lastmod>2020-10-06T08:48:45+00:00</lastmod>
</url>
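To make the failure concrete, here is a minimal sketch that extracts the `<loc>` value from a sitemap like the one above and checks it against the prefixes listed in the error message. The surrounding `<urlset>` wrapper and the `ALLOWED_PREFIXES` name are assumptions added for illustration; this is not the crawler's actual validation code.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper: the PR only shows the <url> entry itself.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>/docs/apps/</loc>
    <lastmod>2020-10-06T08:48:45+00:00</lastmod>
  </url>
</urlset>"""

# Prefixes taken from the error message quoted above (hypothetical constant name).
ALLOWED_PREFIXES = ("http://", "https://", "file://", "raw:")

NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP_XML)
raw_locs = [loc.text for loc in root.findall(".//sm:loc", NAMESPACES)]

# A relative path matches none of the accepted prefixes, so it is rejected as-is.
rejected = [url for url in raw_locs if not url.startswith(ALLOWED_PREFIXES)]
```

Passing `raw_locs` straight to a crawler that enforces those prefixes would reproduce the rejection described in the root cause.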
Solution
Modified sitemap.py to automatically compose relative URLs to absolute URLs using urllib.parse.urljoin.
How it Works
from urllib.parse import urljoin

sitemap_url = "https://waha.devlike.pro/docs/sitemap.xml"
relative_url = "/docs/apps/"
urljoin(sitemap_url, relative_url)  # "https://waha.devlike.pro/docs/apps/"
Changes Made
- Added imports: urljoin and urlparse from urllib.parse
- Enhanced parsing logic:
  - Detects if extracted URL is absolute (has scheme + netloc)
  - If relative, composes absolute URL using urljoin(sitemap_url, relative_url)
  - Validates composed URLs have valid HTTP(S) scheme and netloc
  - Skips invalid URLs with warnings
- Improved logging: Debug logs for composed URLs
Testing
Comprehensive Test Suite
Added test_sitemap_relative_urls.py with 10 test cases:
| Test | Coverage |
|---|---|
| Absolute URLs | Preserved unchanged |
| Relative URLs | Composed to absolute |
| Mixed sitemaps | Both types handled |
| Subdirectory paths | Parent-relative (../) works |
| Invalid URLs | Gracefully skipped |
| HTTP errors | Handled gracefully |
| Network errors | Handled gracefully |
| Malformed XML | Handled gracefully |
| Whitespace trimming | URLs cleaned properly |
| Real-world example | waha.devlike.pro scenario |
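As a rough illustration of what a few of these rows assert, the sketch below expresses them as input/expected pairs. It is not taken from test_sitemap_relative_urls.py: the `BASE` URL, the `CASES` table, and the `resolve` stand-in are assumptions made up for this example.

```python
from urllib.parse import urljoin

BASE = "https://example.com/docs/sitemap.xml"  # hypothetical sitemap location

# (raw <loc> text, expected absolute URL)
CASES = [
    ("https://example.com/page", "https://example.com/page"),  # absolute preserved
    ("/docs/apps/", "https://example.com/docs/apps/"),         # relative composed
    ("../blog/", "https://example.com/blog/"),                 # parent-relative
    ("  /docs/apps/\n", "https://example.com/docs/apps/"),     # whitespace trimmed
]


def resolve(raw: str, base: str = BASE) -> str:
    """Stand-in for the behaviour under test, not the code in sitemap.py."""
    return urljoin(base, raw.strip())


def check_cases() -> None:
    for raw, expected in CASES:
        assert resolve(raw) == expected, (raw, expected)
```

In a pytest suite the same table would typically be fed to `pytest.mark.parametrize` so that each row reports as its own test.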
Test Results
$ pytest tests/test_sitemap_relative_urls.py -v
======================== 10 passed in 1.19s ========================
$ pytest tests/test_url_handler.py -v
======================== 15 passed in 1.05s ========================
Manual Testing
- Successfully crawled https://waha.devlike.pro/docs/sitemap.xml
- All relative URLs composed correctly
- No more "URL must start with 'http://'" errors
- Task completes without disappearing from UI
Impact
Before Fix
- Sitemaps with relative URLs fail to crawl
- Users get cryptic error messages in logs
- Tasks disappear from UI without explanation
- Common sitemap format not supported
After Fix
- Relative URLs automatically composed to absolute
- Portable sitemaps now work
- Better error handling with warnings
- Backward compatible (absolute URLs unchanged)
- Clear debug logging
Files Changed
python/
├── src/server/services/crawling/strategies/
│   └── sitemap.py                      # Modified: +28 lines
└── tests/
    └── test_sitemap_relative_urls.py   # New: 252 lines
Benefits
- Broader Compatibility: Supports sitemaps from more generators
- Better UX: No cryptic errors for common sitemap formats
- Portable Sitemaps: Sites can use relative URLs for flexibility
- Robust Validation: Invalid URLs skipped gracefully
- Backward Compatible: Existing absolute URLs work unchanged
- Better Logging: Debug info for troubleshooting
Related Issues
Resolves #825
Checklist
- [x] Code follows project style guidelines (Ruff, MyPy compliant)
- [x] All tests pass (10/10 new tests + 15/15 existing tests)
- [x] No regressions in existing functionality
- [x] Backward compatible with existing sitemaps
- [x] Comprehensive test coverage added
- [x] Manual testing completed successfully
- [x] Documentation updated (via code comments and docstrings)
Test Evidence
Successfully crawled sitemap that was previously failing:
INFO | Parsing sitemap: https://waha.devlike.pro/docs/sitemap.xml
INFO | Successfully extracted 76 URLs from sitemap (composed relative URLs)
INFO | Inserted batch 4 of 4 code examples
Additional Notes
This fix addresses the sitemap parsing issue. The UI error handling (tasks disappearing without notification) mentioned in #825 could be addressed in a follow-up PR if desired.
Tested on:
- Docker: archon-server container
- Python: 3.12.12
- All services healthy and functional
Summary by CodeRabbit
Bug Fixes
- Improved sitemap parsing to correctly handle both absolute and relative URLs
- Relative URLs in sitemaps are now properly converted to absolute URLs based on sitemap context
- Enhanced error handling and validation during URL extraction with detailed logging
- Better handling of malformed XML, network errors, and HTTP errors in sitemap processing
Walkthrough
This change addresses bug #825 by implementing relative URL conversion in sitemap parsing. The parser now uses urljoin to compose absolute URLs from the sitemap base URL and relative paths, while preserving already-absolute URLs. Enhanced error handling includes cancellation checks and per-URL exception handling, with detailed logging of successes and composition attempts.
Changes
| Cohort / File(s) | Summary |
|---|---|
| Sitemap Parser Enhancement python/src/server/services/crawling/strategies/sitemap.py | Added URL utilities (urlparse, urljoin) to normalize relative URLs to absolute URLs. Replaces direct text extraction with two-step process: collect raw URLs, then compose and validate relative ones. Introduces optional cancellation check, nested try/except blocks for per-URL processing, and expanded logging for successes, composition attempts, and composition failures. Return type documents and logs absolute URLs. |
| Relative URL Test Suite python/tests/test_sitemap_relative_urls.py | New comprehensive test module verifying sitemap parsing with mixed absolute/relative URLs. Tests URL preservation, composition, subdirectory/parent-relative resolution, invalid URL filtering, HTTP/XML error handling, whitespace trimming, and real-world scenarios. |
Sequence Diagram(s)
sequenceDiagram
    participant Caller
    participant parse_sitemap
    participant HTTP
    participant XMLParser
    participant URLNormalizer
    Caller->>parse_sitemap: parse_sitemap(sitemap_url)
    note over parse_sitemap: Check for cancellation
    parse_sitemap->>HTTP: Fetch sitemap XML (30s timeout)
    HTTP-->>parse_sitemap: XML content or error
    alt HTTP Error
        parse_sitemap-->>Caller: Empty/partial list
    else Success
        parse_sitemap->>XMLParser: Parse XML
        XMLParser-->>parse_sitemap: Raw loc elements
        loop For each loc
            parse_sitemap->>URLNormalizer: Normalize URL
            alt Already absolute (http/https)
                URLNormalizer-->>parse_sitemap: Keep URL as-is (log success)
            else Relative URL
                URLNormalizer->>URLNormalizer: Compose with sitemap base
                alt Valid absolute result
                    URLNormalizer-->>parse_sitemap: Absolute URL (log composition)
                else Invalid composition
                    URLNormalizer-->>parse_sitemap: Skip + warn
                end
            else Invalid format
                URLNormalizer-->>parse_sitemap: Skip + debug
            end
        end
        parse_sitemap-->>Caller: list[str] (absolute URLs)
    end
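The control flow in the diagram can be approximated in plain Python as below. This is a simplified sketch under stated assumptions, not the code in sitemap.py: `parse_sitemap_sketch`, the injected `fetch` callable, and the `is_cancelled` hook are hypothetical stand-ins for the real HTTP request and cancellation check.

```python
import logging
import xml.etree.ElementTree as ET
from collections.abc import Callable
from urllib.parse import urljoin, urlparse

logger = logging.getLogger(__name__)


def parse_sitemap_sketch(
    sitemap_url: str,
    fetch: Callable[[str], str],
    is_cancelled: Callable[[], bool] = lambda: False,
) -> list[str]:
    """Illustrative flow only; `fetch` and `is_cancelled` are hypothetical hooks."""
    if is_cancelled():
        return []
    try:
        xml_text = fetch(sitemap_url)  # stands in for the HTTP request
        root = ET.fromstring(xml_text)
    except (OSError, ET.ParseError) as exc:
        logger.error("Could not load sitemap %s: %s", sitemap_url, exc)
        return []

    urls: list[str] = []
    for element in root.iter():
        raw = (element.text or "").strip()
        # Match <loc> regardless of XML namespace; skip blank entries.
        if element.tag.rsplit("}", 1)[-1] != "loc" or not raw:
            continue
        try:
            absolute = urljoin(sitemap_url, raw)
            parts = urlparse(absolute)
            if parts.scheme in ("http", "https") and parts.netloc:
                urls.append(absolute)
            else:
                logger.warning("Skipping invalid URL: %r", raw)
        except ValueError as exc:
            # One bad entry should not abort the whole sitemap.
            logger.warning("Skipping unparsable URL %r: %s", raw, exc)
    return urls
```

Injecting `fetch` keeps the sketch testable without network access, which is also how the HTTP-error and malformed-XML branches can be exercised in isolation.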
Estimated code review effort
3 (Moderate) | ~20 minutes
Areas requiring extra attention:
- URL composition logic: validate that urljoin correctly handles edge cases (parent-relative paths, subdirectories, trailing slashes)
- Error handling flow: verify nested try/except blocks properly isolate per-URL errors without breaking overall parsing
- Cancellation handling: ensure asyncio.CancelledError check is correctly positioned and doesn't mask other errors
- Logging consistency: confirm log messages accurately reflect URL transformation steps and edge case handling
- Return contract: validate that all returned URLs are guaranteed to be absolute and valid (http/https with netloc)
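For reviewers checking the first item, the snippet below shows how the standard library's `urljoin` behaves on a few of these edge cases. The example.com URLs and variable names are placeholders chosen for illustration, not values from the PR.

```python
from urllib.parse import urljoin

# A trailing slash on the base decides whether its last segment is kept.
with_slash = urljoin("https://example.com/docs/", "apps/")
# -> "https://example.com/docs/apps/"
without_slash = urljoin("https://example.com/docs", "apps/")
# -> "https://example.com/apps/"

# More "../" segments than the base path has are clamped at the root.
above_root = urljoin("https://example.com/docs/sitemap.xml", "../../apps/")
# -> "https://example.com/apps/"

# Scheme-relative references inherit only the scheme from the base.
scheme_relative = urljoin("https://example.com/sitemap.xml", "//cdn.example.com/a")
# -> "https://cdn.example.com/a"

# An empty reference resolves to the base itself, so blank <loc> values
# need an explicit guard before composition.
empty_ref = urljoin("https://example.com/docs/sitemap.xml", "")
# -> "https://example.com/docs/sitemap.xml"
```

The last case is worth a dedicated check: without a guard, a blank `<loc>` would silently turn into the sitemap's own URL.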
Poem
A sitemap's paths, once lost and small,
Now compass-bound, they're found at all!
From /docs/ whispers to full URLs bright,
Relative becomes absolute, what a sight!
Pre-merge checks and finishing touches
Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | Passed | The PR title "Fix #825: Handle relative URLs in sitemap parsing" clearly and directly describes the main change in the pull request. It accurately reflects that this is a bug fix addressing issue #825 and specifically targets the handling of relative URLs in sitemap parsing. The title is concise, free of noise (no emojis or vague terms), and provides sufficient context for a developer scanning the repository history to understand the primary change. |
| Linked Issues Check | Passed | The pull request successfully addresses the primary coding-related objectives from issue #825. It implements automatic URL composition using urllib.parse.urljoin() to convert relative URLs to absolute URLs [#825], validates composed URLs to ensure valid HTTP(S) schemes and netloc [#825], and updates the sitemap parsing logic in python/src/server/services/crawling/strategies/sitemap.py [#825]. The PR provides graceful error handling by skipping invalid URLs with warnings and includes comprehensive test coverage (10 new test cases) that validate absolute URLs, relative URLs, mixed sitemaps, parent-relative paths, invalid URLs, and error scenarios. The preferred solution approach (automatic composition) was selected from the issue's options rather than the fallback option. The PR notes that UI-level error visibility improvements are deferred to a potential follow-up PR. |
| Out of Scope Changes Check | Passed | All changes in the pull request are directly related to the objective of handling relative URLs in sitemap parsing. The modifications to sitemap.py include adding URL utility imports, implementing relative-to-absolute URL composition logic, enhancing error handling with try/except blocks and cancellation checks, and improving logging, all of which are necessary to fulfill the PR objectives. The new test file test_sitemap_relative_urls.py with 10 test cases comprehensively validates the implemented functionality for absolute URLs, relative URLs, mixed formats, subdirectory paths, invalid URLs, HTTP/network errors, malformed XML, and real-world examples. No extraneous changes outside the scope of relative URL handling in sitemap parsing are evident in the provided summaries. |
| Description Check | Passed | The PR description is comprehensive and covers all essential information needed to understand the changes, though it deviates from the template format by using custom emoji-based sections instead of the template's checkbox-based structure. The description includes a clear problem statement, solution explanation with examples, detailed testing evidence with 10 test cases, impact analysis, and file changes. While the template's "Type of Change", "Affected Services", and structured "Testing" sections are not filled out exactly as specified, the description compensates by providing thorough contextual information. The custom checklist and additional notes provide equivalent or more detailed content than the template would require. |
| Docstring Coverage | Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |