validate-xml-rust
A lot of improvements. Thanks Claude.
PR Type
Enhancement, Tests
Description
- Implements a comprehensive XML validation tool with hybrid async/sync architecture and concurrent file processing
- Core validation engine: Async/sync validation with tokio-based concurrency control, semaphore-bounded parallelism, and comprehensive result aggregation with performance metrics
- Configuration management: Multi-source configuration support (TOML/JSON files, environment variables, CLI arguments) with validation and precedence handling
- Schema handling: Async schema loading with regex caching, support for both local and remote schemas, and two-tier caching system (in-memory and disk-based)
- File discovery: Async file discovery engine with extension filtering, glob pattern matching, and symlink handling
- Output system: Multiple output formatters (human-readable, JSON, summary) with progress tracking and configurable verbosity levels
- LibXML2 integration: Safe Rust wrapper around libxml2 FFI with thread-safe schema handling using Arc for concurrent access
- HTTP client: Network schema retrieval with retry logic, exponential backoff, and connection pooling
- Comprehensive testing: Extensive unit tests, integration tests, end-to-end tests, performance benchmarks, and mock implementations for testing without external dependencies
- CLI interface: Command-line argument parsing and validation using clap with support for extensions, threading, caching, and output formats
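The description mentions semaphore-bounded parallelism built on tokio. As a rough illustration of the same idea using only the standard library (not the PR's code), the sketch below limits how many validation jobs run at once with a small counting semaphore. `Semaphore`, `run_bounded`, and the `job` callback are hypothetical names introduced for this example, and OS threads stand in for async tasks.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from a Mutex and a Condvar.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `job` on every input with at most `max_in_flight` jobs running at once.
/// Returns the results in input order plus the peak number of concurrent jobs.
/// `max_in_flight` must be at least 1, otherwise the first acquire never returns.
fn run_bounded<F>(inputs: Vec<String>, max_in_flight: usize, job: F) -> (Vec<bool>, usize)
where
    F: Fn(&str) -> bool + Send + Sync + 'static,
{
    let sem = Arc::new(Semaphore::new(max_in_flight));
    let job = Arc::new(job);
    let running = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)
    let mut handles = Vec::with_capacity(inputs.len());

    for input in inputs {
        // Acquire before spawning so the spawner itself is throttled.
        sem.acquire();
        let (sem, job, running) = (Arc::clone(&sem), Arc::clone(&job), Arc::clone(&running));
        handles.push(thread::spawn(move || {
            {
                let mut r = running.lock().unwrap();
                r.0 += 1;
                r.1 = r.1.max(r.0);
            }
            let ok = (*job)(input.as_str());
            running.lock().unwrap().0 -= 1;
            sem.release();
            ok
        }));
    }

    let results: Vec<bool> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    let peak = running.lock().unwrap().1;
    (results, peak)
}
```

Taking the permit before spawning, rather than inside the worker, keeps the number of live workers bounded as well as the number of running jobs. An async version would typically await a permit before spawning each task for the same reason.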
Diagram Walkthrough
flowchart LR
CLI["CLI Arguments<br/>clap parsing"]
CONFIG["Configuration<br/>TOML/JSON/ENV"]
DISCOVERY["File Discovery<br/>async patterns"]
VALIDATOR["Validation Engine<br/>async/sync hybrid"]
SCHEMA["Schema Loader<br/>local/remote"]
CACHE["Two-tier Cache<br/>memory/disk"]
LIBXML["LibXML2 Wrapper<br/>thread-safe"]
HTTP["HTTP Client<br/>retry logic"]
OUTPUT["Output System<br/>multiple formats"]
CLI --> CONFIG
CONFIG --> DISCOVERY
DISCOVERY --> VALIDATOR
VALIDATOR --> SCHEMA
SCHEMA --> CACHE
SCHEMA --> HTTP
CACHE --> LIBXML
HTTP --> LIBXML
VALIDATOR --> OUTPUT
File Walkthrough
| Relevant files | |
|---|---|
| Enhancement | 8 files |
| Tests | 14 files |
| Additional files | 39 files |
Summary by CodeRabbit
Release Notes
- New Features
  - High-performance two-tier schema caching (in-memory and disk) with configurable TTL and size limits
  - Multiple output formats: human-readable, JSON, and summary reporting
  - Configuration file support (TOML/JSON) with environment variable and CLI overrides
  - Concurrent validation with bounded concurrency and configurable thread counts
  - Progress reporting during validation runs
  - Remote schema downloading with automatic retry logic and exponential backoff
- Documentation
  - Comprehensive user guide covering installation, quick start, and CI/CD integration
  - Expanded README with features, architecture details, performance benchmarks, and troubleshooting
Changed Files
Walkthrough
The PR transforms validate-xml from a basic tool into a production-grade async XML schema validator. Changes introduce modular architecture with file discovery, async HTTP client, two-tier caching, configuration management, error handling, output formatting, schema extraction, and comprehensive test coverage across 60+ new files and modules.
Changes
| Cohort / File(s) | Summary |
|---|---|
| CI/CD Configuration<br/>`.github/dependabot.yml`, `.github/workflows/ci.yml`, `.travis.yml` | Removed Travis config; added GitHub Actions CI with matrix builds across ubuntu/macos/windows, dependency caching, libxml2 setup, and Rust toolchain checks. |
| Project Metadata<br/>`Cargo.toml`, `deny.toml`, `LICENSE` | Bumped version to 0.2.0; updated Rust edition to 2024; pinned dependencies with explicit versions; added MIT license; configured cargo-deny for security/license scanning. |
| Documentation<br/>`README.md`, `CLAUDE.md`, `docs/USER_GUIDE.md` | Expanded README with features, benchmarks, integration guides; added CLAUDE.md for development context; introduced USER_GUIDE.md with installation, usage, and troubleshooting. |
| Core Library Modules<br/>`src/lib.rs`, `src/main.rs` | Established lib.rs as public API bootstrap with module re-exports; rewrote main.rs to async Tokio-based entry point with orchestration logic. |
| Error & Configuration<br/>`src/error.rs`, `src/config.rs` | Introduced unified error types (ValidationError, ConfigError, CacheError, NetworkError, LibXml2Error); added ConfigManager with file/env/CLI precedence and validation. |
| HTTP & Caching<br/>`src/http_client.rs`, `src/cache.rs` | Implemented AsyncHttpClient with retry/backoff; added two-tier SchemaCache (memory via Moka + disk via cacache) with metadata and TTL support. |
| File Processing<br/>`src/file_discovery.rs`, `src/schema_loader.rs` | Added FileDiscovery for async regex-based pattern filtering; implemented SchemaExtractor and SchemaLoader for local/remote schema resolution. |
| Validation & Output<br/>`src/validator.rs`, `src/libxml2.rs`, `src/output.rs`, `src/error_reporter.rs`, `src/cli.rs` | Built ValidationEngine orchestrating discovery/loading/validation; wrapped LibXML2 FFI for thread-safe schema validation; created output formatters (Human/JSON/Summary); added ErrorReporter and Clap-based CLI. |
| Test Infrastructure<br/>`tests/lib.rs`, `tests/common/mod.rs`, `tests/common/mocks.rs`, `tests/common/test_helpers.rs`, `tests/benchmarks/mod.rs` | Established test module structure; added comprehensive mocks (HTTP, filesystem, cache, validation); created performance timer and fixture utilities. |
| Unit Tests<br/>`tests/unit/mod.rs`, `tests/unit/*.rs` | Added unit tests for cache, config, error, file discovery, output, schema loader, validation components. |
| Integration Tests<br/>`tests/integration/mod.rs`, `tests/integration/*.rs`, `tests/*_integration_test.rs` | Created end-to-end validation workflow tests, output format validation, CLI integration, HTTP client, LibXML2, and file discovery integration tests. |
| Performance Tests<br/>`tests/benchmarks/performance_benchmarks.rs`, `tests/comprehensive_test_suite.rs`, `tests/working_comprehensive_tests.rs` | Implemented async benchmark suite for validation speed, caching, discovery, concurrency; added comprehensive performance and correctness test suites. |
| Test Fixtures<br/>`tests/fixtures/configs/*.toml`, `tests/fixtures/schemas/local/*.xsd`, `tests/fixtures/xml/**/*.xml` | Added configuration templates (default.toml, performance.toml); created XSD schemas (simple.xsd, complex.xsd, strict.xsd); included valid, invalid, and malformed XML samples. |
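The two-tier SchemaCache row above describes a memory tier in front of a disk tier with TTL support. A minimal sketch of that lookup order follows, under stated assumptions: `TwoTierCache`, `Tier`, and `evict_from_memory` are hypothetical names, a second `HashMap` stands in for the disk store, and none of this reflects the actual Moka or cacache APIs.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Which tier served a lookup.
#[derive(Debug, PartialEq)]
enum Tier {
    Memory,
    Disk,
}

struct Entry {
    data: Vec<u8>,
    stored_at: Instant,
}

impl Entry {
    fn is_fresh(&self, ttl: Duration) -> bool {
        self.stored_at.elapsed() < ttl
    }
}

/// Hypothetical two-tier cache; `disk` is an in-memory stand-in for a disk store.
struct TwoTierCache {
    memory: HashMap<String, Entry>,
    disk: HashMap<String, Entry>,
    ttl: Duration,
}

impl TwoTierCache {
    fn new(ttl: Duration) -> Self {
        TwoTierCache { memory: HashMap::new(), disk: HashMap::new(), ttl }
    }

    /// Writes through to both tiers with the same timestamp.
    fn set(&mut self, url: &str, data: Vec<u8>) {
        let now = Instant::now();
        self.memory.insert(url.to_string(), Entry { data: data.clone(), stored_at: now });
        self.disk.insert(url.to_string(), Entry { data, stored_at: now });
    }

    /// Simulates the memory tier dropping an entry (size limit, restart, ...).
    fn evict_from_memory(&mut self, url: &str) {
        self.memory.remove(url);
    }

    /// Memory first, then disk; a disk hit is promoted back into memory.
    /// Expired entries are removed from the tier where they were found.
    fn get(&mut self, url: &str) -> Option<(Vec<u8>, Tier)> {
        let ttl = self.ttl;
        if let Some(entry) = self.memory.get(url) {
            if entry.is_fresh(ttl) {
                return Some((entry.data.clone(), Tier::Memory));
            }
        }
        self.memory.remove(url);

        let hit = self
            .disk
            .get(url)
            .filter(|entry| entry.is_fresh(ttl))
            .map(|entry| (entry.data.clone(), entry.stored_at));
        match hit {
            Some((data, stored_at)) => {
                // Keep the original timestamp so promotion does not extend the TTL.
                self.memory
                    .insert(url.to_string(), Entry { data: data.clone(), stored_at });
                Some((data, Tier::Disk))
            }
            None => {
                self.disk.remove(url);
                None
            }
        }
    }
}
```

Promoting with the original timestamp is one way to keep both tiers agreeing on expiry; a design that refreshes the timestamp on promotion would let frequently read entries outlive their TTL.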
Sequence Diagram(s)
sequenceDiagram
participant CLI
participant ConfigMgr
participant FileDiscovery
participant SchemaLoader
participant Cache
participant HttpClient
participant LibXML2
participant OutputWriter
CLI->>ConfigMgr: load_config(cli_args)
ConfigMgr-->>CLI: Config (file/env/CLI merged)
CLI->>FileDiscovery: discover_files(directory)
FileDiscovery-->>CLI: Vec<PathBuf>
CLI->>LibXML2: new()
LibXML2->>LibXML2: initialize (std::sync::Once)
LibXML2-->>CLI: LibXml2Wrapper
loop For each file
CLI->>SchemaLoader: load_schema_for_file(path)
alt Schema in Cache
SchemaLoader->>Cache: get(schema_url)
Cache-->>SchemaLoader: CachedSchema (from memory or disk)
else Schema not cached
SchemaLoader->>HttpClient: download_schema(url)
HttpClient-->>SchemaLoader: Vec<u8> (with retries/backoff)
SchemaLoader->>Cache: set(schema_url, data)
Cache-->>SchemaLoader: CachedSchema
end
SchemaLoader-->>CLI: Arc<CachedSchema>
CLI->>LibXML2: validate_file(path, schema)
LibXML2-->>CLI: ValidationResult
CLI->>OutputWriter: write_file_result(result)
end
CLI->>OutputWriter: write_summary(results)
OutputWriter-->>CLI: formatted output
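The diagram shows the wrapper initializing itself through `std::sync::Once`. A minimal sketch of that pattern, assuming a process-wide setup step that must run exactly once no matter how many wrappers are created or from which threads: `Wrapper`, `global_init`, and `init_runs` are hypothetical names, and a counter stands in for the real library initialization call.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

/// Guards the one-time global setup.
static INIT: Once = Once::new();

/// Counts how often the stand-in initializer actually ran (for demonstration only).
static INIT_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a process-wide setup call of a C library.
fn global_init() {
    INIT_RUNS.fetch_add(1, Ordering::SeqCst);
}

/// Hypothetical wrapper handle. Holding one implies global setup has completed.
struct Wrapper {
    _private: (),
}

impl Wrapper {
    fn new() -> Self {
        // `call_once` runs the closure at most once; concurrent callers block
        // until the first call has finished, then return without running it.
        INIT.call_once(global_init);
        Wrapper { _private: () }
    }
}

/// Number of times the initializer ran.
fn init_runs() -> usize {
    INIT_RUNS.load(Ordering::SeqCst)
}
```

Because `call_once` blocks other callers until the first run completes, every `Wrapper` returned from `new()` can assume the setup is done, which is the guarantee an FFI wrapper needs before touching library state.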
Estimated code review effort
🎯 5 (Critical) | ⏱️ ~120 minutes
Key areas requiring detailed attention:
- src/libxml2.rs: FFI safety, thread-safety guarantees with std::sync::Once, unsafe blocks for pointer conversions, and Drop implementation
- src/cache.rs: Two-tier coherence logic, expiration semantics, concurrent access patterns, and metadata serialization
- src/validator.rs: Concurrency model via semaphore, timeout handling, progress callback mechanics, and result aggregation across async tasks
- src/config.rs: Precedence logic (file → env → CLI), validation constraints, and type-safe conversions across domains
- Cargo.toml dependency changes: Version pinning strategy, feature flag combinations (especially tokio, moka, cacache), and security/license implications via deny.toml
- src/http_client.rs: Exponential backoff calculation, retryable error classification, and streaming progress callback semantics
- Test fixtures and integration tests: Ensure fixture schemas align with validator expectations and concurrent test execution doesn't create race conditions
- GitHub Actions workflow: Matrix setup correctness, libxml2 installation portability across OS variants, and caching layer effectiveness
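For the exponential backoff calculation flagged in the list above, a reviewer may find it useful to compare the PR against a small reference shape. The sketch below is an assumption-laden illustration, not the PR's implementation: `backoff_delay` and `retry_with_backoff` are hypothetical names, the doubling factor and cap are example choices, and every error is treated as retryable (a real client would classify errors first).

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
/// Overflow in either step saturates to `max` instead of panicking.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Calls `op` until it succeeds or `max_retries` retries have been used.
/// `sleep` is injected so the caller decides how waiting happens
/// (a blocking sleep, an async timer, or a recorder in tests).
fn retry_with_backoff<T, E>(
    max_retries: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt: u32 = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                sleep(backoff_delay(base, max, attempt));
                attempt += 1;
            }
        }
    }
}
```

With a 100 ms base and a 1 s cap, retries 0, 1, and 2 wait 100 ms, 200 ms, and 400 ms, and any later retry waits the capped 1 s. The points worth checking in a real implementation are the same ones handled here: overflow for large attempt counts and an upper bound on the delay.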
Poem
🐰 From sync to async, this code hops high, With caches that layer and schemas that fly! Two tiers of speed, threads dancing concurrently, LibXML2 wrapped safe—validation runs fluently! Tests blooming like clover across every path, A validator reborn from the developer's wrath. ✨
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title 'A lot of improvements. Thanks Claude.' is vague and generic, providing no meaningful information about the actual changes in the PR. | Replace with a specific, descriptive title that highlights the main change, such as 'Implement comprehensive async XML validator with two-tier caching and CLI' or 'Add full XML validation engine with schema loading, caching, and async I/O'. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| reqwest@0.12.12 ⏵ 0.12.24 | ||||||
| anyhow@1.0.100 | ||||||
| serde@1.0.217 ⏵ 1.0.228 | ||||||
| moka@0.12.11 | ||||||
| libc@0.2.169 ⏵ 0.2.177 | ||||||
| regex@1.11.1 ⏵ 1.12.2 | ||||||
| clap@4.5.23 ⏵ 4.5.51 | ||||||
| atty@0.2.14 | ||||||
| tokio-stream@0.1.17 | ||||||
| futures@0.3.31 | ||||||
| num_cpus@1.17.0 | ||||||
| tokio-test@0.4.4 | ||||||
| mockall@0.13.1 | ||||||
| async-trait@0.1.89 | ||||||
| uuid@1.18.1 | ||||||
| chrono@0.4.42 | ||||||
| cacache@13.1.0 | ||||||
| toml@0.9.8 | ||||||
| dirs@5.0.1 ⏵ 6.0.0 | ||||||
| ignore@0.4.23 ⏵ 0.4.25 |
PR Compliance Guide 🔍
Below is a summary of compliance checks for this PR:
| Security Compliance | |
|---|---|
| ⚪ | Terminal escape injection. Description: Use of ANSI escape sequences for colorization without sanitization could cause misleading [...] |
| ⚪ | Resource exhaustion. Description: Unbounded task spawning per file with user-controlled input paths may lead to resource [...] |
| ⚪ | Env config trust. Description: Environment variable 'VALIDATE_XML_FORMAT' is parsed without strict validation beyond [...] |
| Ticket Compliance | |
| ⚪ | 🎫 No ticket provided |
| Codebase Duplication Compliance | |
| ⚪ | Codebase context is not defined. Follow the guide to enable codebase context checks. |
| Custom Compliance | |
| 🟢 | Generic: Meaningful Naming and Self-Documenting Code. Objective: Ensure all identifiers clearly express their purpose and intent, making code [...] Status: Passed |
| 🟢 | Generic: Secure Logging Practices. Objective: To ensure logs are useful for debugging and auditing without exposing sensitive [...] Status: Passed |
| ⚪ | Generic: Comprehensive Audit Trails. Objective: To create a detailed and reliable record of critical system actions for security analysis [...] |
| ⚪ | Generic: Robust Error Handling and Edge Case Management. Objective: Ensure comprehensive error handling that provides meaningful context and graceful [...] |
| ⚪ | Generic: Secure Error Handling. Objective: To prevent the leakage of sensitive system information through error messages while [...] |
| ⚪ | Generic: Security-First Input Validation and Data Handling. Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent [...] |
Compliance status legend
- 🟢 - Fully Compliant
- 🟡 - Partial Compliant
- 🔴 - Not Compliant
- ⚪ - Requires Further Human Verification
- 🏷️ - Compliance label
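The terminal escape injection item above is truncated, so the exact concern it raises is not known. One common mitigation for that class of issue is to neutralize control characters in untrusted text (file paths, parser messages) before adding the tool's own color codes. The sketch below illustrates that idea only; `sanitize_for_terminal` and `format_error_line` are hypothetical helpers, not functions from the PR.

```rust
/// Replaces every control character (including ESC, 0x1B, and newlines) with its
/// visible Rust-style escape, so untrusted text cannot emit terminal sequences.
fn sanitize_for_terminal(untrusted: &str) -> String {
    let mut out = String::with_capacity(untrusted.len());
    for c in untrusted.chars() {
        if c.is_control() {
            // `escape_default` yields the characters of the escaped form.
            out.extend(c.escape_default());
        } else {
            out.push(c);
        }
    }
    out
}

/// Builds one error line. Only the wrapper's own color codes may contain ESC;
/// the path and message are sanitized first.
fn format_error_line(path: &str, message: &str, use_color: bool) -> String {
    let text = format!(
        "{}: {}",
        sanitize_for_terminal(path),
        sanitize_for_terminal(message)
    );
    if use_color {
        format!("\x1b[31m{}\x1b[0m", text)
    } else {
        text
    }
}
```

Sanitizing before colorizing keeps the two concerns separate: the formatter decides where escape sequences go, and nothing taken from a file name or an XML document can add more.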
CI Feedback 🧐
A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
- Action: Build and Test (windows-latest)
- Failed stage: Install libxml2 [❌]
- Failed test name: ""
- Failure summary: The action failed during the dependency setup step on Windows:
- Relevant error logs:
PR Code Suggestions ✨
Explore these optional code suggestions:
| Category | Suggestion | Impact |
|---|---|---|
| High-level | Adopt standard crates for common tasks. Replace custom-built components with standard, community-vetted libraries. For [...] Examples: src/config.rs [187-551], src/file_discovery.rs [7-193]. Suggestion importance [1-10]: 9. Why: This is a critical architectural suggestion that correctly identifies two major areas ( [...] | High |
| Possible issue | Fix ineffective fail-fast and error handling. Refactor the [...] Suggestion importance [1-10]: 8. Why: The suggestion correctly identifies that the [...] | Medium |
| Possible issue | Handle non-UTF-8 file paths. Improve file path handling to support non-UTF-8 characters by using [...] Suggestion importance [1-10]: 8. Why: The suggestion correctly identifies that using [...] | Medium |
| Possible issue | Fix incorrect configuration merge logic. Correct the configuration merging logic in [...] Suggestion importance [1-10]: 7. Why: The suggestion correctly identifies a significant flaw in the configuration merging logic where default values from an override config can unintentionally replace explicitly set values in the base config. The proposed fix of comparing against default values before merging is a valid approach to solve this problem, preventing incorrect configuration states. | Medium |
| Possible issue | Prevent panic on path parent resolution. Prevent a potential panic in src/schema_loader.rs [141-154]. Suggestion importance [1-10]: 7. Why: The suggestion correctly identifies a potential panic in [...] | Medium |
| Possible issue | Fix cache inconsistency on read. In [...] Suggestion importance [1-10]: 7. Why: The suggestion correctly identifies that orphaned metadata files can occur and proposes a good improvement to clean them up during read operations, enhancing cache consistency. | Medium |
| Possible issue | Simplify initial directory traversal logic. Simplify the [...] Suggestion importance [1-10]: 7. Why: The suggestion correctly identifies that the initial traversal logic in [...] | Medium |
| General | Clean up orphaned cache entries. Update the [...] Suggestion importance [1-10]: 7. Why: The suggestion correctly identifies a scenario that leads to orphaned cache files and proposes a robust fix to improve the cache cleanup logic, preventing wasted disk space. | Medium |
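The configuration merge suggestion in the table describes defaults from an override layer replacing values that were explicitly set in the base layer. The suggested fix compares against default values; a different approach, sketched below as an illustration only, is to keep every layer as `Option` fields and apply defaults once after merging. `PartialConfig`, `Config`, `load`, the field names, and the default values are all hypothetical and not taken from src/config.rs.

```rust
/// One configuration layer. `None` means "this source did not set the field".
#[derive(Debug, Default, Clone, PartialEq)]
struct PartialConfig {
    threads: Option<usize>,
    format: Option<String>,
    cache_ttl_hours: Option<u64>,
}

/// Fully resolved configuration.
#[derive(Debug, PartialEq)]
struct Config {
    threads: usize,
    format: String,
    cache_ttl_hours: u64,
}

impl PartialConfig {
    /// Fields set in `over` win; unset fields fall back to `self`.
    fn merged_with(self, over: PartialConfig) -> PartialConfig {
        PartialConfig {
            threads: over.threads.or(self.threads),
            format: over.format.or(self.format),
            cache_ttl_hours: over.cache_ttl_hours.or(self.cache_ttl_hours),
        }
    }

    /// Defaults are applied exactly once, after all layers are merged.
    /// The default values here are placeholders for the example.
    fn resolve(self) -> Config {
        Config {
            threads: self.threads.unwrap_or(4),
            format: self.format.unwrap_or_else(|| "human".to_string()),
            cache_ttl_hours: self.cache_ttl_hours.unwrap_or(24),
        }
    }
}

/// Precedence from lowest to highest: file, then environment, then CLI.
fn load(file: PartialConfig, env: PartialConfig, cli: PartialConfig) -> Config {
    file.merged_with(env).merged_with(cli).resolve()
}
```

Because an unset field is `None` rather than a default value, a higher layer cannot silently overwrite a lower one, and a user can still pass a value on the command line that happens to equal the default and have it take effect. A compare-against-default merge cannot tell those two cases apart.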