Issue: Enhance Sponsor Scraping System with Automation and Robustness
Overview
This issue outlines the planned improvements to the sponsor scraping system, started in PR #844 by @marcoroth and @brauliomartinezlm. I want to see if I can make it more automated, reliable, and maintainable. Currently, sponsor extraction requires manual intervention and lacks error handling and testing.
Current State
The sponsor scraping system (app/lib/download_sponsors.rb) uses Capybara + Cuprite + ActiveGenie with AI to extract sponsor data from event websites. However, it requires manual execution and lacks automation for bulk processing.
Planned Improvements
High Priority
- Create automation script for bulk sponsor downloads
  - Reason: Currently, sponsors must be downloaded manually for each event using event.sponsors_file.download. This creates a bottleneck in the event onboarding process and leads to inconsistent sponsor data across events. A bulk automation script would streamline the workflow and ensure all events have up-to-date sponsor information.
  - Implementation: Create scripts/download_sponsors.rb that processes all events without existing sponsor files
  - Impact: Reduces manual work, improves data consistency, speeds up event onboarding
- Add comprehensive tests for scraping system
  - Unit tests for DownloadSponsors class
  - Integration tests for sponsor extraction workflow
  - Mock tests for web scraping scenarios
- Improve error handling and logging
  - Add robust error handling for network failures
  - Add retry mechanisms for transient failures
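To make the bulk script and retry items above more concrete, here is a rough sketch of how they might fit together. It is only a starting point under a few assumptions: the only call taken from the current code is event.sponsors_file.download; sponsors_file.exist?, the with_retries and download_missing_sponsors helpers, and the TRANSIENT_ERRORS list are all hypothetical names, and the errors the scraper actually raises may differ.

```ruby
require "logger"
require "socket"
require "timeout"

# Assumed set of transient errors; adjust to whatever the scraper really raises.
TRANSIENT_ERRORS = [Timeout::Error, SocketError, Errno::ECONNRESET, Errno::ECONNREFUSED].freeze

# Runs the block and retries transient errors with exponential backoff.
# Re-raises the last error once `attempts` is exhausted.
def with_retries(attempts: 3, base_delay: 1.0, logger: Logger.new($stderr))
  tries = 0
  begin
    tries += 1
    yield
  rescue *TRANSIENT_ERRORS => e
    raise if tries >= attempts

    delay = base_delay * (2**(tries - 1))
    logger.warn("Attempt #{tries} failed (#{e.class}: #{e.message}); retrying in #{delay}s")
    sleep(delay)
    retry
  end
end

# events: any enumerable of objects that respond to sponsors_file, where the
# file object responds to exist? (assumed) and download (from the current code).
def download_missing_sponsors(events, logger: Logger.new($stderr), **retry_options)
  result = { downloaded: [], skipped: [], failed: [] }

  events.each do |event|
    if event.sponsors_file.exist?
      result[:skipped] << event
      next
    end

    begin
      with_retries(logger: logger, **retry_options) { event.sponsors_file.download }
      result[:downloaded] << event
    rescue StandardError => e
      # One failing event should not abort the whole batch.
      logger.error("Sponsor download failed for #{event.inspect}: #{e.class}: #{e.message}")
      result[:failed] << event
    end
  end

  result
end
```

In scripts/download_sponsors.rb this could be called with whatever collection of events the app exposes. Returning the three lists keeps the script easy to test and lets it print a summary at the end.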
Medium Priority
- Optimize AI prompts for better extraction (if possible)
  - Refine ActiveGenie schemas for improved accuracy
  - Add validation for extracted data quality
- Add data validation
  - Post-processing validation of extracted sponsor data
  - Duplicate detection and merging
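As a possible shape for the validation and duplicate merging items, here is a small sketch in plain Ruby. The sponsor format is an assumption: it treats each extracted sponsor as a hash with :name, :website, and :tier keys, which may not match what the scraper actually produces, and all helper names (normalize_name, valid_sponsor?, merge_duplicates, clean_sponsors) are hypothetical.

```ruby
require "uri"

# Lowercases the name and collapses punctuation/whitespace so that
# "Acme Inc." and "ACME inc" compare as the same sponsor.
def normalize_name(name)
  name.to_s.downcase.gsub(/[^a-z0-9]+/, " ").strip
end

# A sponsor is considered valid if it has a non-blank name and, when a
# website is present, that website parses as an http(s) URL with a host.
def valid_sponsor?(sponsor)
  return false if normalize_name(sponsor[:name]).empty?

  website = sponsor[:website].to_s
  return true if website.empty?

  uri = URI.parse(website)
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

# Groups sponsors by normalized name and merges each group into one hash,
# keeping the first non-blank value seen for every field.
def merge_duplicates(sponsors)
  sponsors.group_by { |s| normalize_name(s[:name]) }.map do |_key, group|
    group.reduce do |merged, sponsor|
      merged.merge(sponsor) do |_field, kept, incoming|
        (kept.nil? || kept.to_s.empty?) ? incoming : kept
      end
    end
  end
end

# Splits the raw list into cleaned sponsors and rejected entries.
def clean_sponsors(sponsors)
  valid, rejected = sponsors.partition { |s| valid_sponsor?(s) }
  { sponsors: merge_duplicates(valid), rejected: rejected }
end
```

Keeping the rejected entries around, rather than dropping them silently, would make it easier to spot cases where the AI extraction needs prompt or schema changes.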