Issue: Enhance Sponsor Scraping System with Automation and Robustness

Open DiegoPisa2003 opened this issue 5 months ago • 0 comments

Overview

This issue outlines the planned improvements to the sponsor scraping system, started in PR #844 by @marcoroth and @brauliomartinezlm. I want to see if I can make it more automated, reliable, and maintainable. Currently, sponsor extraction requires manual intervention and lacks error handling and testing.

Current State

The sponsor scraping system (app/lib/download_sponsors.rb) uses Capybara + Cuprite + ActiveGenie with AI to extract sponsor data from event websites. However, it requires manual execution and lacks automation for bulk processing.

Planned Improvements

High Priority

Create automation script for bulk sponsor downloads
- Reason: Currently, sponsors must be downloaded manually for each event using event.sponsors_file.download. This creates a bottleneck in the event onboarding process and leads to inconsistent sponsor data across events. A bulk automation script would streamline the workflow and ensure all events have up-to-date sponsor information.
- Implementation: Create scripts/download_sponsors.rb that processes all events without existing sponsor files
- Impact: Reduces manual work, improves data consistency, speeds up event onboarding
Add comprehensive tests for scraping system
- Unit tests for DownloadSponsors class
- Integration tests for sponsor extraction workflow
- Mock tests for web scraping scenarios
Improve error handling and logging
- Add robust error handling for network failures
- Add retry mechanisms for transient failures
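
A rough sketch of how the bulk script and the retry handling could fit together. Only `event.sponsors_file.download` comes from the current code. The `exist?` check, the `slug` attribute, the `with_retries` and `download_missing_sponsors` helpers, and the list of errors treated as transient are all assumptions for illustration and would need to be matched to what `DownloadSponsors` actually raises.

```ruby
require "logger"
require "socket"
require "timeout"

# Assumption: these are the errors worth retrying. The real scraper
# (Capybara + Cuprite) may raise different classes.
TRANSIENT_ERRORS = [
  Timeout::Error, Errno::ECONNRESET, Errno::ECONNREFUSED, SocketError
].freeze

# Hypothetical helper: runs the block, retrying transient failures with
# exponential backoff (base_delay, 2 * base_delay, 4 * base_delay, ...).
# `sleeper` is injectable so the wait can be skipped in tests.
def with_retries(max_attempts: 3, base_delay: 1.0,
                 logger: Logger.new($stderr), sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield
  rescue *TRANSIENT_ERRORS => e
    raise if attempt >= max_attempts

    delay = base_delay * (2**(attempt - 1))
    logger.warn("attempt #{attempt} failed (#{e.class}: #{e.message}), retrying in #{delay}s")
    sleeper.call(delay)
    retry
  end
end

# Hypothetical bulk runner: skips events that already have a sponsors file
# and keeps going when a single event fails, so one bad site does not
# abort the whole run.
def download_missing_sponsors(events, logger: Logger.new($stderr), **retry_opts)
  results = { downloaded: [], skipped: [], failed: [] }

  events.each do |event|
    if event.sponsors_file.exist? # assumed predicate, not confirmed in the codebase
      results[:skipped] << event.slug
      next
    end

    begin
      with_retries(logger: logger, **retry_opts) { event.sponsors_file.download }
      results[:downloaded] << event.slug
    rescue StandardError => e
      logger.error("#{event.slug}: #{e.class}: #{e.message}")
      results[:failed] << event.slug
    end
  end

  results
end
```

Returning a summary hash instead of raising makes it easier to log which events still need a manual look after a bulk run.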

Medium Priority

Optimize AI prompts for better extraction (if possible)
- Refine ActiveGenie schemas for improved accuracy
- Add validation for extracted data quality
Add data validation
- Post-processing validation of extracted sponsor data
- Duplicate detection and merging
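
One possible shape for the post-processing step, as a sketch only. The `Sponsor` struct and its `name`, `website`, and `tier` fields are assumptions and do not reflect the actual ActiveGenie schema. The validation rules (non-blank name, optional but well-formed http(s) URL) and the name-based duplicate key are illustrative choices, not settled requirements.

```ruby
require "uri"

# Hypothetical record for one extracted sponsor.
Sponsor = Struct.new(:name, :website, :tier, keyword_init: true)

# Builds a comparison key: lowercase, with punctuation and runs of
# whitespace collapsed to single spaces ("Acme-Corp " -> "acme corp").
def normalize_name(name)
  name.to_s.downcase.gsub(/[^[:alnum:]]+/, " ").strip
end

# A sponsor is kept when it has a name and, if a website is present,
# that website parses as an http or https URL with a host.
def valid_sponsor?(sponsor)
  return false if sponsor.name.to_s.strip.empty?
  return true if sponsor.website.nil?

  uri = URI.parse(sponsor.website)
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

# Groups sponsors by normalized name and merges each group into one
# record: the first occurrence wins, later ones only fill in nil fields.
def merge_duplicates(sponsors)
  sponsors.group_by { |s| normalize_name(s.name) }.map do |_key, group|
    group.reduce do |merged, other|
      Sponsor.new(
        name: merged.name,
        website: merged.website || other.website,
        tier: merged.tier || other.tier
      )
    end
  end
end

# Example with made-up data:
raw = [
  Sponsor.new(name: "Acme Corp", website: nil, tier: "Gold"),
  Sponsor.new(name: "acme-corp ", website: "https://acme.example", tier: nil),
  Sponsor.new(name: "", website: "https://example.com", tier: "Silver"),
  Sponsor.new(name: "Widgets", website: "not a url", tier: "Silver")
]
clean = merge_duplicates(raw.select { |s| valid_sponsor?(s) })
```

Matching on the normalized name alone will miss cases like "Acme" vs "Acme Corp", so a fuzzier comparison (or matching on website host) may be worth exploring later.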

Aug 08 '25 18:08 DiegoPisa2003