juriscraper

Add validations for case names, dates, and download URLs in _check_sanity

Open Luis-manzur opened this issue 8 months ago • 2 comments

This pull request introduces new validations to the _check_sanity method in AbstractSite to enhance data integrity checks and improve error handling.

Enhancements to _check_sanity validations:

  • Added checks for suspicious file extensions in download_urls using a regular expression to detect potentially unsafe or unexpected file types.
  • Introduced validation for forbidden characters in case_names, logging warnings when detected.
  • Added a new sanity check to ensure case_dates are not earlier than the year 1900, raising an exception for invalid dates.
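As a rough illustration of the three checks listed above, here is a minimal standalone sketch. It is not the code from this pull request: the function name `check_sanity`, the exception `InsaneDataError`, and both regular expressions are hypothetical stand-ins, since the PR's actual patterns are not shown in this thread.

```python
import logging
import re
from datetime import date

logger = logging.getLogger(__name__)

# Hypothetical patterns, chosen only for illustration.
SUSPICIOUS_EXTENSIONS = re.compile(r"\.(exe|bat|sh|js)(\?.*)?$", re.IGNORECASE)
FORBIDDEN_NAME_CHARS = re.compile(r"[<>{}\[\]|\\^~]")
MIN_CASE_DATE = date(1900, 1, 1)


class InsaneDataError(ValueError):
    """Hypothetical exception raised when scraped data fails a hard check."""


def check_sanity(case_names, case_dates, download_urls):
    """Return a list of warning messages; raise on dates before 1900."""
    warnings = []

    # Soft check: flag download URLs ending in an unexpected file type.
    for url in download_urls:
        if SUSPICIOUS_EXTENSIONS.search(url):
            warnings.append(f"Suspicious file extension in URL: {url}")

    # Soft check: flag case names containing forbidden characters.
    for name in case_names:
        if FORBIDDEN_NAME_CHARS.search(name):
            warnings.append(f"Forbidden character in case name: {name}")

    for message in warnings:
        logger.warning(message)

    # Hard check: a case date before 1900 is treated as invalid data.
    for case_date in case_dates:
        if case_date < MIN_CASE_DATE:
            raise InsaneDataError(f"Case date before 1900: {case_date}")

    return warnings
```

In this sketch the URL and case-name checks only log warnings, while the date check raises, mirroring the split between warnings and exceptions described in the bullets.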

Luis-manzur, Jul 31 '25 17:07

Looks like we added validation for URL endings. Can we remove this, please, and focus just on the validation for dates?

flooie, Aug 25 '25 13:08

I would want to add more tests and do further research into the other components first, and I think including them complicates this PR.

flooie, Aug 25 '25 13:08