Add date format validation to `test_extract_from_text_properly_implemented` on test_ScraperExtractFromTextTest.py
We had an error on CourtListener when extracting date_filed using extract_from_text from the recently added bap1:
{
"OpinionCluster": {"date_filed": "July 29, 2022"},
},
...
File "/opt/courtlistener/cl/scrapers/tasks.py", line 179, in extract_doc_content
opinion.cluster.save(index=False)
....
django.core.exceptions.ValidationError: ['“July 29, 2022” value has an invalid date format. It must be in YYYY-MM-DD format.']
The test test_extract_from_text_properly_implemented in test_ScraperExtractFromTextTest.py should force the user to use the proper format when dealing with date fields.
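As a rough illustration of the kind of check that test could run, here is a minimal sketch using only the standard library. The helper name `find_bad_dates` and the rule "any string field whose name starts with `date_`" are assumptions made for this example, not existing juriscraper code.

```python
from datetime import datetime


def find_bad_dates(extracted, date_format="%Y-%m-%d"):
    """Return (model, field, value) triples for badly formatted date strings.

    `extracted` is assumed to look like the output of extract_from_text():
    a dict of model names mapping to dicts of field values.
    """
    bad = []
    for model_name, fields in extracted.items():
        for field_name, value in fields.items():
            # Hypothetical rule: only string values in fields named date_*
            # are checked, so booleans and date objects are skipped.
            if not field_name.startswith("date_") or not isinstance(value, str):
                continue
            try:
                datetime.strptime(value, date_format)
            except ValueError:
                bad.append((model_name, field_name, value))
    return bad
```

The test could then assert that `find_bad_dates(...)` returns an empty list for every example output a scraper provides.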
I'm going to go ahead and just push the update/fix for BAP1, but I like the idea of implementing the correct format. Thanks for tackling this.
I put this in the PR, but I think the proper fix for this is to do a json schema for our outputs. It'd help folks understand the code too, if we had schemas for all our scrapers that had to pass all tests before PRs were merged.
Another related error caused by the lack of validation:
https://freelawproject.sentry.io/issues/4772622463/?project=5257254&query=is%3Aunresolved&referrer=issue-stream&statsPeriod=14d&stream_index=8
As you say @mlissner, a JSON schema for validation would definitely help.
Yeah, let's get that prioritized. It shouldn't be terribly hard. Maybe a day or two, I'd guess.
I have been trying this implementation (docs), which seems like a healthy project.
Here is a small sample schema for the scrapers:
validation_schema = {
    "type": "object",
    "properties": {
        "case_names": {"type": "string"},
        "case_dates": {"type": "string", "format": "date-time"},
        "download_urls": {"type": "string"},
        "precedential_statuses": {"enum": ["Published", "Unpublished"]},
        "blocked_statuses": {"type": "boolean"},
        "date_filed_is_approximate": {"type": "boolean"},
        "citation": {"type": "string"},
        "docket": {"type": "string"},
    },
    "required": [
        "case_dates",
        "case_names",
        "download_urls",
        "precedential_statuses",
        "date_filed_is_approximate",
    ],
}
from jsonschema import Draft7Validator, FormatChecker
validator = Draft7Validator(validation_schema, format_checker=FormatChecker())
validator.validate({...})
Some nice things:
- support for "enum" / limited options: see "precedential_statuses"
- support for "required" fields
- flexible type checking, for example, date-time strings
- extensible validators for custom value types. This could be used for deferred values that are functions until consumed
In the end I think it will be faster to write these schemas by hand. At least for the scrapers, the scraped field names that CourtListener expects are different from the model names proper, so changing that on the scraping side would require changes on the CL side.
This schema validation could replace part of AbstractSite._check_sanity. A separate schema can be created for the output of extract_from_text() functions.
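For the extract_from_text() case, a minimal sketch of what such a separate schema might look like is below. It covers only the `OpinionCluster.date_filed` field from the error above. The names `extract_from_text_schema` and `schema_errors` are hypothetical, and the real schema would need to list the other models and fields as well.

```python
from jsonschema import Draft7Validator, FormatChecker

# Hypothetical schema covering only the field that caused the error above.
extract_from_text_schema = {
    "type": "object",
    "properties": {
        "OpinionCluster": {
            "type": "object",
            "properties": {
                # "date" means a full-date string such as "2022-07-29"
                "date_filed": {"type": "string", "format": "date"},
            },
        },
    },
}

# Formats are only enforced when a FormatChecker is passed in.
validator = Draft7Validator(
    extract_from_text_schema, format_checker=FormatChecker()
)


def schema_errors(output):
    """Collect the messages of all schema violations found in `output`."""
    return [error.message for error in validator.iter_errors(output)]
```

Using `iter_errors` rather than `validate` lets a test report every violation at once instead of stopping at the first one.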
Looks great to me.