spidermon
Unable to validate date and date-time with jsonschema
After https://github.com/scrapinghub/spidermon/pull/358, validating date fields with jsonschema no longer works as it did before. Spidermon used to serialize date fields into strings (https://github.com/scrapinghub/spidermon/pull/358/files#diff-7937ac85a30630fe837b9c133f4459ee590680bb5dfce72775db6005f2b45f51L142) before passing the item to the jsonschema validators, so the date and date-time checkers (https://python-jsonschema.readthedocs.io/en/stable/validate/#validating-formats) worked even when the item contained a datetime.date or a datetime.datetime instance. Now those instances reach the validator unchanged and fail the "string" type check.
Given the code:
import datetime

from itemadapter import ItemAdapter
from jsonschema import FormatChecker
from jsonschema.validators import validator_for
from spidermon.contrib.scrapy.pipelines import ItemValidationPipeline

# Enable format checks such as "date" and "date-time".
format_checker = FormatChecker()

schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "date": {
            "description": "Date of the gazette",
            "type": "string",
            "format": "date"
        }
    },
    "required": [
        "date",
    ]
}

# Pick the validator class matching the "$schema" draft.
validator_cls = validator_for(schema)
validator = validator_cls(schema=schema, format_checker=format_checker)

# The item holds a datetime.date instance, not a string.
original_data = {
    'date': datetime.date.today()
}
Validating with spidermon 1.20.0
>>> item_adapter = ItemAdapter(original_data)
>>> item_dict = item_adapter.asdict()
>>> errors = validator.iter_errors(item_dict)
>>> [error for error in errors]
[<ValidationError: "datetime.date(2023, 9, 19) is not of type 'string'">]
With spidermon 1.17.0
>>> data = ItemValidationPipeline._convert_item_to_dict(_, original_data)
>>> errors = validator.iter_errors(data)
>>> [error for error in errors]
[]
Validating with spidermon 1.20.0
>>> errors = validator.iter_errors(data)
>>> [error for error in errors]
[<ValidationError: "datetime.date(2023, 9, 19) is not of type 'string'">]
This change can break applications that rely on Spidermon to understand date and datetime values and validate them with jsonschema.
To make it work, the user needs to manually serialize the date and datetime values in the items. But I am trying to figure out whether there is a solution that could be implemented on the Spidermon side, to avoid this manipulation.
cc @VMRuiz @Gallaecio
Hey, sorry for getting back to you late on this. I'm not entirely sure if we should change anything here. If you want your field to be a string with a date format, you could scrape it that way or set up an item pipeline to automatically convert datetime objects into strings if that's easier for you.
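As a rough illustration of the item pipeline option, here is a minimal sketch. The class name DateToStringPipeline is hypothetical and not part of Spidermon, and the code assumes dict-like items (a plain dict or a scrapy.Item); other item types would need ItemAdapter.

```python
import datetime


class DateToStringPipeline:
    """Hypothetical Scrapy item pipeline; not part of Spidermon."""

    def process_item(self, item, spider):
        # Assumption: the item is dict-like (a dict or a scrapy.Item).
        for key, value in list(item.items()):
            # datetime.datetime subclasses datetime.date,
            # so this single check covers both types.
            if isinstance(value, datetime.date):
                item[key] = value.isoformat()
        return item
```

For this to help, the pipeline would need a lower number in ITEM_PIPELINES than Spidermon's ItemValidationPipeline so that it runs first. Note that isoformat() on a naive datetime produces no UTC offset, which a strict date-time format check may reject, so timezone-aware values are probably safer there.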
I don't think Spidermon should make that decision for you by default. But I'm open to the idea of adding it as an opt-in feature where you can configure auto-casting methods for your fields. It could come in handy, especially when you want to validate with Jsonschema but still keep the original data types, like for binary RPC calls.
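To make the opt-in idea more concrete, a sketch of what per-field casting could look like is below. Everything here is an assumption for discussion: the CASTERS mapping and cast_for_validation are hypothetical names, not an existing Spidermon setting or function. The point is that only a copy is cast for validation, while the original item keeps its data types.

```python
import datetime

# Hypothetical per-field casters. The mapping shape is an assumption,
# not an existing Spidermon setting.
CASTERS = {
    "date": lambda v: v.isoformat() if isinstance(v, datetime.date) else v,
}


def cast_for_validation(item, casters):
    """Return a shallow copy of item with the configured fields cast.

    The original item is left untouched, so later pipelines still
    see the original data types.
    """
    data = dict(item)
    for field, cast in casters.items():
        if field in data:
            data[field] = cast(data[field])
    return data
```

The copy returned by cast_for_validation would be the one handed to validator.iter_errors, and fields without a configured caster pass through unchanged.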
What do you think @Gallaecio @curita ?