ckanext-validation
ckanext-validation copied to clipboard
Allow upload of invalid resource data
Currently ckanext-validation doesn't allow uploading resources which don’t pass validation. If the validation fails the extension tries to remove a failed resource file. But there are cases where CKAN should be allowed to store invalid data. In cases like asynchronous bulk upload or large tabular data compiled by multiple parties the validation report is important for data curators to know what’s going on in the system.
This could be solved by adding extra configuration option:
ckanext.validation.allow_invalid_data = False
This would let us to modify logic in ckanext/validation/logic.py (https://github.com/frictionlessdata/ckanext-validation/blob/1073c80dace453a404df3d5f1ac1ae86a88b5029/ckanext/validation/logic.py#L665) to preserve data file in case of validation error e.g:
allow_invalid_data = bool(t.config.get(
'ckanext.validation.allow_invalid_data'
))
if not report['valid'] and not allow_invalid_data:
...
I’ll gladly provide a PR for this if you think this is something useful. It’s a requirement of our users, so we’re currently using it in a forked version of ckanext-validation
I was under the impression that if only asynchronous validation is used, then the desired behaviour is already the case; you can upload an invalid resource and just get a report on the failures.
It's when you enable synchronous validation that uploads will be rejected.
You're right, I've missed this, I was testing v2.0.0 in synchronous mode. Wouldn't it make more sense to have the same behavior for both async and sync mode?
- Either a configurable option like
ckanext.validation.allow_invalid_data_in_sync_mode. - Or just enable invalid data upload in sync mode permanently, so it's consistent with the asynchronous mode?
Wouldn't it make more sense to have the same behavior for both async and sync mode?
I don't think so. They're not just performance differences, they actually give you different feedback. Synchronous validation can actually give you an error, whereas async just gives you a report.
If you have chosen not to validate resources until after upload is complete, then clearly you are okay with potentially invalid resources existing in your system.
If you have chosen to take the performance hit to validate potentially large resources on the spot, then clearly you care a lot about ensuring that they're valid, and it makes sense to deny invalid uploads.
It's like the difference between having metal detectors and X-raying luggage, vs just having security cameras and doing background checks. One has the capability and expectation to actually deny entry to those who fail, the other doesn't.
In our case we have users that are collaborating on the uploaded data. Some files can be incomplete (e.g. incomplete patient surveys etc.) and the gaps will be filled by other users. I understand this can be achieved with asynchronous mode, but I think this behavior should be available in both modes.
Hello @ThrawnCA . I'm working together with @fulior on the issue. Thanks a lot for your valuable feedback and pointing out the differences between sync and async validation regarding data file deletion. This resolves one of our use cases. However, I still find the option to keep invalid data uploaded to the system with sync validation sensible. Two big reasons behind it (I wonder what you think?):
- synchronous validation is easier to work with on local dev env, and it would be nice to reproduce the production behaviour of keeping invalid data in the system with the validation report
- a use case where data is curated by multiple users and someone would like to upload a partially valid file to CKAN (with synchronous validation) so other parties can update the file further (e.g. adding missing column values).
ckanext.validation.allow_invalid_data could be False by default and hence preserve old behaviour. It is not a big logic change and would make our configuration and workflow much more consistent.
synchronous validation is easier to work with on local dev env
How so? Wouldn't it be more important to replicate production behaviour? With a local database, I would expect that validation would complete in seconds at most, so there shouldn't be any particular difficulties in testing it.
a use case where data is curated by multiple users and someone would like to upload a partially valid file to CKAN (with synchronous validation) so other parties can update the file further (e.g. adding missing column values).
Why couldn't this use asynchronous validation? All you need is the report to tell you when you've successfully fixed it, right? So asynchronous validation works fine.