framework
framework copied to clipboard
fix: parallel file validation for a datapackage
- fixes #1644
Context
- In validator.py,
Resource
was imported as a type, but then used as a class to retrieve a class method (Resource.from_descriptor
). - The tests would not intercept this error, as they were performed on a schema with a foreign key, and the validation would fall back on synchronous validation in this case.
- Importing
Resource
(not as a type) intovalidator.py
leads to a circular import. A small workaround would have been to import Resource locally, but this was not very satisfactory.
Suggested changes and fixes
- I repaired the tests, using datapackages with no foreign keys to test parallel validation. I checked that these tests would have failed because with the bug.
- I moved the implementation of
Validator.validate_resource
andValidator.validate_package
methods to theResource.validate
andPackage.validate
methods. This solves the circular import problem, and removes duplication (method signatures). - The tests have been dispatched as well.
- I soft deprecated the
Validator
class, keeping the methods (forwarding input parameters) as they are part of the public API. No warning, only a deprecation message in the docstring, and a mention that there is no plan to remove the class in the future.
Note
I could not produce any significant performance gain in my tests, I wonder if there is some shared resources' lock that prevents the validation from effectively running the validation in parallel (but I have not spent much time on this yet).
Archive of explorations
- [X] Test that it works correctly
- Works without error, however, I could not produce any significant performance gain in my tests.
- [X] Check why the tests did not intercept this error
- The tests for parallel processing are run for a schema with a foreign key - which fall backs to single-core processing. So the
validate_parallel
function was actually never executed in the tests.
- The tests for parallel processing are run for a schema with a foreign key - which fall backs to single-core processing. So the
- [X] See if I can find a more elegant way to deal with the circular import. I actually don't see the benefit of having this validator.py class instead of dispatching the methods to
Resource
andDataPackage
directly, but I may have missed something.