framework icon indicating copy to clipboard operation
framework copied to clipboard

fix: parallel file validation for a datapackage

Open pierrecamilleri opened this issue 2 months ago • 0 comments

  • fixes #1644

Context

  • In validator.py, Resource was imported as a type, but then used as a class to retrieve a class method (Resource.from_descriptor).
  • The tests would not intercept this error, as they were performed on a schema with a foreign key, and the validation would fall back on synchronous validation in this case.
  • Importing Resource (not as a type) into validator.py leads to a circular import. A small workaround would have been to import Resource locally, but this was not very satisfactory.

Suggested changes and fixes

  • I repaired the tests, using datapackages with no foreign keys to test parallel validation. I checked that these tests would have failed because with the bug.
  • I moved the implementation of Validator.validate_resource and Validator.validate_package methods to the Resource.validate and Package.validate methods. This solves the circular import problem, and removes duplication (method signatures).
  • The tests have been dispatched as well.
  • I soft deprecated the Validator class, keeping the methods (forwarding input parameters) as they are part of the public API. No warning, only a deprecation message in the docstring, and a mention that there is no plan to remove the class in the future.

Note

I could not produce any significant performance gain in my tests, I wonder if there is some shared resources' lock that prevents the validation from effectively running the validation in parallel (but I have not spent much time on this yet).

Archive of explorations
  • [X] Test that it works correctly
    • Works without error, however, I could not produce any significant performance gain in my tests.
  • [X] Check why the tests did not intercept this error
    • The tests for parallel processing are run for a schema with a foreign key - which fall backs to single-core processing. So the validate_parallel function was actually never executed in the tests.
  • [X] See if I can find a more elegant way to deal with the circular import. I actually don't see the benefit of having this validator.py class instead of dispatching the methods to Resource and DataPackage directly, but I may have missed something.

pierrecamilleri avatar Dec 13 '24 16:12 pierrecamilleri