feat: create a backend to parse USPTO patents into DoclingDocument

Open ceberam opened this issue 11 months ago • 1 comments

Resolves #605

This PR implements the following changes:

  • Add a backend implementation to parse patent applications and grants from the United States Patent and Trademark Office (USPTO).
  • Refactor the module docling/datamodel/document.py to handle multiple InputFormat instances that share the same MIME type. In particular, add a function that inspects part of an input document to guess which InputFormat instance to use for the conversion.
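
For reviewers, the content-based disambiguation in the second point can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the InputFormat members, the guess_from_content name, and the DOCTYPE hints are assumptions made for illustration.

```python
import re
from enum import Enum
from typing import Optional


class InputFormat(Enum):
    """Hypothetical stand-in for the real enum; members are illustrative."""

    XML_USPTO = "xml_uspto"
    XML_OTHER = "xml_other"


# Assumed DOCTYPE markers for USPTO documents (illustrative, not exhaustive).
USPTO_DOCTYPE_HINTS = ("us-patent-application", "us-patent-grant")


def guess_from_content(head: str, mime: str) -> Optional[InputFormat]:
    """Pick a format by inspecting the first part of a document.

    `head` is the beginning of the file; `mime` is the detected MIME type.
    """
    if mime != "application/xml":
        return None  # this sketch only disambiguates XML
    doctype = re.search(r"<!DOCTYPE\s+[^>]+>", head)
    if doctype and any(hint in doctype.group() for hint in USPTO_DOCTYPE_HINTS):
        return InputFormat.XML_USPTO
    return InputFormat.XML_OTHER
```

The point of the sketch is that the MIME type only narrows the candidates; the final choice comes from a cheap look at the document head rather than from parsing the whole file.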

This PR is intended to be merged after #557

Checklist:

  • [x] Documentation has been updated, if necessary.
  • [x] Examples have been added, if necessary.
  • [x] Tests have been added, if necessary.

Limitations:

The following points will need to be addressed in later PRs:

  • Slightly refactor the guess_format function in the docling/datamodel/document.py module once another XML InputFormat is supported, since the application/xml MIME type will then be ambiguous.
  • Add an abstract static method in abstract_backend.py that examines part of a document's content and determines whether the backend implementation supports a document type with that content. This method could then be called from the docling/datamodel/document.py module to avoid duplicated code when disambiguating MIME types.
  • Add documentation and notebook examples.
  • Eventually create a default text/plain backend parser.
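
One possible shape for the abstract static method proposed above is sketched below. It is only an illustration: the class names, the supports_content method, the pick_backend helper, and the "us-patent-" marker are hypothetical and do not describe the current code base.

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence, Type


class AbstractDocumentBackend(ABC):
    """Hypothetical simplified base class; a real one would have more members."""

    @staticmethod
    @abstractmethod
    def supports_content(head: str) -> bool:
        """Return True if this backend can parse a document starting with `head`."""


class PatentUsptoBackend(AbstractDocumentBackend):
    @staticmethod
    def supports_content(head: str) -> bool:
        # Assumed marker, for illustration only.
        return "us-patent-" in head


def pick_backend(
    head: str, candidates: Sequence[Type[AbstractDocumentBackend]]
) -> Optional[Type[AbstractDocumentBackend]]:
    """Return the first candidate backend class that claims the content."""
    return next((b for b in candidates if b.supports_content(head)), None)
```

With this layout, the format-guessing code would only loop over the backends registered for an ambiguous MIME type, and each backend would keep its own detection logic.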

ceberam avatar Dec 16 '24 09:12 ceberam

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
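
To check a title locally before pushing, the pattern above can be tried in Python. This is a small sketch that assumes the ~= operator behaves like a regular-expression search; the is_conventional helper is a hypothetical name.

```python
import re

# Pattern copied from the rule above.
PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional commit prefix."""
    return PATTERN.search(title) is not None
```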

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • [X] #approved-reviews-by >= 2

mergify[bot] avatar Dec 16 '24 09:12 mergify[bot]