docling
feat: create a backend to parse USPTO patents into DoclingDocument
Resolves #605
This PR implements the following changes:
- Add a backend implementation to parse patent applications and grants from the United States Patent Office (USPTO).
- Refactor the module `docling/datamodel/document.py` to address the scenario of multiple `InputFormat` instances with the same mime type. In particular, add a function that further examines part of an input document to guess the `InputFormat` instance to use for the conversion.
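
For reviewers, the disambiguation idea in the second bullet can be sketched roughly as follows. This is not the code in this PR: `SketchFormat`, `USPTO_DOCTYPES`, and `guess_from_content` are hypothetical names, and the sketch assumes that USPTO XML files declare a `DOCTYPE` starting with `us-patent-grant` or `us-patent-application`.

```python
from enum import Enum
from typing import Optional


class SketchFormat(Enum):
    """Hypothetical stand-in for InputFormat (not docling's enum)."""

    XML_USPTO = "xml_uspto"
    XML_OTHER = "xml_other"


# Assumption: DOCTYPE names used by USPTO XML grants and applications.
USPTO_DOCTYPES = ("us-patent-grant", "us-patent-application")


def guess_from_content(head: bytes, mime: str) -> Optional[SketchFormat]:
    """Guess a format from the first bytes of a document.

    Returns None when the mime type is not one this sketch disambiguates.
    """
    if mime != "application/xml":
        return None
    # Only the beginning of the file is inspected, so decoding errors
    # caused by a multi-byte character cut at the boundary are ignored.
    text = head.decode("utf-8", errors="ignore").lower()
    if any(f"<!doctype {name}" in text for name in USPTO_DOCTYPES):
        return SketchFormat.XML_USPTO
    return SketchFormat.XML_OTHER
```

A caller would read only the first few kilobytes of the stream and pass them in, so that large patent files do not have to be loaded just to pick a backend.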
This PR is intended to be merged after #557
Checklist:
- [x] Documentation has been updated, if necessary.
- [x] Examples have been added, if necessary.
- [x] Tests have been added, if necessary.
Limitations:
The following points will need to be addressed in later PRs:
- Slightly refactor the `guess_format` function in the `docling/datamodel/document.py` module once we support another XML `InputFormat`, since the `application/xml` mime type will then be ambiguous.
- Add an abstract static method in `abstract_backend.py` that examines partial content of a document and determines whether the backend implementation supports a document type with that content. This function could then be called in the `docling/datamodel/document.py` module to avoid duplicated code when disambiguating mime types.
- Add documentation and notebook examples.
- Eventually create a default `text/plain` backend parser.
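
One possible shape for the abstract static method proposed above, as a minimal sketch only. The class names, `supports_content`, and `pick_backend` are hypothetical and do not exist in docling; the content checks are deliberately naive placeholders.

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence, Type


class SketchBackend(ABC):
    """Hypothetical base class, standing in for the abstract backend."""

    @staticmethod
    @abstractmethod
    def supports_content(head: str) -> bool:
        """Return True if this backend can parse a document starting with `head`."""


class PatentXmlBackend(SketchBackend):
    @staticmethod
    def supports_content(head: str) -> bool:
        # Assumption: USPTO XML declares a DOCTYPE beginning with "us-patent-".
        return "<!doctype us-patent-" in head.lower()


class PlainTextBackend(SketchBackend):
    @staticmethod
    def supports_content(head: str) -> bool:
        # Naive placeholder: anything that does not look like markup.
        return not head.lstrip().startswith("<")


def pick_backend(
    head: str, candidates: Sequence[Type[SketchBackend]]
) -> Optional[Type[SketchBackend]]:
    """Return the first candidate class that claims the content, if any."""
    for backend in candidates:
        if backend.supports_content(head):
            return backend
    return None
```

With this layout, the format-guessing code would only iterate over the backends registered for an ambiguous mime type, and each content check would live next to the parser that depends on it.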
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X] `title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:`
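
To check a title locally before pushing, the rule can be approximated with Python's `re` module. This is a sketch that assumes the `~=` operator performs a regular-expression search; `is_conventional` is a hypothetical helper name, and the merge bot's regex engine may differ from Python's in edge cases.

```python
import re

# Pattern copied from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None
```

The title of this PR, `feat: create a backend to parse USPTO patents into DoclingDocument`, passes the check; a title without a type prefix, such as `Add USPTO backend`, would not.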
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
- [X] `#approved-reviews-by >= 2`