Create a backend to transform USPTO patents (XML and TXT) to DoclingDocument

Open ceberam opened this issue 1 year ago • 0 comments

Requested feature

  • The Docling library defines a DeclarativeDocumentBackend abstract class to transform different document formats to DoclingDocument without a recognition pipeline. Implementations include HTMLDocumentBackend for HTML pages and MsWordDocumentBackend for MS Word documents.
  • The United States Patent and Trademark Office (USPTO) is the federal agency that grants U.S. patents and registers trademarks. The USPTO disseminates public patent and trademark data as pre-packaged or user-customized bulk data products through the Bulk Data Storage System.
  • Patent applications and grants are available in several formats. In particular, full-text data (no images) is available in XML format, packaged in zip files. Some older grants, however, are in tabular format (grants from January 1976 through December 2001).

This feature consists of providing a document backend implementation that parses the text content of USPTO patent grants and applications into a DoclingDocument.
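As a rough illustration of the parsing step such a backend would need, here is a minimal sketch that uses only the Python standard library. It assumes the XML flavor uses the element names `invention-title`, `abstract`, `claims` and `claim`, which may differ between USPTO schema versions. `PatentText`, `parse_patent_xml` and `SAMPLE` are hypothetical names made up for this example and are not part of Docling. Mapping the extracted text onto a DoclingDocument is deliberately left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatentText:
    """Hypothetical intermediate container; not a Docling type."""
    title: str = ""
    abstract: List[str] = field(default_factory=list)
    claims: List[str] = field(default_factory=list)


def _text(elem: ET.Element) -> str:
    # Flatten an element and all of its children into one
    # whitespace-normalized string (inline markup is dropped).
    return " ".join("".join(elem.itertext()).split())


def parse_patent_xml(xml_string: str) -> PatentText:
    # Element names below are assumptions about the patent XML layout.
    root = ET.fromstring(xml_string)
    doc = PatentText()
    title = root.find(".//invention-title")
    if title is not None:
        doc.title = _text(title)
    doc.abstract = [_text(p) for p in root.findall(".//abstract/p")]
    doc.claims = [_text(c) for c in root.findall(".//claims/claim")]
    return doc


# Tiny hand-written sample, not real USPTO data.
SAMPLE = """<us-patent-grant>
  <us-bibliographic-data-grant>
    <invention-title id="d0e1">Widget holder</invention-title>
  </us-bibliographic-data-grant>
  <abstract><p>A holder for <b>widgets</b>.</p></abstract>
  <claims>
    <claim id="CLM-00001">
      <claim-text>1. A holder comprising a base.</claim-text>
    </claim>
  </claims>
</us-patent-grant>"""

if __name__ == "__main__":
    print(parse_patent_xml(SAMPLE))
```

A real backend would additionally have to handle the older non-XML grants and the different XML schema versions, which this sketch does not attempt.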

Alternatives

There are no alternatives at this point, since this is a new feature.

ceberam, Dec 16 '24 09:12