Create a backend to transform XML files to DoclingDocument

Open ceberam opened this issue 1 year ago • 0 comments

Requested feature

The Docling library defines a DeclarativeDocumentBackend abstract class to transform different document formats to DoclingDocument without a recognition pipeline. Implementations include HTMLDocumentBackend for HTML pages and MsWordDocumentBackend for MS Word documents.

This feature request is about creating a generic backend for XML documents. The backend could leverage an XML Schema Definition (XSD) file to infer the structure of the document. Even though the XML format is very generic, there may be open-source libraries that already cover some cases or some widely adopted publication formats. For instance, the tei2html library is a collection of style sheets that transform a document encoded in TEI to HTML, and the resulting HTML can already be handled by Docling's HTML backend.
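As a rough illustration of the XSD idea, the sketch below reads a schema with Python's standard `xml.etree.ElementTree` and lists the elements declared under each top-level element. The toy schema and the `child_elements` helper are hypothetical and exist only for this example; a real backend would also need to handle named types, references, and imports.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by XML Schema documents
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical toy schema, not taken from any real publication format
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="article">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="para" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def child_elements(xsd_text: str) -> dict[str, list[str]]:
    """Map each top-level element to the names of the elements declared inside it."""
    root = ET.fromstring(xsd_text)
    structure = {}
    for top in root.findall(f"{XS}element"):
        # iter() also yields the top-level element itself, so filter it out
        nested = [el.get("name") for el in top.iter(f"{XS}element") if el is not top]
        structure[top.get("name")] = nested
    return structure


structure = child_elements(XSD)
print(structure)
```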

To provide more specific transformations, this feature could include examples of XML formats widely used in the open-source community. Two candidates are the Journal Article Tag Suite (JATS) XML format, used to describe scientific literature in archives like PubMed Central, and the USPTO XML format, which follows the Patent Grant International Common Element (ICE) DTD for patent grants and applications.
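To make the JATS case more concrete, here is a minimal sketch of the kind of mapping such a backend might perform. It uses only the standard library and returns plain `(label, text)` tuples as a stand-in for DoclingDocument items; it does not use the Docling API. The `jats_to_items` helper and the label names are assumptions made for this example, and the sample document is a hand-written fragment using common JATS tags.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment using common JATS tags (not a real article)
JATS = """<article>
  <front><article-meta><title-group>
    <article-title>Sample title</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second <italic>paragraph</italic>.</p></sec>
  </body>
</article>"""


def jats_to_items(xml_text: str) -> list[tuple[str, str]]:
    """Flatten a JATS-like article into (label, text) pairs in reading order."""
    root = ET.fromstring(xml_text)
    items = []
    title = root.find(".//article-title")
    if title is not None:
        items.append(("title", "".join(title.itertext()).strip()))
    # Only top-level sections are visited; nested <sec> elements are ignored here
    for sec in root.iterfind("./body/sec"):
        for child in sec:
            # itertext() also collects text inside inline markup such as <italic>
            text = "".join(child.itertext()).strip()
            if child.tag == "title":
                items.append(("section_header", text))
            elif child.tag == "p":
                items.append(("paragraph", text))
    return items


items = jats_to_items(JATS)
for label, text in items:
    print(f"{label}: {text}")
```

A real backend would emit DoclingDocument items instead of tuples and would need to cover tables, figures, references, and nested sections.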

Alternatives

There are no alternatives at this point, since this is a new feature.

ceberam, Nov 26 '24 17:11