
When a document (.doc) contains a Visio graphic, the extraction fails

Open nilcodes opened this issue 1 year ago • 1 comments

Exception in thread "main": java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
    at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:68)
    at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:40)
    at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:221)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:206)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
    at org.apache.tika.extractor.EmbeddedDocumentUtil.parseEmbedded(EmbeddedDocumentUtil.java:240)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:143)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:191)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:156)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:227)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:230)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at ai.yobix.TikaNativeMain.parseToStringWithConfig(TikaNativeMain.java:180)
    at ai.yobix.TikaNativeMain.parseBytesToString(TikaNativeMain.java:148)
Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/visiodocumentd431doctype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.visiodocumentd431doctype) - code 0
    at org.apache.xmlbeans.impl.schema.XsbReader.<init>(XsbReader.java:63)
    at org.apache.xmlbeans.impl.schema.SchemaTypeSystemImpl.resolveHandle(SchemaTypeSystemImpl.java:935)
    at org.apache.xmlbeans.impl.schema.ElementFactory.<init>(ElementFactory.java:29)
    at org.apache.xmlbeans.impl.schema.AbstractDocumentFactory.<init>(AbstractDocumentFactory.java:33)
    at org.apache.xmlbeans.impl.schema.DocumentFactory.<init>(DocumentFactory.java:23)
    at com.microsoft.schemas.office.visio.x2012.main.VisioDocumentDocument1.<clinit>(VisioDocumentDocument1.java:23)
    ... 22 more

Exception in thread "main": java.lang.NoClassDefFoundError
java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.schemas.office.visio.x2012.main.VisioDocumentDocument1
    at org.apache.poi.xdgf.usermodel.XmlVisioDocument.<init>(XmlVisioDocument.java:68)
    at org.apache.poi.xdgf.extractor.XDGFVisioExtractor.<init>(XDGFVisioExtractor.java:40)
    at org.apache.poi.ooxml.extractor.POIXMLExtractorFactory.create(POIXMLExtractorFactory.java:221)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:206)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:118)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
    at org.apache.tika.extractor.EmbeddedDocumentUtil.parseEmbedded(EmbeddedDocumentUtil.java:240)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:143)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:191)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:156)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:227)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:230)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at ai.yobix.TikaNativeMain.parseToStringWithConfig(TikaNativeMain.java:180)
    at ai.yobix.TikaNativeMain.parseBytesToString(TikaNativeMain.java:148)

nilcodes, Dec 13 '24 09:12

Thanks for reporting this issue. Is it possible to provide the file that caused it? If not, we'll find a way to replicate it.

nmammeri, Dec 13 '24 11:12
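
While a reproducing file is being tracked down, one thing that can be checked without it is whether the compiled schema resource named in the `SchemaTypeLoaderException` is visible to the class loader at all. The following is only a minimal diagnostic sketch: the class name `SchemaResourceCheck` and the helper `locate` are hypothetical, and the resource path is copied verbatim from the exception message above. It uses nothing beyond the standard `ClassLoader.getResource` call.

```java
import java.net.URL;

public class SchemaResourceCheck {
    // Resource path copied verbatim from the SchemaTypeLoaderException message.
    static final String VISIO_XSB =
        "org/apache/poi/schemas/ooxml/system/ooxml/visiodocumentd431doctype.xsb";

    // Hypothetical helper: returns the URL the resource resolves to,
    // or null when no class loader in the chain can see it.
    static URL locate(String resourcePath) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            // Fall back to the system class loader if no context loader is set.
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(resourcePath);
    }

    public static void main(String[] args) {
        URL url = locate(VISIO_XSB);
        if (url != null) {
            System.out.println("found: " + url);
        } else {
            System.out.println("missing: " + VISIO_XSB);
        }
    }
}
```

If this prints `missing` when run with the same classpath as the extractor, the Visio schema resources are probably not packaged with the build. If the extractor runs as a GraalVM native image (an assumption suggested only by the `TikaNativeMain` class name in the trace), resources generally have to be registered at image build time, so a check on the JVM classpath alone may not reflect what the native binary can see.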